To humans, the English language seems simple and straightforward. Computers, on the other hand, are like.. dude, what’s that? Gimme some 1’s and 0’s and we’re good. In Python, there are some very powerful packages that let you easily process human languages. Doing this project has really made me appreciate the English language; that said, it can be quite inefficient. You might be surprised at how few words you actually need to convey the exact same amount of information. In this tutorial I will introduce you to the ‘nltk’ Python package and show you some of its features. Let’s get started.

  1. First off, we’re going to need to make sure we’re all using Python 3.6 or 3.7. It will work with 2.7, but who uses Python 2 anymore, amiright? Anyways, for the sake of this project, I’m going to assume everyone is on Python 3.7. You can check which version you’re running by opening up the terminal, typing python3, and hitting enter. If you only have Python 2, you’ll get an error and will need to go download 3.7 from the Python website. If you do have Python 3, it should look something like this:
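
    -If you’d rather check from inside Python itself, this quick two-liner works too:

    import sys
    print(sys.version)   #should start with 3.6 or 3.7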
  2. Excellent, now let’s pip install the nltk package. So type: sudo pip install -U nltk and let it install. For this project we won’t need numpy, but it is very useful when dealing with nltk data, so I’d recommend grabbing it anyway: sudo pip install -U numpy
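
    -One more install step: nltk keeps its tokenizer models and word lists as separate downloads, so grab the two pieces this tutorial uses from a Python shell:

    import nltk
    nltk.download('punkt')       #models used by word_tokenize and sent_tokenize
    nltk.download('stopwords')   #the stopword lists we use later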
  3. All rightey then. Open up your favorite IDE; as always, I recommend Visual Studio Code, but I also like Atom.
  4. Before you create a new file, go ahead and make a folder that you can keep all of the nltk project files in. Name it something like… ‘nltk’ lol.
  5. Now, inside the new folder create a blank Python file remembering to end it with a ‘.py’ extension.
  6. At the top of the file write:
  7. from nltk.tokenize import sent_tokenize, word_tokenize 
    from nltk.corpus import stopwords
    import nltk
  8. Save and run that just to make sure it’s all working as expected. You should see a ‘DeprecationWarning,’ but don’t worry about it. It should look like this:
  9. Sweet, now let’s write some sample text and try tokenizing it, just to get our feet wet. Type:
  10. test_text = "Hello, my name is Kyle, what is yours? Do you like to eat glue too? I hear Amazon stock is about to go up, buy now!"
    
    tokenized_words = word_tokenize(test_text)
    tokenized_sentence = sent_tokenize(test_text)
    print(tokenized_words)
    print(tokenized_sentence)

    -Tokenizing is the process of breaking long strings of text down into smaller parts, or ‘tokens.’ You can break text down into paragraphs, sentences, words, or characters; it just depends on what you’re doing with it.

  11. If you run the code, it should look like this:
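    -Roughly, something like this (exact output can vary a little between nltk versions):

    ['Hello', ',', 'my', 'name', 'is', 'Kyle', ',', 'what', 'is', 'yours', '?', 'Do', 'you', 'like', 'to', 'eat', 'glue', 'too', '?', 'I', 'hear', 'Amazon', 'stock', 'is', 'about', 'to', 'go', 'up', ',', 'buy', 'now', '!']
    ['Hello, my name is Kyle, what is yours?', 'Do you like to eat glue too?', 'I hear Amazon stock is about to go up, buy now!']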
  12. -You can see how nltk breaks down the text. With word-tokenizing, it usually splits text on spaces and punctuation. With sentence-tokenizing, it splits on ‘end-of-sentence’ punctuation (., ?, !).

  13. That’s pretty cool.. but I think we can do better. Nltk can categorize each word and label it with its appropriate grammatical type, a technique called part-of-speech tagging. This is very handy when you’re sifting through lots of text to count how many times certain topics get mentioned, say, to gauge popularity. Let’s categorize our sample text:
  14. #I'm going to put this in a function but you don't have to
    def categorize_text(text):
        txt = word_tokenize(text)
        tags = nltk.pos_tag(txt)
        return tags
    
    tagged = categorize_text(test_text)
    print(tagged)

    Here are a few of the tags:
    CC: conjunction, coordinating
    CD: numeral, cardinal
    IN: preposition or conjunction, subordinating
    JJ: adjective or numeral, ordinal
    JJR: adjective, comparative
    NN: noun, common, singular or mass
    NNP: noun, proper, singular
    NNS: noun, common, plural
    PRP: pronoun, personal
    PRP$: pronoun, possessive
    RB: adverb
    RBS: adverb, superlative
    UH: interjection
    VB: verb, base form
    WP: WH-pronoun
    WRB: Wh-adverb
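
    -If you want the full tag set with definitions and examples, nltk can print it for you (you may need the ‘tagsets’ data, a one-time download):

    nltk.download('tagsets')
    nltk.help.upenn_tagset()        #print every tag
    nltk.help.upenn_tagset('NN')    #or look up a single tag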

  15. When you run the code now, you will see each word in the sample text with a corresponding tag next to it. It should look like this:
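    -Roughly like this (again, the exact tags can vary by nltk version):

    [('Hello', 'NNP'), (',', ','), ('my', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Kyle', 'NNP'), ...]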
  16. How cool is that?! Yea, pretty cool, I know. Ok, moving on. Now that we have these new tools in our tool belt, let’s actually put them to use on a real text file. How about the ~1200-page book ‘War and Peace’? Sounds good to me. Click the War and Peace text link to view the .txt file. When you get to the page, right click, choose ‘save page as,’ give it a name, and save it in the same folder as your project file.
  17. Now in the project file we need to open that text file and store its contents in a variable. So type this:
  18. def get_words_from_file(file):
        try:
            f = open(file, "r")
            #build a list of lists: one inner list of words per line
            #(one-liner version: f_words = [line.split(" ") for line in f.readlines()])
            f_words = []
            for line in f.readlines():
                f_words.append(line.split(" "))
            f.close()
            return f_words
        except Exception:
            print("no file")
    
    file_words = get_words_from_file("/Users/kylesupple/Desktop/Programming/nltk_code/proj_2_nltk/warAndPeace_copy.txt")
    print(file_words)
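
    -Heads up: file_words is a list of lists, one inner list per line of the file. And since we only split on spaces, newlines and punctuation stay attached to the words (a line like "Well, Prince.\n" becomes ['Well,', 'Prince.\n']). That’s fine for the rough counts we’re doing here.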
  19. Alright, now I’m going to talk about something called ‘stopwords.’ These are extremely common words, like ‘the,’ ‘is,’ and ‘my,’ that carry very little meaning on their own, so they’re usually filtered out before a computer analyzes the text. Nltk ships with a ready-made list of them for English.
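    -You can print the list yourself:

    from nltk.corpus import stopwords
    print(sorted(stopwords.words("english")))
    #['a', 'about', 'above', 'after', 'again', 'against', ...]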
  20. Let’s test this out on our small sample text before we move on to the full .txt file. To do this I’m going to use two separate functions: one to count the total number of words in the text, and one to count the words left over after the stopwords have been removed.
  21. #load the English stopword list into a set for fast lookups
    stop_words = set(stopwords.words("english"))
    
    def get_filtered_words(text):
        filt_words = []
        for line in text:
            for w in line:
                if w not in stop_words:
                    filt_words.append(w)
        return filt_words
    
    def get_unfiltered_words(text):
        un_filt_words = []
        for line in text:
            for w in line:
                un_filt_words.append(w)
        return un_filt_words
    
    #the functions expect a list of lines, so split the sample text the
    #same way get_words_from_file() splits the file
    sample_lines = [test_text.split(" ")]
    print("Unfiltered: " + str(len(get_unfiltered_words(sample_lines))))
    print("Filtered: " + str(len(get_filtered_words(sample_lines))))
  22. With this we can compare the full text of War and Peace against the filtered version, which has far fewer words but should still convey most of the same information. Now we just swap the ‘file_words’ we imported earlier in for the sample text:
  23. print("Unfiltered: " + str(len(get_unfiltered_words(file_words))))
    print("Filtered: " + str(len(get_filtered_words(file_words))))
  24. When you run this it should look similar to this:
  25. Pretty cool, huh? That cut the size down by about a third, crazy! Anyways, that’s pretty much it for this tutorial. I’m planning on putting out another one that will show you a visual map of the words. Keep an eye out for it.
  26. Here’s the full code. TLDR:
  27. from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    import nltk
    #first run only: nltk.download('punkt') and nltk.download('stopwords')
    
    
    test_text = "Hello, my name is Kyle, what is yours? Do you like to eat glue too? I hear Amazon stock is about to go up, buy now!"
    
    #tokenize by word
    tokenized_words = word_tokenize(test_text)
    #tokenize by sentence
    tokenized_sentence = sent_tokenize(test_text)
    #import stopwords
    stop_words = set(stopwords.words("english"))
    
    #import a text file as a list of lists (one inner list of words per line)
    def get_words_from_file(file):
        try:
            f = open(file, "r")
            f_words = []
            for line in f.readlines():
                f_words.append(line.split(" "))
            f.close()
            return f_words
        except Exception:
            print("no file")
    
    #return the words and their types
    def categorize_text(text):
        txt = word_tokenize(text)
        tags = nltk.pos_tag(txt)
        return tags
    
    #return text with stopwords removed
    def get_filtered_words(text):
        filt_words = []
        for line in text:
            for w in line:
                if w not in stop_words:
                    filt_words.append(w)
        return filt_words
    
    #return all words, stopwords included
    def get_unfiltered_words(text):
        un_filt_words = []
        for line in text:
            for w in line:
                un_filt_words.append(w)
        return un_filt_words
    
    
    file_words = get_words_from_file("/Users/kylesupple/Desktop/Programming/nltk_code/proj_2_nltk/warAndPeace_copy.txt")
    
    print("Unfiltered: " + str(len(get_unfiltered_words(file_words))))
    print("Filtered: " + str(len(get_filtered_words(file_words))))