To humans, the English language seems simple and straightforward. Computers, on the other hand, are more like: dude, what’s that? Gimme some 1s and 0s and we’re good. In Python, there are some very powerful packages that let you easily process human languages. Doing this project has really made me appreciate the English language, though it can be quite inefficient. You might be surprised at how few words you actually need to convey the exact same amount of information. In this tutorial I will introduce you to the ‘nltk’ Python package and show you some of its features. Let’s get started.
- First off, we need to make sure we’re all using Python 3.6 or 3.7. It will work with 2.7, but who uses Python 2 anymore, amiright? Anyways, for the sake of this project, I’m going to assume everyone is on Python 3.7. You can check which version you are running by opening up the terminal and typing
python3
and hitting enter. If you only have Python 2, you will get an error and you will need to go download 3.7 from the Python website. If you do have Python 3, it should look something like this: - Excellent, now let’s pip install the nltk package. So type:
sudo pip3 install -U nltk
and let it install. For this project we won’t need numpy, but it is very useful when dealing with nltk data, so I would recommend getting it anyway:
sudo pip3 install -U numpy
- Alrighty then. Open up your favorite IDE; as always, I recommend Visual Studio Code, but I also like Atom.
- Before you create a new file, go ahead and make a folder that you can keep all of the nltk project files in. Name it something like… ‘nltk’ lol.
- Now, inside the new folder create a blank Python file remembering to end it with a ‘.py’ extension.
- At the top of the file write:
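from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import nltk
- One heads-up: nltk keeps its tokenizer models and word lists in separate data packages, so if any of the later snippets complain about a missing resource, run these one-time downloads (‘punkt’ for tokenizing, ‘stopwords’ for filtering, ‘averaged_perceptron_tagger’ for tagging):
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')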
- Save and run that just to make sure it’s all working as expected. You may see a ‘DeprecationWarning,’ but don’t worry about it. It should look like this:
- Sweet, now let’s write some sample text and try tokenizing it, just to get our feet wet. Type:
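test_text = "Hello, my name is Kyle, what is yours? Do you like to eat glue too? I hear Amazon stock is about to go up, buy now!"
#tokenize by word
tokenized_words = word_tokenize(test_text)
#tokenize by sentence
tokenized_sentence = sent_tokenize(test_text)
print(tokenized_words)
print(tokenized_sentence)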
- If you run the code, it should look like this:
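- Tokenizing is the process of breaking long strings of text down into smaller parts, or ‘tokens.’ You can break text down into paragraphs, sentences, words, or characters. It just depends on what you are doing with it.
- You can see how nltk breaks down the text. With word-tokenizing, it usually splits text on spaces and punctuation. With sentence-tokenizing, it uses ‘end-of-sentence’ punctuation (., ?, !).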
- That’s pretty cool… but I think we can do better. Nltk can tag each word with its appropriate grammatical type (its part of speech). This is very handy when you are sifting through lots of text, for example counting how many times certain topics are mentioned to gauge their popularity. Let’s categorize our sample text:
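#I'm going to put this in a function but you don't have to
def categorize_text(text):
    txt = word_tokenize(text)
    tags = nltk.pos_tag(txt)
    return tags

tagged = categorize_text(test_text)
print(tagged)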
- When you run the code now, you will see each word in the sample text with a corresponding tag next to it. It should look like this:
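- Here are a few of the tags:
CC: conjunction, coordinating
CD: numeral, cardinal
IN: preposition or conjunction, subordinating
JJ: adjective or numeral, ordinal
JJR: adjective, comparative
NN: noun, common, singular or mass
NNP: noun, proper, singular
NNS: noun, common, plural
PRP: pronoun, personal
PRP$: pronoun, possessive
RB: adverb
RBS: adverb, superlative
UH: interjection
VB: verb, base form
WP: WH-pronoun
WRB: Wh-adverb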
- How cool is that?! Yea, pretty cool, I know. Ok, moving on. Now that we have these new tools in our tool belt, let’s actually put them to use on a real text file. How about the ~1200-page book ‘War and Peace’? Sounds good to me. Click War and Peace text to view the .txt file. When you get to the page, right click, choose ‘Save Page As,’ give it a name, and save it in the same folder that your project file is located in.
- Now in our project file we need to read the file in and store its contents in a variable. So type this:
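def get_words_from_file(file):
    try:
        f = open(file, "r")
        f_words = []
        #each line of the file becomes a list of the words on that line
        for line in f.readlines():
            f_words.append(line.split(" "))
        f.close()
        return f_words
    except Exception:
        print("no file")

#use the path to wherever you saved your copy of the book
file_words = get_words_from_file("/Users/kylesupple/Desktop/Programming/nltk_code/proj_2_nltk/warAndPeace_copy.txt")
print(file_words)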
- Alright, now I’m going to talk about something called ‘stopwords.’ These are extremely common words, like ‘the,’ ‘is,’ and ‘at,’ that carry very little meaning on their own, so they are usually filtered out before a text is analyzed. You can print out the full list of words nltk categorizes as stopwords:
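#this imports the list of stopwords we can reference from
stop_words = set(stopwords.words("english"))
print(stop_words)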
- Let’s test this out on our small sample text before we move on to the full .txt file. To do this I’m going to use two separate functions: one to count the total number of words in a text, and one to count the words left over after the stopwords have been removed.
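#return text with the stopwords removed
def get_filtered_words(text):
    filt_words = []
    for line in text:
        for w in line:
            #note: this check is case-sensitive, so 'The' doesn't count as a stopword
            if w not in stop_words:
                filt_words.append(w)
    return filt_words

#return every word, unfiltered
def get_unfiltered_words(text):
    un_filt_words = []
    for line in text:
        for w in line:
            un_filt_words.append(w)
    return un_filt_words

#both functions expect a list of lists of words, so wrap the tokenized sample in a list
print("Unfiltered: " + str(len(get_unfiltered_words([word_tokenize(test_text)]))))
print("Filtered: " + str(len(get_filtered_words([word_tokenize(test_text)]))))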
- With this we can compare the entire text of War and Peace to the filtered text, which has far fewer words but should still convey roughly the same amount of information. So now we just replace the ‘test_text’ with the ‘file_words’ we imported earlier:
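print("Unfiltered: " + str(len(get_unfiltered_words(file_words))))
print("Filtered: " + str(len(get_filtered_words(file_words))))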
- When you run this it should look similar to this:
- Pretty cool, huh? That cut the size down by about a third. Crazy! Anyways, that’s pretty much it for this tutorial. I’m planning on putting out another one that will show you a visual map of the words. Keep an eye out for it.
- Here’s the full code. TLDR:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import nltk
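#first-time setup: uncomment these one-time downloads if nltk complains about missing data
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')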
test_text = "Hello, my name is Kyle, what is yours? Do you like to eat glue too? I hear Amazon stock is about to go up, buy now!"
#tokenize by word
tokenized_words = word_tokenize(test_text)
#tokenize by sentence
tokenized_sentence = sent_tokenize(test_text)
#import stopwords
stop_words = set(stopwords.words("english"))
#function to import a text file
def get_words_from_file(file):
    try:
        f = open(file, "r")
        f_words = []
        #each line of the file becomes a list of the words on that line
        for line in f.readlines():
            f_words.append(line.split(" "))
        f.close()
        return f_words
    except Exception:
        print("no file")
#return the words and their types
def categorize_text(text):
    txt = word_tokenize(text)
    tags = nltk.pos_tag(txt)
    return tags
#return text with stopwords removed
def get_filtered_words(text):
    filt_words = []
    for line in text:
        for w in line:
            #note: this check is case-sensitive, so 'The' doesn't count as a stopword
            if w not in stop_words:
                filt_words.append(w)
    return filt_words
#return every word, unfiltered
def get_unfiltered_words(text):
    un_filt_words = []
    for line in text:
        for w in line:
            un_filt_words.append(w)
    return un_filt_words
#use the path to wherever you saved your copy of the book
file_words = get_words_from_file("/Users/kylesupple/Desktop/Programming/nltk_code/proj_2_nltk/warAndPeace_copy.txt")
print("Unfiltered: " + str(len(get_unfiltered_words(file_words))))
print("Filtered: " + str(len(get_filtered_words(file_words))))