Hey there! If you came here from the first NLP tutorial, then you’ll fit right in. If not, you might want to go check out Natural Language Processing with Python Part 1 before continuing here. Either way, let’s get started.
Previously we learned about the nltk Python package and became familiar with some of the cool features it comes with. We left off being able to remove stopwords from text and label words with their corresponding parts of speech. Today I want to talk about comparing words. So leggo…
This project assumes you are relatively familiar with Python. At least enough that you can set up a directory, create a Python file, and navigate to it using the terminal. If you are not, then check out my Python “Hello World” tutorial.
So let’s start off with something fun: comparing words. It’s weird to think that two words that seem completely different from each other can actually have some sort of connection. The Wu-Palmer similarity score measures that connection. It is derived from the hierarchical structure of an ontology like WordNet, by comparing how deep the two words’ senses sit in that hierarchy relative to their most specific common ancestor (we’ll peek at the actual calculation a little later).

“WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus.” (WordNet, Wikipedia)
Alright, enough with the long words already. If you are interested in the process used to get this comparison score, check out this paper. For now though, open up a new Python file and at the top, go ahead and write:

from nltk.corpus import wordnet

This will import the WordNet database which we will pull from.
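One quick caveat: if you have never used WordNet with nltk on this machine (say, you skipped Part 1), you may need to download the WordNet data first. A one-time setup, roughly:

import nltk
nltk.download('wordnet')  # one-time download of the WordNet data; safe to re-run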
Cool. So I guess first of all, we need some words to compare. How about we let the user enter two words? Sound good? Good, I’m glad we all agree. Let’s get some input:

word1 = input("Enter a word: ")
word2 = input("Enter a word to compare it with: ")
The type of word does matter. In this project we will only be comparing nouns to nouns, or verbs to verbs, and so on. It is possible to compare different types, but we’ll save that for another day. Now type:

input_word_1 = word1 + ".n.01"
input_word_2 = word2 + ".n.01"

The ‘n’ tells the program that we want to compare two nouns. The ‘.01’ means we want the first sense of that word in WordNet.
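If you are curious where names like that come from, you can list every entry WordNet has for a word (it organizes them into “synsets”; more on those in a second). A quick sketch, using “program” as an example word:

from nltk.corpus import wordnet

# Print every sense WordNet knows for the word 'program'
for s in wordnet.synsets("program"):
    print(s.name(), "-", s.definition())

The entry named program.n.01 is exactly the one our ‘.n.01’ suffix would pick out.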
Synsets, or synonym sets, are sets of synonyms that share a common meaning. They give us a more general representation of a word’s meaning, which makes comparing words easier. So now we need to create two synsets, one for each word we want to compare. Type this:

w1 = wordnet.synset(input_word_1)
w2 = wordnet.synset(input_word_2)
Almost done. Now we just need to calculate the score using the Wu-Palmer similarity function. Luckily, the nltk package has a method built in that lets us do just that. Now we need to call that method (multiplying by 100 turns the 0-to-1 score into a percentage):

sim = w1.wup_similarity(w2) * 100
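If you are wondering what wup_similarity actually does under the hood, here is a rough sketch of the calculation built from nltk’s own helper methods. ‘dog’ and ‘cat’ are just example words, and the real method handles a few edge cases this skips:

from nltk.corpus import wordnet

a = wordnet.synset("dog.n.01")
b = wordnet.synset("cat.n.01")

# The most specific ancestor the two senses share in the WordNet hierarchy
lcs = a.lowest_common_hypernyms(b)[0]
depth = lcs.max_depth() + 1  # depth of that ancestor, counting the root as 1

# Wu-Palmer: 2 * depth(lcs) / (dist(a, lcs) + dist(b, lcs) + 2 * depth(lcs))
score = 2.0 * depth / (a.shortest_path_distance(lcs) + b.shortest_path_distance(lcs) + 2.0 * depth)

print(score)                # hand-rolled version
print(a.wup_similarity(b))  # nltk's built-in; the two should agree for most noun pairs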
Awesome! Let’s print out the result and ‘let-er-rip’:

print(("A " + word1 + " is {:3.2f} percent similar to a " + word2).format(sim))
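As a side note, building the message by concatenation works, but the same line reads a little cleaner if all three values go through format placeholders. An equivalent alternative, if you prefer it:

print("A {} is {:3.2f} percent similar to a {}".format(word1, sim, word2))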
Run it and see what you get. The results can be pretty funny. I mean, who knew the word ‘lettuce’ was about 25% similar to the word ‘paint’? lol
Here’s the full code, TL;DR:
from nltk.corpus import wordnet

word1 = input("Enter a noun: ")
word2 = input("Enter a noun to compare it with: ")

input_word_1 = word1 + ".n.01"
input_word_2 = word2 + ".n.01"

w1 = wordnet.synset(input_word_1)
w2 = wordnet.synset(input_word_2)

sim = w1.wup_similarity(w2) * 100

print(("A " + word1 + " is {:3.2f} percent similar to a " + word2).format(sim))