Hey there! If you came here from the first NLP tutorial, then you'll fit right in. If not, then you might want to go check out Natural Language Processing with Python Part 1 before continuing here. Either way, let's get started.
Previously we learned about the nltk Python package and became familiar with some of the cool features it comes with. We left off with being able to remove stopwords from text and label words with their corresponding parts of speech. Today I want to talk about comparing words. So leggo!

This project assumes you are relatively familiar with Python. At least enough that you can set up a file directory, create a Python file, and navigate to it using the terminal. If you are not, then check out my Python "Hello World" tutorial.

  1. So let's start off with something fun: comparing words. It's weird to think that two words that seem completely different from each other actually have some sort of connection. The Wu-Palmer similarity score is derived from the hierarchical structure of an ontology like WordNet: roughly speaking, it measures how deep the most specific common ancestor of two word senses sits in the hierarchy, relative to the depths of the two senses themselves.
  2. WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. (from the WordNet Wikipedia article)
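    If you're curious what that score actually measures, here's a rough sketch of the idea in code. The formula is wup = 2 * depth(LCS) / (len1 + len2), where LCS is the lowest common hypernym (the most specific shared ancestor) of the two senses. The dog/cat synsets are just examples, and nltk's built-in method handles some extra edge cases, so treat this as an illustration rather than the exact implementation:

    from nltk.corpus import wordnet

    dog = wordnet.synset("dog.n.01")
    cat = wordnet.synset("cat.n.01")

    lcs = dog.lowest_common_hypernyms(cat)[0]     # most specific shared ancestor
    depth = lcs.max_depth() + 1                   # depth of that ancestor in the hierarchy
    len1 = dog.shortest_path_distance(lcs) + depth
    len2 = cat.shortest_path_distance(lcs) + depth

    print(lcs.name())                             # the shared ancestor
    print(2.0 * depth / (len1 + len2))            # should match dog.wup_similarity(cat)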

  3. Alright, enough with the long words already. If you are interested in the process used to get this comparison score, check out this paper. For now, though, open up a new Python file and at the top, go ahead and write:
  4. from nltk.corpus import wordnet

    This will import the WordNet database which we will pull from.
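    One quick note: if this is your first time using WordNet with nltk, you may need to download the WordNet data first (a one-time step):

    import nltk
    nltk.download("wordnet")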

  5. Cool. So first of all, we need some words to compare. How about we let the user enter two words? Sound good? Good, I'm glad we all agree. Let's get some input:
  6. word1 = input("Enter a word: ")
    word2 = input("Enter a word to compare it with: ")
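    A small optional tweak, by the way: stripping stray whitespace from the input keeps the WordNet lookup from choking later on if someone types something like " dog ":

    word1 = input("Enter a word: ").strip()
    word2 = input("Enter a word to compare it with: ").strip()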
  7. The type of word does matter. In this project we will only be comparing nouns to nouns, or verbs to verbs, and so on. It is possible to compare across parts of speech, but we'll save that for another day. Now type:
  8. input_word_1 = word1 + ".n.01"
    input_word_2 = word2 + ".n.01"

    The 'n' tells WordNet to look each word up as a noun. The '.01' means we want the first synset for that word, i.e. its most common sense.
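    Synset names follow the pattern word.pos.sense_number, so other parts of speech work the same way. A quick illustration (the words here are just examples, not part of our program):

    # "dog.n.01" = first noun sense of "dog"
    # "run.v.01" = first verb sense of "run"
    print(wordnet.synset("dog.n.01").definition())
    print(wordnet.synset("run.v.01").definition())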

  9. Synsets, or synonym sets, are groups of words that share a common meaning. Looking a word up as a synset pins down one specific sense of it, which is what makes comparing words possible. So now we need to create two synsets, one for each word we want to compare. Type this:
  10. w1 = wordnet.synset(input_word_1)
    w2 = wordnet.synset(input_word_2)
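    By the way, if you ever want to see every sense WordNet has for a word, wordnet.synsets() lists them all. For example:

    for syn in wordnet.synsets("program"):
        print(syn.name(), "-", syn.definition())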
  11. Almost done, now we just need to calculate the score using the Wu-Palmer Similarity function. Luckily, the nltk package has a method built in that lets us do just that. Now we need to call that method:
  12. sim = w1.wup_similarity(w2) * 100
  13. Awesome! Let's print out the result and 'let-er-rip':
  14. print(("A " + word1 + " is {:3.2f} percent similar to a " + word2).format(sim))
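    If you're on Python 3.6 or newer, an f-string does the same thing and reads a little cleaner:

    print(f"A {word1} is {sim:3.2f} percent similar to a {word2}")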
  15. The output should look something like this:

    A lettuce is 25.00 percent similar to a paint

    The results can be pretty funny. I mean, who knew the word 'lettuce' was 25% similar to the word 'paint'? lol
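    One thing to watch out for: if you enter a word WordNet doesn't know, wordnet.synset() raises a WordNetError. Here's a minimal sketch of how you could guard against that (the error message is just an example):

    from nltk.corpus.reader.wordnet import WordNetError

    try:
        w1 = wordnet.synset(input_word_1)
        w2 = wordnet.synset(input_word_2)
    except WordNetError:
        print("Sorry, WordNet doesn't know one of those words as a noun.")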
  16. Here’s the full code, TLDR:
  17. from nltk.corpus import wordnet


    # get two nouns from the user
    word1 = input("Enter a noun: ")
    word2 = input("Enter a noun to compare it with: ")

    # build synset names: the first noun sense of each word
    input_word_1 = word1 + ".n.01"
    input_word_2 = word2 + ".n.01"

    # look up the two synsets in WordNet
    w1 = wordnet.synset(input_word_1)
    w2 = wordnet.synset(input_word_2)

    # Wu-Palmer similarity, scaled to a percentage
    sim = w1.wup_similarity(w2) * 100
    print(("A " + word1 + " is {:3.2f} percent similar to a " + word2).format(sim))