WordNet Interface

1   Words

A Word is an index into the database: more specifically, a list of the Senses of the supplied word string. These senses can be accessed via index notation (word[n]) or via the word.getSenses() method.

 
>>> from nltk.wordnet import *
 
>>> N['dog']
dog (noun)
>>> N.word('dog')
dog (noun)
>>> N['dog'].pos
'noun'
>>> N['dog'].form
'dog'
>>> N['dog'].taggedSenseCount
1
>>> N['dog'].synsets()
[{noun: dog, domestic_dog, Canis_familiaris}, {noun: frump, dog}, {noun: dog}, {noun: cad, bounder, blackguard, dog, hound, heel}, {noun: frank, frankfurter, hotdog, hot_dog, dog, wiener, wienerwurst, weenie}, {noun: pawl, detent, click, dog}, {noun: andiron, firedog, dog, dog-iron}]
>>> N['dog'].isTagged()
True
>>> str(N['dog'])
'dog (noun)'

2   Synsets

Synset: a set of synonyms that share a common meaning.

Each synset contains one or more lemmas, which represent a specific sense of a specific word. Senses can be retrieved via synset.senses() or through the index notations synset[0], synset[string], or synset[word]. Synsets are related to other synsets, and a dictionary of these relations can be obtained with the relations() method.

 
>>> V['think'][0].verbFrames
(5, 9)
 
>>> V['think'][0].verbFrameStrings
['Something think something Adjective/Noun', 'Somebody think somebody']
 
>>> N['dog'][0]
{noun: dog, domestic_dog, Canis_familiaris}
 
>>> N['dog'][0].relations()
{'hypernym': [{noun: canine, canid}, {noun: domestic_animal, domesticated_animal}], 'part holonym': [{noun: flag}], 'member meronym': [{noun: Canis, genus_Canis}, {noun: pack}], 'hyponym': [{noun: puppy}, {noun: pooch, doggie, doggy, barker, bow-wow}, {noun: cur, mongrel, mutt}, {noun: lapdog}, {noun: toy_dog, toy}, {noun: hunting_dog}, {noun: working_dog}, {noun: dalmatian, coach_dog, carriage_dog}, {noun: basenji}, {noun: pug, pug-dog}, {noun: Leonberg}, {noun: Newfoundland, Newfoundland_dog}, {noun: Great_Pyrenees}, {noun: spitz}, {noun: griffon, Brussels_griffon, Belgian_griffon}, {noun: corgi, Welsh_corgi}, {noun: poodle, poodle_dog}, {noun: Mexican_hairless}]}
 
>>> N['dog'][0].relations()[HYPERNYM]
[{noun: canine, canid}, {noun: domestic_animal, domesticated_animal}]
 
>>> N['dog'][0][HYPERNYM]
[{noun: canine, canid}, {noun: domestic_animal, domesticated_animal}]
 
>>> len(N['dog'][0])
3
>>> N['cat'][6]
{noun: big_cat, cat}

hypernyms(self): Get the set of parent hypernym synsets of this synset.

closure(HYPERNYM): Get the path(s) from this synset to the root, where each path is a list of the synset nodes traversed on the way to the root.

hypernym_distances(self, distance): Get the path(s) from this synset to the root, counting the distance of each node from the initial node on the way. A list of (synset, distance) tuples is returned.

shortest_path_distance(self, other_synset): Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, -1 is returned. If a node is compared with itself 0 is returned.

getIC(self, freq_data): Get the Information Content (IC) value of this Synset, using the supplied dict 'freq_data'.
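
The following is a minimal illustrative sketch of shortest_path_distance, based only on the behaviour described above (a synset compared with itself gives 0, and -1 signals that no common ancestor exists); treat the exact calls and outputs as assumptions rather than recorded results.

 
>>> dog = N['dog'][0]
>>> dog.shortest_path_distance(dog)                     # a synset compared with itself
0
>>> V['run'][0].shortest_path_distance(V['think'][0])   # no common ancestor
-1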

3   Word Senses

WordSense: a pairing between a word and a sense.

 
>>> w = word('eat', 'v')
>>> w.senseCounts()
[61, 13, 4, None, None, None]
>>> w[2].words
['feed', 'eat']
>>> s = w[2].wordSense('eat')
>>> s
eat (verb) 3
>>> s.count()
4
>>> k = s.senseKey
>>> k
'eat%2:34:02::'
>>> WordSense(k).synset()
{verb: feed, eat}

4   Similarity

path_similarity(self, other_sense): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (this can only happen for verbs, as there are many distinct verb taxonomies), in which case -1 is returned. A score of 1 represents identity, i.e. comparing a sense with itself returns 1.

 
>>> N['poodle'][0].path_similarity(N['dalmatian'][1])
0.33333333333333331
 
>>> N['dog'][0].path_similarity(N['cat'][0])
0.20000000000000001
 
>>> V['run'][0].path_similarity(V['walk'][0])
0.25
 
>>> V['run'][0].path_similarity(V['think'][0])
-1
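
As a rough check of the dog/cat figure above: if the score is computed as 1 / (distance + 1) (an assumption about the implementation, consistent with identity scoring 1), then a score of 0.2 corresponds to a shortest path of length 4.

 
>>> distance = 4                  # hypothetical dog-cat hypernym-path distance
>>> 1.0 / (distance + 1)          # assumed scoring convention: 1 / (distance + 1)
0.20000000000000001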

lch_similarity(self, other_sense): Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d the taxonomy depth.

 
>>> N['poodle'][0].lch_similarity(N['dalmatian'][1])
2.5389738710582761
 
>>> N['dog'][0].lch_similarity(N['cat'][0])
2.0281482472922856
 
>>> V['run'][0].lch_similarity(V['walk'][0])
1.8718021769015913
 
>>> V['run'][0].lch_similarity(V['think'][0])
-1
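
Purely as illustrative arithmetic for the -log(p/2d) formula: assuming p counts nodes rather than edges (p = 5 for the dog/cat pair, one more than the path distance) and a noun-taxonomy depth of d = 19 (both values are assumptions, not taken from this document), the formula reproduces the dog/cat score above.

 
>>> import math
>>> p, d = 5, 19                  # assumed path length (in nodes) and noun-taxonomy depth
>>> -math.log(p / (2.0 * d))
2.02814824...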

wup_similarity(self, other_sense): Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Note that at this time the scores given do _not_ always agree with those given by Pedersen's Perl implementation of WordNet::Similarity.

The LCS does not necessarily lie on the shortest path connecting the two senses, since it is by definition the common ancestor deepest in the taxonomy, not the one closest to the two senses. Typically, however, it does. Where multiple candidates for the LCS exist, the one whose shortest path to the root node is longest is selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

 
>>> N['poodle'][0].wup_similarity(N['dalmatian'][1])
0.93333333333333335
 
>>> N['dog'][0].wup_similarity(N['cat'][0])
0.8571428571428571
 
>>> V['run'][0].wup_similarity(V['walk'][0])
0.5714285714285714
 
>>> V['run'][0].wup_similarity(V['think'][0])
-1
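
The Wu-Palmer score is usually formulated as 2 * depth(lcs) / (depth(s1) + depth(s2)). The sketch below uses purely hypothetical depths, chosen only to land in the same 6/7 ratio as the dog/cat score above; they are not real WordNet depths.

 
>>> lcs_depth, s1_depth, s2_depth = 12, 14, 14    # hypothetical depths
>>> 2.0 * lcs_depth / (s1_depth + s2_depth)
0.85714285...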

information_content(icfile): Information Content: Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

res_similarity(self, other_sense, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

 
>>> ic = nltk.wordnet.load_ic('ic-bnc-resnik.dat')
>>> furnace = nltk.wordnet.N['furnace'][0]
>>> stove = nltk.wordnet.N['stove'][0]
>>> furnace.res_similarity(stove, ic) 
2.52024528...

jcn_similarity(self, other_sense, ic): Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

 
>>> ic = nltk.wordnet.load_ic('ic-brown.dat')
>>> furnace.jcn_similarity(stove, ic) 
0.0640910491...

lin_similarity(self, other_sense, ic): Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

 
>>> ic = nltk.wordnet.load_ic('ic-semcor.dat')
>>> furnace.lin_similarity(stove, ic) 
0.229384690...
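
Both equations above can be exercised with made-up numbers. The IC values below are purely hypothetical and are only meant to show how the two formulas combine IC(s1), IC(s2) and IC(lcs).

 
>>> ic_s1, ic_s2, ic_lcs = 8.0, 7.5, 5.0          # hypothetical Information Content values
>>> 1.0 / (ic_s1 + ic_s2 - 2 * ic_lcs)            # Jiang-Conrath
0.18181818...
>>> 2 * ic_lcs / (ic_s1 + ic_s2)                  # Lin
0.64516129...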

word(form, pos=NOUN): Return a word with the given lexical form and pos.

sense(form, pos=NOUN, senseno=0): Lookup a sense by its sense number. Used by repr(sense).

synset(pos, offset): Lookup a synset by its offset. Used by repr(synset).
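
A minimal sketch of the word() lookup function; the outputs are assumed to match the Word reprs seen earlier rather than copied from a live session, and sense() and synset() follow the same pattern.

 
>>> word('dog')               # pos defaults to NOUN; equivalent to N['dog']
dog (noun)
>>> word('think', 'v')        # cf. word('eat', 'v') in the Word Senses section
think (verb)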

5   N, V, ADJ and ADV Dictionaries

Dictionary classes allow users to access WordNet data via a convenient dict-like notation (see below). Also defined are the low-level _IndexFile class and various file utilities, which do the actual lookups in the WordNet database files.

Dictionary: A Dictionary contains all the Words in a given part of speech. Four dictionaries, for the parts of speech N, V, ADJ, and ADV, are bound by default in __init__.py.

Indexing a dictionary by a string retrieves the word named by that string, e.g. dict['dog']. Indexing by an integer n retrieves the nth word, e.g. dict[0]. Access by an arbitrary integer is very slow except in the special case where the words are accessed sequentially; this is to support the use of dictionaries as the range of a for statement and as the sequence argument to map and filter.

 
>>> N.pos
'noun'
>>> N['dog']
dog (noun)
>>> N['inu']
Traceback (most recent call last):
   ...
KeyError: "'inu' is not in the 'noun' database"

If the index is a string, the Word whose form is that string is returned; if the index is an integer n, the n'th Word in the index file is returned.

 
>>> N['dog']
dog (noun)
>>> N[0]
'hood (noun)

get(self, key, default=None): Return the Word whose form is key, or default.

 
>>> N.get('dog')
dog (noun)
>>> N.get('inu')

has_key(self, form): Check whether the supplied argument is an index into this dictionary.

 
>>> N.has_key('dog')
True
>>> N.has_key('inu')
False

6   Regression Tests

Bug 1796793: WordNet verbFrameStrings

 
>>> w = nltk.wordnet.V['fly']
>>> s = w.synsets()[0]
>>> s.verbFrameStrings
['Something fly', 'Somebody fly', 'Something is flying PP', 'Somebody fly PP']

Bug 1940398: UnboundLocalError in wordnet.util line 426

 
>>> from nltk.wordnet import util
>>> util.getIndex('good-for-nothing')
good-for-nothing (noun)
>>> util.getIndex('good for nothing')
good-for-nothing (noun)

Bug: __hash__ method not properly implemented for Synset, Word, WordSense

 
>>> from nltk.wordnet import N
>>> word1 = N['kin']
>>> word2 = N['kin']
>>> hash(word1) == hash(word2)
True
>>> syn1 = word1.synsets()[0]
>>> syn2 = word2.synsets()[0]
>>> hash(syn1) == hash(syn2)
True
>>> sense1 = syn1.wordSenses[0]
>>> sense2 = syn2.wordSenses[0]
>>> hash(sense1) == hash(sense2)
True