Module nltk.wordnet.similarity

Functions

path_similarity(synset1, synset2, verbose=False)
Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.

lch_similarity(synset1, synset2, verbose=False)
Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.

wup_similarity(synset1, synset2, verbose=False)
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

res_similarity(synset1, synset2, ic, verbose=False)
Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

jcn_similarity(synset1, synset2, ic, verbose=False)
Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

lin_similarity(synset1, synset2, ic, verbose=False)
Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

common_hypernyms(synset1, synset2)
Find all synsets that are hypernyms of both input synsets.

_lcs_by_depth(synset1, synset2, verbose=False)
Finds the least common subsumer of two synsets in a Wordnet taxonomy, where the least common subsumer is defined as the ancestor node common to both input synsets whose shortest path to the root node is the longest.

_lcs_ic(synset1, synset2, ic, verbose=False)
Get the information content of the least common subsumer that has the highest information content value.

information_content(synset, ic)

load_ic(icfile)
Load an information content file from the wordnet_ic corpus and return a dictionary.

_get_pos(field)

Variables

abbreviations = 'adverb adv adv. r'

pos = 'adv'

token = 'r'

tokens = ['adverb', 'adv', 'adv.', 'r']

Function Details

path_similarity(synset1, synset2, verbose=False)

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except when no connecting path can be found, in which case -1 is returned. This can happen only for verbs, because there are many distinct verb taxonomies. A score of 1 represents identity, i.e. comparing a sense with itself returns 1.

Parameters:

synset1 (Synset) - The first Synset being compared.
synset2 (Synset) - The Synset that synset1 is being compared to.

Returns:

A score denoting the similarity of the two Synsets, normally between 0 and 1. -1 is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
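
The behaviour described above can be sketched on a toy taxonomy. This is not the module's implementation: the taxonomy data and the helper names are hypothetical, and the 1 / (p + 1) scoring, with p counted in edges, is an assumption chosen so that identity yields 1 and longer paths approach 0.

```python
from collections import deque

# Toy is-a taxonomy (hypothetical data): node -> list of direct hypernyms.
HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
    "run": ["move"],  # a separate, unconnected taxonomy
    "move": [],
}


def shortest_path_length(a, b):
    """Return the number of edges on the shortest path from a to b, or None."""
    # Treat hypernym links as undirected so paths can go up and then down.
    neighbours = {node: set(parents) for node, parents in HYPERNYMS.items()}
    for node, parents in HYPERNYMS.items():
        for parent in parents:
            neighbours[parent].add(node)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def toy_path_similarity(a, b):
    """Assumed scoring: 1 / (p + 1), or -1 when no path exists."""
    dist = shortest_path_length(a, b)
    return -1 if dist is None else 1.0 / (dist + 1)
```

In this sketch "dog" and "cat" are four edges apart, while "dog" and "run" live in unconnected taxonomies and therefore score -1.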

lch_similarity(synset1, synset2, verbose=False)

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p / (2 * d)), where p is the shortest path length and d is the taxonomy depth.

Parameters:

synset1 (Synset) - The first Synset being compared.
synset2 (Synset) - The Synset that synset1 is being compared to.

Returns:

A score denoting the similarity of the two Synsets, normally greater than 0. -1 is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
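
As a worked illustration of the formula, the sketch below evaluates -log(p / (2 * d)) directly. The function name is hypothetical. It assumes p is counted in nodes, so a sense compared with itself has p = 1 and receives the finite maximum log(2 * d); how the module itself counts p is not stated here.

```python
import math


def toy_lch_similarity(path_nodes, taxonomy_depth):
    """Evaluate -log(p / (2 * d)), or return -1 when no path was found.

    path_nodes: shortest path length counted in nodes (assumption), or None.
    taxonomy_depth: maximum depth d of the taxonomy containing both senses.
    """
    if path_nodes is None:
        return -1
    return -math.log(path_nodes / (2.0 * taxonomy_depth))
```

Under these assumptions the maximum score depends only on d, which is why it differs between taxonomies of different depth.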

wup_similarity(synset1, synset2, verbose=False)

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Note that at this time the scores given do _not_ always agree with those given by Pedersen's Perl implementation of Wordnet Similarity.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:

synset1 (Synset) - The first Synset being compared.
synset2 (Synset) - The Synset that synset1 is being compared to.

Returns:

A float score denoting the similarity of the two Synsets, normally greater than zero. If no connecting path between the two senses can be found, -1 is returned.
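
The text above does not spell out the formula. A minimal sketch, assuming the commonly cited Wu-Palmer form 2 * depth(lcs) / (depth(s1) + depth(s2)) with the root at depth 1, is shown below on a hypothetical single-parent taxonomy. Given the note about Pedersen's implementation, the module's scores should not be expected to match this sketch exactly.

```python
# Toy single-parent taxonomy (hypothetical data): node -> hypernym or None.
PARENT = {
    "dog": "canine",
    "cat": "feline",
    "canine": "carnivore",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": None,
    "run": "move",  # a separate taxonomy with its own root
    "move": None,
}


def path_to_root(node):
    """Return [node, hypernym, ..., root]."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so a root has depth 1 (assumption)."""
    return len(path_to_root(node))


def toy_wup_similarity(a, b):
    ancestors_b = set(path_to_root(b))
    common = [n for n in path_to_root(a) if n in ancestors_b]
    if not common:
        return -1  # no shared ancestor, hence no connecting path
    lcs = max(common, key=depth)  # deepest common ancestor
    return 2.0 * depth(lcs) / (depth(a) + depth(b))
```

Here "dog" and "cat" both sit at depth 4 and share "carnivore" at depth 2, giving 2 * 2 / (4 + 4).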

res_similarity(synset1, synset2, ic, verbose=False)

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:

synset1 (Synset) - The first synset being compared
synset2 (Synset) - The second synset being compared
ic (dict) - an information content object (as returned by load_ic()).

Returns:

A float score denoting the similarity of the two Synsets. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N['dog'][0] and N['table'][0]). If no path exists between the two synsets a score of -1 is returned.
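
The root-node case can be illustrated with a small sketch. It assumes the usual definition IC(c) = -log(P(c)), where P(c) is the relative frequency of concept c (including its descendants) in a corpus. The counts, taxonomy and helper names are hypothetical and do not come from any wordnet_ic file.

```python
import math

# Hypothetical cumulative counts: each count includes the node's descendants.
COUNTS = {"entity": 1000, "animal": 200, "dog": 50, "cat": 30, "table": 20,
          "move": 300, "run": 100}
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity",
          "table": "entity", "entity": None, "run": "move", "move": None}


def with_ancestors(concept):
    """Return the concept followed by its hypernyms up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def toy_information_content(concept):
    """IC(c) = -log(count(c) / count(root of c's taxonomy))."""
    root = with_ancestors(concept)[-1]
    return -math.log(COUNTS[concept] / COUNTS[root])


def toy_res_similarity(a, b):
    common = set(with_ancestors(a)) & set(with_ancestors(b))
    if not common:
        return -1  # the senses share no ancestor
    # Score is the IC of the most informative common subsumer.
    return max(toy_information_content(c) for c in common)
```

Because the root covers every occurrence, its probability is 1 and its IC is 0, so "dog" and "table" score 0 in this toy data.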

jcn_similarity(synset1, synset2, ic, verbose=False)

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:

synset1 (Synset) - The first synset being compared
synset2 (Synset) - The second synset being compared
ic (dict) - an information content object (as returned by load_ic()).

Returns:

A float score denoting the similarity of the two Synsets. If no path exists between the two synsets a score of -1 is returned.
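
A direct evaluation of the equation above, given three precomputed IC values, might look like the following. The function name is hypothetical, and returning infinity when the denominator is zero (for example, when a sense is compared with itself) is an assumption, since the text does not say how that case is handled.

```python
def toy_jcn_similarity(ic1, ic2, ic_lcs):
    """Evaluate 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."""
    distance = ic1 + ic2 - 2.0 * ic_lcs
    if distance == 0:
        # Assumption: treat zero distance as unbounded similarity.
        return float("inf")
    return 1.0 / distance
```

The denominator is the Jiang-Conrath distance, so the score grows as the two senses get closer to their common subsumer in information content.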

lin_similarity(synset1, synset2, ic, verbose=False)

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:

synset1 (Synset) - The first synset being compared
synset2 (Synset) - The second synset being compared
ic (dict) - an information content object (as returned by load_ic()).

Returns:

A float score denoting the similarity of the two Synsets, in the range 0 to 1. If no path exists between the two synsets a score of -1 is returned.
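
The 0 to 1 range follows from the equation: a subsumer is never more informative than the senses it subsumes, so IC(lcs) is at most min(IC(s1), IC(s2)). The sketch below evaluates the formula from precomputed IC values. The function name is hypothetical, and returning 0 when both inputs have zero IC is an assumption.

```python
def toy_lin_similarity(ic1, ic2, ic_lcs):
    """Evaluate 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    total = ic1 + ic2
    if total == 0:
        # Assumption: two uninformative senses are scored 0.
        return 0.0
    return 2.0 * ic_lcs / total
```

A sense compared with itself has IC(lcs) = IC(s1) = IC(s2), which gives the maximum score of 1.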

common_hypernyms(synset1, synset2)

Find all synsets that are hypernyms of both input synsets.

Parameters:

synset1 (Synset) - First input synset.
synset2 (Synset) - Second input synset.

Returns:

The synsets that are hypernyms of both synset1 and synset2.
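
Conceptually this is the intersection of two hypernym closures. A minimal sketch on a hypothetical taxonomy with multiple inheritance follows. Whether the input synsets themselves count as candidates is not stated above, so this sketch excludes them.

```python
# Toy taxonomy (hypothetical data): node -> list of direct hypernyms.
HYPERNYMS = {
    "dog": ["canine", "pet"],
    "cat": ["feline", "pet"],
    "canine": ["animal"],
    "feline": ["animal"],
    "pet": ["animal"],
    "animal": [],
}


def all_hypernyms(node):
    """Return every direct or indirect hypernym of node."""
    found = set()
    stack = list(HYPERNYMS[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(HYPERNYMS[current])
    return found


def toy_common_hypernyms(a, b):
    """Hypernyms shared by both nodes."""
    return all_hypernyms(a) & all_hypernyms(b)
```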

_lcs_by_depth(synset1, synset2, verbose=False)

Finds the least common subsumer of two synsets in a Wordnet taxonomy, where the least common subsumer is defined as the ancestor node common to both input synsets whose shortest path to the root node is the longest.

Parameters:

synset1 (Synset) - First input synset.
synset2 (Synset) - Second input synset.

Returns:

The ancestor synset common to both input synsets which is also the LCS.
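
The selection rule can be sketched as follows: collect the common ancestors, measure each one's shortest path to the root, and keep the ancestor for which that shortest path is longest. The taxonomy and helper names are hypothetical, and tie-breaking between equally deep candidates is left unspecified.

```python
# Toy taxonomy (hypothetical data): node -> list of direct hypernyms.
HYPERNYMS = {
    "dog": ["canine", "pet"],
    "cat": ["feline", "pet"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
    "pet": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def min_depth(node):
    """Length in edges of the shortest path from node to a root."""
    parents = HYPERNYMS[node]
    if not parents:
        return 0
    return 1 + min(min_depth(p) for p in parents)


def ancestors(node):
    """Return every direct or indirect hypernym of node."""
    found = set()
    stack = list(HYPERNYMS[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(HYPERNYMS[current])
    return found


def toy_lcs_by_depth(a, b):
    """Common ancestor whose shortest path to the root is the longest."""
    common = ancestors(a) & ancestors(b)
    if not common:
        return None
    return max(common, key=min_depth)
```

In this data "pet" and "carnivore" are both common ancestors of "dog" and "cat", but "carnivore" lies further from the root and is selected.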

_lcs_ic(synset1, synset2, ic, verbose=False)

Get the information content of the least common subsumer that has the highest information content value.

Parameters:

synset1 (Synset) - First input synset.
synset2 (Synset) - Second input synset.
ic (dict) - an information content object (as returned by load_ic()).

Returns:

The information content of the two synsets and their most informative subsumer

load_ic(icfile)

Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

Parameters:

icfile (str) - The name of the wordnet_ic file (e.g. "ic-brown.dat")

Returns:

An information content dictionary
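
The two-level shape described above can be pictured with a stand-in dictionary. Everything in this sketch is hypothetical: the string keys, the concept names and the values are placeholders, and the real keys and entries are whatever load_ic() produces.

```python
# Hypothetical stand-in for the returned structure: one inner mapping per
# part of speech, each mapping a synset to its information content value.
toy_ic = {
    "NOUN": {"dog": 7.2, "entity": 0.0},
    "VERB": {"run": 5.1},
}


def toy_lookup(synset_name, pos, ic):
    """Return the IC value stored for a synset under its part of speech.

    Returns None when the part of speech or the synset is not covered.
    """
    return ic.get(pos, {}).get(synset_name)
```

Because only nouns and verbs are keyed, a lookup for any other part of speech finds nothing in this sketch.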