nltk.metrics package

Submodules

nltk.metrics.agreement module

Implementations of inter-annotator agreement coefficients surveyed by Artstein and Poesio (2007), Inter-Coder Agreement for Computational Linguistics.

An agreement coefficient calculates the amount that annotators agreed on label assignments beyond what is expected by chance.

In defining the AnnotationTask class, we use naming conventions similar to the paper’s terminology. There are three types of objects in an annotation task:

  • the coders (variables “c” and “C”)
  • the items to be annotated (variables “i” and “I”)
  • the potential categories to be assigned (variables “k” and “K”)

Additionally, it is often the case that we don’t want to treat two different labels as complete disagreement, and so the AnnotationTask constructor can also take a distance metric as a final argument. Distance metrics are simply functions that take two arguments, and return a value between 0.0 and 1.0 indicating the distance between them. If not supplied, the default is binary comparison between the arguments.
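
As a hedged illustration of the distance-function contract described above, the sketch below defines a hypothetical ordinal distance next to a binary comparison. The names ordinal_distance, binary_distance_sketch and SCALE, and the label values, are assumptions made for this example and are not part of NLTK.

```python
# Hypothetical ordered label scale, assumed for illustration only.
SCALE = ["low", "medium", "high"]


def ordinal_distance(label1, label2):
    """Return a distance in [0.0, 1.0] between two labels on SCALE."""
    span = len(SCALE) - 1
    return abs(SCALE.index(label1) - SCALE.index(label2)) / span


def binary_distance_sketch(label1, label2):
    """Binary comparison: 0.0 if the labels are equal, 1.0 otherwise."""
    return 0.0 if label1 == label2 else 1.0


print(ordinal_distance("low", "medium"))
print(binary_distance_sketch("low", "medium"))
```

With the ordinal variant, neighbouring labels count as partial rather than complete disagreement. A function of this shape would be supplied through the distance argument of the AnnotationTask constructor.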

The simplest way to initialize an AnnotationTask is with a list of triples, each containing a coder’s assignment for one object in the task:

task = AnnotationTask(data=[('c1', '1', 'v1'),('c2', '1', 'v1'),...])

Note that the data list needs to contain the same number of triples for each individual coder, containing category values for the same set of items.
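
The requirement above can be checked before the data is handed over. The following is a minimal sketch, assuming the annotations start out as a nested dictionary keyed by coder and then by item; the helper name to_triples and the sample labels are hypothetical.

```python
def to_triples(labels_by_coder):
    """Flatten {coder: {item: label}} into (coder, item, label) triples.

    Raises ValueError unless every coder labelled the same set of items.
    """
    item_sets = [set(items) for items in labels_by_coder.values()]
    if any(s != item_sets[0] for s in item_sets):
        raise ValueError("every coder must label the same set of items")
    return [
        (coder, item, label)
        for coder, items in labels_by_coder.items()
        for item, label in items.items()
    ]


annotations = {
    "c1": {"1": "v1", "2": "v2"},
    "c2": {"1": "v1", "2": "v1"},
}
triples = to_triples(annotations)
print(triples)
```

The resulting list has the (coder, item, label) shape that the data argument expects.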

  • Alpha (Krippendorff 1980)
  • Kappa (Cohen 1960)
  • S (Bennett, Albert and Goldstein 1954)
  • Pi (Scott 1955)

TODO: Describe handling of multiple coders and missing data

Expected results from the Artstein and Poesio survey paper:

>>> from nltk.metrics.agreement import AnnotationTask
>>> import os.path
>>> t = AnnotationTask(data=[x.split() for x in open(os.path.join(os.path.dirname(__file__), "artstein_poesio_example.txt"))])
>>> t.avg_Ao()
0.88
>>> t.pi()
0.7995322418977615...
>>> t.S()
0.8199999999999998...

This would have returned a wrong value (0.0) in @785fb79 as coders are in the wrong order. Subsequently, all values for pi(), S(), and kappa() would have been wrong as they are computed with avg_Ao().

>>> t2 = AnnotationTask(data=[('b','1','stat'),('a','1','stat')])
>>> t2.avg_Ao()
1.0

The following, of course, also works.

>>> t3 = AnnotationTask(data=[('a','1','othr'),('b','1','othr')])
>>> t3.avg_Ao()
1.0

class nltk.metrics.agreement.AnnotationTask(data=None, distance=<function binary_distance>)[source]

Bases: object

Represents an annotation task, i.e. people assign labels to items.

Notation tries to match notation in Artstein and Poesio (2007).

In general, coders and items can be represented as any hashable object. Integers, for example, are fine, though strings are more readable. Labels must support the distance functions applied to them, so e.g. a string-edit-distance makes no sense if your labels are integers, whereas interval distance needs numeric values. A notable case of this is the MASI metric, which requires Python sets.

Ae_kappa(cA, cB)[source]
Ao(cA, cB)[source]

Observed agreement between two coders on all items.

Do_Kw(max_distance=1.0)[source]

Averaged over all labelers

Do_Kw_pairwise(cA, cB, max_distance=1.0)[source]

The observed disagreement for the weighted kappa coefficient.

Do_alpha()[source]

The observed disagreement for the alpha coefficient.

The alpha coefficient, unlike the other metrics, uses this rather than observed agreement.

N(*args, **kwargs)

Implements the “n-notation” used in Artstein and Poesio (2007)

@deprecated: Use Nk, Nik or Nck instead

Nck(c, k)[source]
Nik(i, k)[source]
Nk(k)[source]
S()[source]

Bennett, Albert and Goldstein 1954

agr(cA, cB, i, data=None)[source]

Agreement between two coders on a given item

alpha()[source]

Krippendorff 1980

avg_Ao()[source]

Average observed agreement across all coders and items.

kappa()[source]

Cohen 1960. Averages naively over kappas for each coder pair.
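
For a single coder pair, Cohen's kappa can be sketched as observed agreement corrected by the agreement expected from each coder's own label distribution. The code below is an illustrative stand-alone version using binary comparison only; the function name cohen_kappa and the sample labels are assumptions, and it does not reproduce the AnnotationTask internals.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: product of each coder's own label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    # Undefined (division by zero) when expected agreement equals 1.
    return (observed - expected) / (1 - expected)


coder_a = ["y", "y", "n", "n"]
coder_b = ["y", "n", "n", "n"]
print(cohen_kappa(coder_a, coder_b))
```

With more than two coders, averaging this value over every coder pair corresponds to the naive averaging described above.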

kappa_pairwise(cA, cB)[source]
load_array(array)[source]

Load a sequence of annotation results, appending to any data already loaded.

The argument is a sequence of 3-tuples, each representing a coder’s labeling of an item:
(coder,item,label)
multi_kappa()[source]

Davies and Fleiss 1982. Averages over observed and expected agreements for each coder pair.

pi()[source]

Scott 1955; here, multi-pi. Equivalent to K from Siegel and Castellan (1988).

unicode_repr

Return repr(self).

weighted_kappa(max_distance=1.0)[source]

Cohen 1968

weighted_kappa_pairwise(cA, cB, max_distance=1.0)[source]

Cohen 1968

nltk.metrics.aline module

ALINE http://webdocs.cs.ualberta.ca/~kondrak/ Copyright 2002 by Grzegorz Kondrak.

ALINE is an algorithm for aligning phonetic sequences, described in [1]. This module is a port of Kondrak’s (2002) ALINE. It provides functions for phonetic sequence alignment and similarity analysis. These are useful in historical linguistics, sociolinguistics and synchronic phonology.

ALINE has parameters that can be tuned for desired output. These parameters are:

  • C_skip, C_sub, C_exp, C_vwl
  • Salience weights
  • Segmental features

In this implementation, some parameters have been changed from their default values as described in [1], in order to replicate published results. All changes are noted in comments.

Example usage

# Get optimal alignment of two phonetic sequences

>>> align('θin', 'tenwis') 
[[('θ', 't'), ('i', 'e'), ('n', 'n'), ('-', 'w'), ('-', 'i'), ('-', 's')]]

[1] G. Kondrak. Algorithms for Language Reconstruction. PhD dissertation, University of Toronto.

nltk.metrics.aline.R(p, q)[source]

Return relevant features for segment comparison.

(Kondrak 2002: 54)

nltk.metrics.aline.V(p)[source]

Return vowel weight if P is vowel.

(Kondrak 2002: 54)

nltk.metrics.aline.align(str1, str2, epsilon=0)[source]

Compute the alignment of two phonetic strings.

Parameters:
  • str1, str2 (str) – Two strings to be aligned
  • epsilon (float (0.0 to 1.0)) – Adjusts threshold similarity score for near-optimal alignments
Return type:

list(list(tuple(str, str)))

Returns:

Alignment(s) of str1 and str2

(Kondrak 2002: 51)

nltk.metrics.aline.delta(p, q)[source]

Return weighted sum of difference between P and Q.

(Kondrak 2002: 54)

nltk.metrics.aline.demo()[source]

A demonstration of the result of aligning phonetic sequences used in Kondrak’s (2002) dissertation.

nltk.metrics.aline.diff(p, q, f)[source]

Returns difference between phonetic segments P and Q for feature F.

(Kondrak 2002: 52, 54)

nltk.metrics.aline.sigma_exp(p, q)[source]

Returns score of an expansion/compression.

(Kondrak 2002: 54)

nltk.metrics.aline.sigma_skip(p)[source]

Returns score of an indel of P.

(Kondrak 2002: 54)

nltk.metrics.aline.sigma_sub(p, q)[source]

Returns score of a substitution of P with Q.

(Kondrak 2002: 54)

nltk.metrics.association module

Provides scoring functions for a number of association measures through a generic, abstract implementation in NgramAssocMeasures, and n-specific BigramAssocMeasures and TrigramAssocMeasures.

class nltk.metrics.association.BigramAssocMeasures[source]

Bases: nltk.metrics.association.NgramAssocMeasures

A collection of bigram association measures. Each association measure is provided as a function with three arguments:

bigram_score_fn(n_ii, (n_ix, n_xi), n_xx)

The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:

  • n_ii counts (w1, w2), i.e. the bigram being scored
  • n_ix counts (w1, *)
  • n_xi counts (*, w2)
  • n_xx counts (*, *), i.e. any bigram

This may be shown with respect to a contingency table:

        w1    ~w1
     ------ ------
 w2 | n_ii | n_oi | = n_xi
     ------ ------
~w2 | n_io | n_oo |
     ------ ------
     = n_ix        TOTAL = n_xx
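
The cells of the table above follow from the marginals by subtraction. A minimal sketch, assuming the four marginal counts are already known; the function name bigram_contingency and the sample counts are hypothetical.

```python
def bigram_contingency(n_ii, n_ix, n_xi, n_xx):
    """Derive the four contingency cells from bigram marginals."""
    n_oi = n_xi - n_ii                 # w2 preceded by a word other than w1
    n_io = n_ix - n_ii                 # w1 followed by a word other than w2
    n_oo = n_xx - n_ii - n_oi - n_io   # bigrams with neither w1 first nor w2 second
    return n_ii, n_oi, n_io, n_oo


cells = bigram_contingency(n_ii=8, n_ix=15, n_xi=20, n_xx=100)
print(cells)
```

The four cells always sum back to n_xx, which makes a quick sanity check on counted marginals.
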
classmethod chi_sq(n_ii, n_ix_xi_tuple, n_xx)[source]

Scores bigrams using chi-square, i.e. phi-sq multiplied by the number of bigrams, as in Manning and Schutze 5.3.3.

static dice(n_ii, n_ix_xi_tuple, n_xx)[source]

Scores bigrams using Dice’s coefficient.

classmethod fisher(*marginals)[source]

Scores bigrams using Fisher’s Exact Test (Pedersen 1996). Less sensitive to small counts than PMI or Chi Sq, but also more expensive to compute. Requires scipy.

classmethod phi_sq(*marginals)[source]

Scores bigrams using phi-square, the square of the Pearson correlation coefficient.

class nltk.metrics.association.ContingencyMeasures(measures)[source]

Bases: object

Wraps NgramAssocMeasures classes such that the arguments of association measures are contingency table values rather than marginals.

nltk.metrics.association.NGRAM = 0

Marginals index for the ngram count

class nltk.metrics.association.NgramAssocMeasures[source]

Bases: object

An abstract class defining a collection of generic association measures. Each public method returns a score, taking the following arguments:

score_fn(count_of_ngram,
         (count_of_n-1gram_1, ..., count_of_n-1gram_j),
         (count_of_n-2gram_1, ..., count_of_n-2gram_k),
         ...,
         (count_of_1gram_1, ..., count_of_1gram_n),
         count_of_total_words)

See BigramAssocMeasures and TrigramAssocMeasures

Inheriting classes should define a property _n, and a method _contingency which calculates contingency values from marginals in order for all association measures defined here to be usable.

classmethod chi_sq(*marginals)[source]

Scores ngrams using Pearson’s chi-square as in Manning and Schutze 5.3.3.

classmethod jaccard(*marginals)[source]

Scores ngrams using the Jaccard index.

classmethod likelihood_ratio(*marginals)[source]

Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4.

static mi_like(*marginals, **kwargs)[source]

Scores ngrams using a variant of mutual information. The keyword argument power sets an exponent (default 3) for the numerator. No logarithm of the result is calculated.

classmethod pmi(*marginals)[source]

Scores ngrams by pointwise mutual information, as in Manning and Schutze 5.4.
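
For the bigram case, pointwise mutual information reduces to a single logarithm over the marginals. The sketch below is a stand-alone illustration of that formula with made-up counts; bigram_pmi is a hypothetical name, not the library method.

```python
import math


def bigram_pmi(n_ii, n_ix, n_xi, n_xx):
    """PMI of a bigram: log2 of P(w1, w2) / (P(w1) * P(w2))."""
    return math.log2(n_ii * n_xx / (n_ix * n_xi))


# The pair occurs 8 times, each word 16 times, out of 128 bigrams.
print(bigram_pmi(n_ii=8, n_ix=16, n_xi=16, n_xx=128))
```

A positive score means the two words co-occur more often than independence would predict.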

classmethod poisson_stirling(*marginals)[source]

Scores ngrams using the Poisson-Stirling measure.

static raw_freq(*marginals)[source]

Scores ngrams by their frequency.

classmethod student_t(*marginals)[source]

Scores ngrams using Student’s t test with independence hypothesis for unigrams, as in Manning and Schutze 5.3.1.

class nltk.metrics.association.QuadgramAssocMeasures[source]

Bases: nltk.metrics.association.NgramAssocMeasures

A collection of quadgram association measures. Each association measure is provided as a function with five arguments:

quadgram_score_fn(n_iiii,
                  (n_iiix, n_iixi, n_ixii, n_xiii),
                  (n_iixx, n_ixix, n_ixxi, n_xixi, n_xxii, n_xiix),
                  (n_ixxx, n_xixx, n_xxix, n_xxxi),
                  n_xxxx)

The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:

  • n_iiii counts (w1, w2, w3, w4), i.e. the quadgram being scored
  • n_ixxi counts (w1, *, *, w4)
  • n_xxxx counts (*, *, *, *), i.e. any quadgram

nltk.metrics.association.TOTAL = -1

Marginals index for the number of words in the data

class nltk.metrics.association.TrigramAssocMeasures[source]

Bases: nltk.metrics.association.NgramAssocMeasures

A collection of trigram association measures. Each association measure is provided as a function with four arguments:

trigram_score_fn(n_iii,
                 (n_iix, n_ixi, n_xii),
                 (n_ixx, n_xix, n_xxi),
                 n_xxx)

The arguments constitute the marginals of a contingency table, counting the occurrences of particular events in a corpus. The letter i in the suffix refers to the appearance of the word in question, while x indicates the appearance of any word. Thus, for example:

  • n_iii counts (w1, w2, w3), i.e. the trigram being scored
  • n_ixx counts (w1, *, *)
  • n_xxx counts (*, *, *), i.e. any trigram

nltk.metrics.association.UNIGRAMS = -2

Marginals index for a tuple of each unigram count

nltk.metrics.confusionmatrix module

class nltk.metrics.confusionmatrix.ConfusionMatrix(reference, test, sort_by_count=False)[source]

Bases: object

The confusion matrix between a list of reference values and a corresponding list of test values. Entry [r,t] of this matrix is a count of the number of times that the reference value r corresponds to the test value t. E.g.:

>>> from nltk.metrics import ConfusionMatrix
>>> ref  = 'DET NN VB DET JJ NN NN IN DET NN'.split()
>>> test = 'DET VB VB DET NN NN NN IN DET NN'.split()
>>> cm = ConfusionMatrix(ref, test)
>>> print(cm['NN', 'NN'])
3

Note that the diagonal entries Ri=Tj of this matrix correspond to correct values, and the off-diagonal entries correspond to incorrect values.
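
The counting behind such a matrix can be sketched with a Counter over aligned (reference, test) pairs. This is an illustrative stand-in rather than the ConfusionMatrix implementation; confusion_counts and the variable names are assumptions.

```python
from collections import Counter


def confusion_counts(reference, hypothesis):
    """Count how often each reference value lines up with each test value."""
    if len(reference) != len(hypothesis):
        raise ValueError("lists must have the same length")
    return Counter(zip(reference, hypothesis))


ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
hyp = 'DET VB VB DET NN NN NN IN DET NN'.split()
counts = confusion_counts(ref, hyp)

# Diagonal entries (reference == test) are the correct values.
correct = sum(n for (r, t), n in counts.items() if r == t)
print(counts[('NN', 'NN')], correct)
```

Dividing the diagonal total by the number of items gives the accuracy of the test values.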

key()[source]
pretty_format(show_percents=False, values_in_chart=True, truncate=None, sort_by_count=False)[source]
Returns:

A multi-line string representation of this confusion matrix.

Parameters:
  • truncate (int) – If specified, then only show the specified number of values. Any sorting (e.g., sort_by_count) will be performed before truncation.
  • sort_by_count – If true, then sort by the count of each label in the reference data. I.e., labels that occur more frequently in the reference label will be towards the left edge of the matrix, and labels that occur less frequently will be towards the right edge.

@todo: add marginals?

unicode_repr()
nltk.metrics.confusionmatrix.demo()[source]

nltk.metrics.distance module

Distance Metrics.

Compute the distance between two items (usually strings). As metrics, they must satisfy the following three requirements:

  1. d(a, a) = 0
  2. d(a, b) >= 0
  3. d(a, c) <= d(a, b) + d(b, c)
nltk.metrics.distance.binary_distance(label1, label2)[source]

Simple equality test.

0.0 if the labels are identical, 1.0 if they are different.

>>> from nltk.metrics import binary_distance
>>> binary_distance(1,1)
0.0
>>> binary_distance(1,3)
1.0
nltk.metrics.distance.custom_distance(file)[source]
nltk.metrics.distance.demo()[source]
nltk.metrics.distance.edit_distance(s1, s2, substitution_cost=1, transpositions=False)[source]

Calculate the Levenshtein edit-distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2. For example, transforming “rain” to “shine” requires three steps, consisting of two substitutions and one insertion: “rain” -> “sain” -> “shin” -> “shine”. These operations could have been done in other orders, but at least three steps are needed.

Allows specifying the cost of substitution edits (e.g., “a” -> “b”), because sometimes it makes sense to assign greater penalties to substitutions.

This also optionally allows transposition edits (e.g., “ab” -> “ba”), though this is disabled by default.

Parameters:
  • s1, s2 (str) – The strings to be analysed
  • transpositions (bool) – Whether to allow transposition edits

Return type:

int
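
The edit-distance computation described above can be sketched as a row-by-row dynamic programme. This is a minimal illustration, not the NLTK implementation: the name levenshtein is hypothetical, and transposition edits are deliberately left out.

```python
def levenshtein(s1, s2, substitution_cost=1):
    """Edit distance using insertions, deletions and substitutions only."""
    # previous[j] is the cost of turning the processed prefix of s1 into s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,        # delete c1
                current[j - 1] + 1,     # insert c2
                previous[j - 1] + (0 if c1 == c2 else substitution_cost),
            ))
        previous = current
    return previous[-1]


print(levenshtein("rain", "shine"))
print(levenshtein("rain", "shine", substitution_cost=2))
```

Raising substitution_cost to 2 makes a substitution no cheaper than a deletion followed by an insertion, which is why the second call returns a larger distance.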

nltk.metrics.distance.fractional_presence(label)[source]
nltk.metrics.distance.interval_distance(label1, label2)[source]

Krippendorff’s interval distance metric

>>> from nltk.metrics import interval_distance
>>> interval_distance(1,10)
81

Krippendorff 1980, Content Analysis: An Introduction to its Methodology

nltk.metrics.distance.jaccard_distance(label1, label2)[source]

Distance metric comparing set-similarity.
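
A set-similarity distance of this kind is commonly computed as the share of the union that is not in the intersection. The sketch below assumes that definition; the function name and the handling of two empty sets are choices made for this example.

```python
def jaccard_distance_sketch(set1, set2):
    """Jaccard distance: (|union| - |intersection|) / |union|."""
    union = set1 | set2
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return (len(union) - len(set1 & set2)) / len(union)


print(jaccard_distance_sketch({1, 2}, {1, 2, 3, 4}))
```

Identical sets score 0.0 and disjoint sets score 1.0, so the result fits the 0.0 to 1.0 range expected of a distance function.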

nltk.metrics.distance.masi_distance(label1, label2)[source]

Distance metric that takes into account partial agreement when multiple labels are assigned.

>>> from nltk.metrics import masi_distance
>>> masi_distance(set([1, 2]), set([1, 2, 3, 4]))
0.335

Passonneau 2006, Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation.

nltk.metrics.distance.presence(label)[source]

Higher-order function to test presence of a given label

nltk.metrics.paice module

Counts Paice’s performance statistics for evaluating stemming algorithms.

What is required:
  • A dictionary of words grouped by their real lemmas
  • A dictionary of words grouped by stems from a stemming algorithm

When these are given, Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-rate relative to truncation (ERRT) are counted.

References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42–50.

class nltk.metrics.paice.Paice(lemmas, stems)[source]

Bases: object

Class for storing lemmas, stems and evaluation metrics.

update()[source]

Update statistics after lemmas and stems have been set.

nltk.metrics.paice.demo()[source]

Demonstration of the module.

nltk.metrics.paice.get_words_from_dictionary(lemmas)[source]

Get original set of words used for analysis.

Parameters:
  • lemmas (dict(str): list(str)) – A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.
Returns:

Set of words that exist as values in the dictionary

Return type:

set(str)

nltk.metrics.scores module

nltk.metrics.scores.accuracy(reference, test)[source]

Given a list of reference values and a corresponding list of test values, return the fraction of corresponding values that are equal. In particular, return the fraction of indices 0<=i<len(test) such that test[i] == reference[i].

Parameters:
  • reference (list) – An ordered list of reference values.
  • test (list) – A list of values to compare against the corresponding reference values.
Raises:

ValueError – If reference and test do not have the same length.

nltk.metrics.scores.approxrand(a, b, **kwargs)[source]

Returns an approximate significance level between two lists of independently generated test values.

Approximate randomization calculates significance by randomly drawing from a sample of the possible permutations. At the limit of the number of possible permutations, the significance level is exact. The approximate significance level is the sample mean number of times the statistic of the permuted lists varies from the actual statistic of the unpermuted argument lists.

Returns:

a tuple containing an approximate significance level, the count of the number of times the pseudo-statistic varied from the actual statistic, and the number of shuffles

Return type:

tuple

Parameters:
  • a (list) – a list of test values
  • b (list) – another list of independently generated test values
nltk.metrics.scores.demo()[source]
nltk.metrics.scores.f_measure(reference, test, alpha=0.5)[source]

Given a set of reference values and a set of test values, return the f-measure of the test values, when compared against the reference values. The f-measure is the harmonic mean of the precision and recall, weighted by alpha. In particular, given the precision p and recall r defined by:

  • p = card(reference intersection test)/card(test)
  • r = card(reference intersection test)/card(reference)

The f-measure is:

  • 1/(alpha/p + (1-alpha)/r)

If either reference or test is empty, then f_measure returns None.

Parameters:
  • reference (set) – A set of reference values.
  • test (set) – A set of values to compare against the reference set.
Return type:

float or None
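
The formulas above translate directly into set operations. The following sketch is illustrative only: the function name is hypothetical, and returning 0.0 when the two sets share no values is an assumption made here to avoid dividing by zero.

```python
def f_measure_sketch(reference, hypothesis, alpha=0.5):
    """Weighted harmonic mean of precision and recall over two sets."""
    if not reference or not hypothesis:
        return None
    overlap = len(reference & hypothesis)
    p = overlap / len(hypothesis)   # precision
    r = overlap / len(reference)    # recall
    if p == 0 or r == 0:
        # Assumption: no overlap is scored as 0.0.
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)


print(f_measure_sketch({1, 2, 3, 4}, {3, 4, 5}))
```

With alpha=0.5 the result equals the familiar 2pr/(p + r); values of alpha closer to 1.0 weight precision more heavily.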

nltk.metrics.scores.log_likelihood(reference, test)[source]

Given a list of reference values and a corresponding list of test probability distributions, return the average log likelihood of the reference values, given the probability distributions.

Parameters:
  • reference (list) – A list of reference values
  • test (list(ProbDistI)) – A list of probability distributions over values to compare against the corresponding reference values.
nltk.metrics.scores.precision(reference, test)[source]

Given a set of reference values and a set of test values, return the fraction of test values that appear in the reference set. In particular, return card(reference intersection test)/card(test). If test is empty, then return None.

Parameters:
  • reference (set) – A set of reference values.
  • test (set) – A set of values to compare against the reference set.
Return type:

float or None

nltk.metrics.scores.recall(reference, test)[source]

Given a set of reference values and a set of test values, return the fraction of reference values that appear in the test set. In particular, return card(reference intersection test)/card(reference). If reference is empty, then return None.

Parameters:
  • reference (set) – A set of reference values.
  • test (set) – A set of values to compare against the reference set.
Return type:

float or None

nltk.metrics.segmentation module

Text Segmentation Metrics

  1. Windowdiff

Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics 28, 19-36

  2. Generalized Hamming Distance

Bookstein A., Kulyukin V.A., Raita T. Generalized Hamming Distance, Information Retrieval 5, 2002, pp 353-375

Baseline implementation in C++ http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html

Study describing benefits of Generalized Hamming Distance Versus WindowDiff for evaluating text segmentation tasks: Begsten, Y. Quel indice pour mesurer l’efficacite en segmentation de textes ? TALN 2009

  3. Pk text segmentation metric

Beeferman D., Berger A., Lafferty J. (1999) Statistical Models for Text Segmentation, Machine Learning, 34, 177-210

nltk.metrics.segmentation.ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary='1')[source]

Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.

A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.

Recommended parameter values are a shift_cost_coeff of 2, together with ins_cost and del_cost equal to the mean segment length in the reference segmentation.

>>> # Same examples as Kulyukin C++ implementation
>>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
0.5
>>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
2.0
>>> ghd('011', '110', 1.0, 1.0, 0.5)
1.0
>>> ghd('1', '0', 1.0, 1.0, 0.5)
1.0
>>> ghd('111', '000', 1.0, 1.0, 0.5)
3.0
>>> ghd('000', '111', 1.0, 2.0, 0.5)
6.0
Parameters:
  • ref (str or list) – the reference segmentation
  • hyp (str or list) – the hypothetical segmentation
  • ins_cost (float) – insertion cost
  • del_cost (float) – deletion cost
  • shift_cost_coeff (float) – constant used to compute the cost of a shift: shift cost = shift_cost_coeff * |i - j|, where i and j are the positions indicating the shift
  • boundary (str or int or bool) – boundary value
Return type:

float

nltk.metrics.segmentation.pk(ref, hyp, k=None, boundary='1')[source]

Compute the Pk metric for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.

>>> '%.2f' % pk('0100'*100, '1'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0100'*100, 2)
'0.00'
Parameters:
  • ref (str or list) – the reference segmentation
  • hyp (str or list) – the segmentation to evaluate
  • k – window size, if None, set to half of the average reference segment length
  • boundary (str or int or bool) – boundary value
Return type:

float
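
One way to read the metric is as the fraction of sliding windows in which the reference and the hypothesis disagree about whether the window contains a boundary. The sketch below assumes that reading and requires an explicit k; pk_sketch is a hypothetical name and the code is not a transcription of the library function.

```python
def pk_sketch(ref, hyp, k, boundary="1"):
    """Fraction of width-k windows where ref and hyp disagree on
    whether the window contains at least one boundary."""
    windows = len(ref) - k + 1
    errors = 0
    for i in range(windows):
        ref_has = boundary in ref[i:i + k]
        hyp_has = boundary in hyp[i:i + k]
        if ref_has != hyp_has:
            errors += 1
    return errors / windows


print('%.2f' % pk_sketch('0100' * 100, '1' * 400, 2))
print('%.2f' % pk_sketch('0100' * 100, '0100' * 100, 2))
```

Under this reading, a hypothesis that marks every position as a boundary and one that marks none are penalised about equally, in line with the 0.50 scores in the examples above.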

nltk.metrics.segmentation.setup_module(module)[source]
nltk.metrics.segmentation.windowdiff(seg1, seg2, k, boundary='1', weighted=False)[source]

Compute the windowdiff score for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.

>>> s1 = "000100000010"
>>> s2 = "000010000100"
>>> s3 = "100000010000"
>>> '%.2f' % windowdiff(s1, s1, 3)
'0.00'
>>> '%.2f' % windowdiff(s1, s2, 3)
'0.30'
>>> '%.2f' % windowdiff(s2, s3, 3)
'0.80'
Parameters:
  • seg1 (str or list) – a segmentation
  • seg2 (str or list) – a segmentation
  • k (int) – window width
  • boundary (str or int or bool) – boundary value
  • weighted (boolean) – use the weighted variant of windowdiff
Return type:

float
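
The windowdiff score can be sketched as a sliding-window comparison of boundary counts. The version below is an assumption-laden illustration (the name windowdiff_sketch is hypothetical) that follows the parameter descriptions above: unweighted windows contribute at most 1, weighted windows contribute the full count difference.

```python
def windowdiff_sketch(seg1, seg2, k, boundary="1", weighted=False):
    """Mean per-window difference in boundary counts between two segmentations."""
    if len(seg1) != len(seg2):
        raise ValueError("segmentations must have equal length")
    if k > len(seg1):
        raise ValueError("window width k must not exceed the segmentation length")
    windows = len(seg1) - k + 1
    total = 0
    for i in range(windows):
        ndiff = abs(seg1[i:i + k].count(boundary) - seg2[i:i + k].count(boundary))
        # Unweighted: a window counts once if the boundary counts differ at all.
        total += ndiff if weighted else min(1, ndiff)
    return total / windows


s1 = "000100000010"
s2 = "000010000100"
s3 = "100000010000"
print('%.2f' % windowdiff_sketch(s1, s2, 3))
print('%.2f' % windowdiff_sketch(s2, s3, 3))
```

Near misses such as s1 against s2 are penalised only in the few windows that straddle the shifted boundary, which is why that pair scores lower than s2 against s3.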

nltk.metrics.spearman module

nltk.metrics.spearman.ranks_from_scores(scores, rank_gap=1e-15)[source]

Given a sequence of (key, score) tuples, yields each key with an increasing rank, tying with previous key’s rank if the difference between their scores is less than rank_gap. Suitable for use as an argument to spearman_correlation.

nltk.metrics.spearman.ranks_from_sequence(seq)[source]

Given a sequence, yields each element with an increasing rank, suitable for use as an argument to spearman_correlation.

nltk.metrics.spearman.spearman_correlation(ranks1, ranks2)[source]

Returns the Spearman correlation coefficient for two rankings, which should be dicts or sequences of (key, rank). The coefficient ranges from -1.0 (ranks are opposite) to 1.0 (ranks are identical), and is only calculated for keys in both rankings (for meaningful results, remove keys present in only one list before ranking).
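
The coefficient can be sketched with the usual rank-difference formula, restricted to keys present in both rankings. This is a minimal illustration; the name spearman_sketch is hypothetical, and returning 0.0 when fewer than two keys are shared is an assumption made for this example.

```python
def spearman_sketch(ranks1, ranks2):
    """Spearman correlation from two rankings given as dicts or (key, rank) pairs."""
    ranks1, ranks2 = dict(ranks1), dict(ranks2)
    shared = ranks1.keys() & ranks2.keys()
    n = len(shared)
    if n < 2:
        # Assumption: too few shared keys to correlate.
        return 0.0
    sum_sq = sum((ranks1[key] - ranks2[key]) ** 2 for key in shared)
    return 1 - 6 * sum_sq / (n * (n * n - 1))


forward = {"a": 0, "b": 1, "c": 2}
backward = [("a", 2), ("b", 1), ("c", 0)]
print(spearman_sketch(forward, forward))
print(spearman_sketch(forward, backward))
```

Because ranks are compared as given, keys missing from one ranking should be removed before the ranks are assigned, as noted above.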

Module contents

NLTK Metrics

Classes and methods for scoring processing modules.