nltk.classify package

Submodules

nltk.classify.api module

Interfaces for labeling tokens with category labels (or “class labels”).

ClassifierI is a standard interface for “single-category classification”, in which the set of categories is known, the number of categories is finite, and each text belongs to exactly one category.

MultiClassifierI is a standard interface for “multi-category classification”, which is like single-category classification except that each text belongs to zero or more categories.

class nltk.classify.api.ClassifierI[source]

Bases: object

A processing interface for labeling tokens with a single category label (or “class”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the classifier chooses from must be fixed and finite.

Subclasses must define:
  • labels()
  • either classify() or classify_many() (or both)
Subclasses may define:
  • either prob_classify() or prob_classify_many() (or both)
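
The following is a minimal illustrative sketch (not part of NLTK) of a subclass that satisfies this interface; it defines labels() and classify(), and inherits the default classify_many():

>>> from nltk.classify.api import ClassifierI
>>> class ConstantClassifier(ClassifierI):
...     """Toy classifier that always returns the same label."""
...     def __init__(self, label, all_labels):
...         self._label = label
...         self._labels = list(all_labels)
...     def labels(self):
...         return self._labels
...     def classify(self, featureset):
...         return self._label
>>> clf = ConstantClassifier('pos', ['pos', 'neg'])
>>> clf.classify({'contains(good)': True})
'pos'
>>> clf.classify_many([{'a': 1}, {'b': 2}])
['pos', 'pos']
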
classify(featureset)[source]
Returns:the most appropriate label for the given featureset.
Return type:label
classify_many(featuresets)[source]

Apply self.classify() to each element of featuresets. I.e.:

return [self.classify(fs) for fs in featuresets]
Return type:list(label)
labels()[source]
Returns:the list of category labels used by this classifier.
Return type:list of (immutable)
prob_classify(featureset)[source]
Returns:a probability distribution over labels for the given featureset.
Return type:ProbDistI
prob_classify_many(featuresets)[source]

Apply self.prob_classify() to each element of featuresets. I.e.:

return [self.prob_classify(fs) for fs in featuresets]
Return type:list(ProbDistI)
class nltk.classify.api.MultiClassifierI[source]

Bases: object

A processing interface for labeling tokens with zero or more category labels (or “labels”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the multi-classifier chooses from must be fixed and finite.

Subclasses must define:
  • labels()
  • either classify() or classify_many() (or both)
Subclasses may define:
  • either prob_classify() or prob_classify_many() (or both)
classify(featureset)[source]
Returns:the most appropriate set of labels for the given featureset.
Return type:set(label)
classify_many(featuresets)[source]

Apply self.classify() to each element of featuresets. I.e.:

return [self.classify(fs) for fs in featuresets]
Return type:list(set(label))
labels()[source]
Returns:the list of category labels used by this classifier.
Return type:list of (immutable)
prob_classify(featureset)[source]
Returns:a probability distribution over sets of labels for the given featureset.
Return type:ProbDistI
prob_classify_many(featuresets)[source]

Apply self.prob_classify() to each element of featuresets. I.e.:

return [self.prob_classify(fs) for fs in featuresets]
Return type:list(ProbDistI)

nltk.classify.decisiontree module

A classifier model that decides which label to assign to a token on the basis of a tree structure, where branches correspond to conditions on feature values, and leaves correspond to label assignments.
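
A brief illustrative sketch of training and applying the classifier (the toy featuresets below are made up for illustration):

>>> from nltk.classify.decisiontree import DecisionTreeClassifier
>>> train = [({'last_letter': 'a'}, 'female'),
...          ({'last_letter': 'k'}, 'male'),
...          ({'last_letter': 'a'}, 'female'),
...          ({'last_letter': 'o'}, 'male')]
>>> classifier = DecisionTreeClassifier.train(train)
>>> classifier.classify({'last_letter': 'a'})
'female'

The learned tree can be inspected with pretty_format() or pseudocode(), described below.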

class nltk.classify.decisiontree.DecisionTreeClassifier(label, feature_name=None, decisions=None, default=None)[source]

Bases: nltk.classify.api.ClassifierI

static best_binary_stump(feature_names, labeled_featuresets, feature_values, verbose=False)[source]
static best_stump(feature_names, labeled_featuresets, verbose=False)[source]
static binary_stump(feature_name, feature_value, labeled_featuresets)[source]
classify(featureset)[source]
Returns:the most appropriate label for the given featureset.
Return type:label
error(labeled_featuresets)[source]
labels()[source]
Returns:the list of category labels used by this classifier.
Return type:list of (immutable)
static leaf(labeled_featuresets)[source]
pretty_format(width=70, prefix='', depth=4)[source]

Return a string containing a pretty-printed version of this decision tree. Each line in this string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the decision tree.

pseudocode(prefix='', depth=4)[source]

Return a string representation of this decision tree that expresses the decisions it makes as a nested set of pseudocode if statements.

refine(labeled_featuresets, entropy_cutoff, depth_cutoff, support_cutoff, binary=False, feature_values=None, verbose=False)[source]
static stump(feature_name, labeled_featuresets)[source]
static train(labeled_featuresets, entropy_cutoff=0.05, depth_cutoff=100, support_cutoff=10, binary=False, feature_values=None, verbose=False)[source]
Parameters:binary – If true, then treat all feature/value pairs as individual binary features, rather than using a single n-way branch for each feature.
unicode_repr

Return repr(self).

nltk.classify.decisiontree.demo()[source]
nltk.classify.decisiontree.f(x)[source]

nltk.classify.maxent module

A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy. A probability distribution is “empirically consistent” with a set of training data if its estimated frequency with which a class and a feature vector value co-occur is equal to the actual frequency in the data.

Terminology: ‘feature’

The term feature is usually used to refer to some property of an unlabeled token. For example, when performing word sense disambiguation, we might define a 'prevword' feature whose value is the word preceding the target word. However, in the context of maxent modeling, the term feature is typically used to refer to a property of a “labeled” token. In order to prevent confusion, we will introduce two distinct terms to disambiguate these two different concepts:

  • An “input-feature” is a property of an unlabeled token.
  • A “joint-feature” is a property of a labeled token.

In the rest of the nltk.classify module, the term “features” is used to refer to what we will call “input-features” in this module.

In literature that describes and discusses maximum entropy models, input-features are typically called “contexts”, and joint-features are simply referred to as “features”.

Converting Input-Features to Joint-Features

In maximum entropy models, joint-features are required to have numeric values. Typically, each input-feature input_feat is mapped to a set of joint-features of the form:

joint_feat(token, label) = { 1 if input_feat(token) == feat_val
                           {      and label == some_label
                           {
                           { 0 otherwise

For all values of feat_val and some_label. This mapping is performed by classes that implement the MaxentFeatureEncodingI interface.
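
As a rough illustration of this mapping (the input-feature name, value, and label below are hypothetical), a single binary joint-feature behaves like this:

>>> def joint_feat(token, label):
...     # fires only for one particular (input-feature value, label) pair
...     return 1 if token.get('prevword') == 'the' and label == 'NOUN' else 0
>>> joint_feat({'prevword': 'the'}, 'NOUN')
1
>>> joint_feat({'prevword': 'a'}, 'NOUN')
0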

class nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False)[source]

Bases: nltk.classify.maxent.MaxentFeatureEncodingI

A feature encoding that generates vectors containing binary joint-features of the form:

joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                    {
                    { 0 otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method. This method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

The unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
                    {      and l == label
                    {
                    { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

joint_feat(fs, l) = { 1 if (l == label)
                    {
                    { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.

describe(f_id)[source]
Returns:A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
Return type:str
encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type:list(tuple(int, int))
labels()[source]
Returns:A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
Return type:list
length()[source]
Returns:The size of the fixed-length joint-feature vectors that are generated by this encoding.
Return type:int
classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]

Construct and return new feature encoding, based on a given training corpus train_toks. See the class description BinaryMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.

Parameters:
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
  • count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs (with value 1) fewer than count_cutoff times in the training corpus, then it is not included in the generated encoding.
  • labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
  • options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
nltk.classify.maxent.ConditionalExponentialClassifier

Alias for MaxentClassifier.

alias of nltk.classify.maxent.MaxentClassifier

class nltk.classify.maxent.FunctionBackedMaxentFeatureEncoding(func, length, labels)[source]

Bases: nltk.classify.maxent.MaxentFeatureEncodingI

A feature encoding that calls a user-supplied function to map a given featureset/label pair to a sparse joint-feature vector.

describe(fid)[source]
Returns:A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
Return type:str
encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type:list(tuple(int, int))
labels()[source]
Returns:A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
Return type:list
length()[source]
Returns:The size of the fixed-length joint-feature vectors that are generated by this encoding.
Return type:int
class nltk.classify.maxent.GISEncoding(labels, mapping, unseen_features=False, alwayson_features=False, C=None)[source]

Bases: nltk.classify.maxent.BinaryMaxentFeatureEncoding

A binary feature encoding which adds one new joint-feature to the joint-features defined by BinaryMaxentFeatureEncoding: a correction feature, whose value is chosen to ensure that the sparse vector always sums to a constant non-negative number. This new feature is used to ensure two preconditions for the GIS training algorithm:

  • At least one feature vector index must be nonzero for every token.
  • The feature vector must sum to a constant non-negative number for every token.
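
A purely illustrative sketch of the correction idea (the real encoding appends the correction feature automatically and chooses C itself if it is not given; the 'correction' index below is hypothetical):

>>> def add_correction(vector, C):
...     # pad the sparse (index, value) vector so that its values sum to C
...     total = sum(val for (idx, val) in vector)
...     return vector + [('correction', C - total)]
>>> add_correction([(3, 1), (7, 1)], C=5)
[(3, 1), (7, 1), ('correction', 3)]
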
C

The non-negative constant that all encoded feature vectors will sum to.

describe(f_id)[source]
Returns:A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
Return type:str
encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type:list(tuple(int, int))
length()[source]
Returns:The size of the fixed-length joint-feature vectors that are generated by this encoding.
Return type:int
class nltk.classify.maxent.MaxentClassifier(encoding, weights, logarithmic=True)[source]

Bases: nltk.classify.api.ClassifierI

A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each (featureset, label) pair to a vector. The probability of each label is then computed using the following equation:

                          dotprod(weights, encode(fs,label))
prob(label|fs) = ---------------------------------------------------
                 sum(dotprod(weights, encode(fs,l)) for l in labels)

Where dotprod is the dot product:

dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
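
Written out in plain Python, the computation above might look like the following sketch (illustrative only; it mirrors the formula above and assumes non-logarithmic weights, while the actual classifier also supports logarithmic weights, in which case the scores are exponentiated first):

>>> def maxent_prob(weights, encoding, fs, label):
...     def score(l):
...         # dotprod(weights, encode(fs, l)) over the sparse joint-feature vector
...         return sum(weights[i] * v for (i, v) in encoding.encode(fs, l))
...     return score(label) / sum(score(l) for l in encoding.labels())
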
ALGORITHMS = ['GIS', 'IIS', 'MEGAM', 'TADM']

A list of the algorithm names that are accepted for the train() method’s algorithm parameter.

classify(featureset)[source]
Returns:the most appropriate label for the given featureset.
Return type:label
explain(featureset, columns=4)[source]

Print a table showing the effect of each of the features in the given feature set, and how they combine to determine the probabilities of each label for that featureset.

labels()[source]
Returns:the list of category labels used by this classifier.
Return type:list of (immutable)
most_informative_features(n=10)[source]

Generates the ranked list of informative features from most to least.

prob_classify(featureset)[source]
Returns:a probability distribution over labels for the given featureset.
Return type:ProbDistI
set_weights(new_weights)[source]

Set the feature weight vector for this classifier.

Parameters:new_weights (list of float) – The new feature weight vector.

show_most_informative_features(n=10, show='all')[source]
Parameters:
  • show (str) – 'all', 'neg', or 'pos' (show all features, only those with negative weights, or only those with positive weights)
  • n (int) – The number of top features to show
classmethod train(train_toks, algorithm=None, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **cutoffs)[source]

Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.

Return type:

MaxentClassifier

Returns:

The new maxent classifier

Parameters:
  • train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.
  • algorithm (str) –

    A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available.

    • Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')
    • External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')

    The default algorithm is 'IIS'.

  • trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.
  • encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. If none is specified, then a BinaryMaxentFeatureEncoding will be built based on the features that are attested in the training corpus.
  • labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.
  • gaussian_prior_sigma – The sigma value for a gaussian prior on model weights. Currently, this is supported by megam. For other algorithms, its value is ignored.
  • cutoffs

    Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)

    • max_iter=v: Terminate after v iterations.
    • min_ll=v: Terminate after the negative average log-likelihood drops under v.
    • min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.
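
For example, training with IIS on a toy corpus and an iteration cutoff might look like this (requires numpy; the megam and TADM algorithms need external binaries):

>>> from nltk.classify import MaxentClassifier
>>> train = [({'a': 1, 'b': 1}, 'x'),
...          ({'a': 1, 'c': 1}, 'y')]
>>> clf = MaxentClassifier.train(train, algorithm='iis', trace=0, max_iter=10)
>>> clf.classify({'a': 1, 'b': 1})
'x'
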
unicode_repr()

Return repr(self).

weights()[source]
Returns:The feature weight vector for this classifier.
Return type:list of float
class nltk.classify.maxent.MaxentFeatureEncodingI[source]

Bases: object

A mapping that converts a set of input-feature values to a vector of joint-feature values, given a label. This conversion is necessary to translate featuresets into a format that can be used by maximum entropy models.

The set of joint-features used by a given encoding is fixed, and each index in the generated joint-feature vectors corresponds to a single joint-feature. The length of the generated joint-feature vectors is therefore constant (for a given encoding).

Because the joint-feature vectors generated by MaxentFeatureEncodingI are typically very sparse, they are represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Feature encodings are generally created using the train() method, which generates an appropriate encoding based on the input-feature values and labels that are present in a given corpus.

describe(fid)[source]
Returns:A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
Return type:str
encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type:list(tuple(int, int))
labels()[source]
Returns:A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
Return type:list
length()[source]
Returns:The size of the fixed-length joint-feature vectors that are generated by this encoding.
Return type:int
train(train_toks)[source]

Construct and return new feature encoding, based on a given training corpus train_toks.

Parameters:train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
class nltk.classify.maxent.TadmEventMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False)[source]

Bases: nltk.classify.maxent.BinaryMaxentFeatureEncoding

describe(fid)[source]
Returns:A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
Return type:str
encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type:list(tuple(int, int))
labels()[source]
Returns:A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
Return type:list
length()[source]
Returns:The size of the fixed-length joint-feature vectors that are generated by this encoding.
Return type:int
classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]

Construct and return new feature encoding, based on a given training corpus train_toks. See the class description BinaryMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.

Parameters:
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
  • count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs (with value 1) fewer than count_cutoff times in the training corpus, then it is not included in the generated encoding.
  • labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
  • options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
class nltk.classify.maxent.TadmMaxentClassifier(encoding, weights, logarithmic=True)[source]

Bases: nltk.classify.maxent.MaxentClassifier

classmethod train(train_toks, **kwargs)[source]

Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.

Return type:

MaxentClassifier

Returns:

The new maxent classifier

Parameters:
  • train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.
  • algorithm (str) –

    A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available.

    • Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')
    • External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')

    The default algorithm is 'IIS'.

  • trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.
  • encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. If none is specified, then a BinaryMaxentFeatureEncoding will be built based on the features that are attested in the training corpus.
  • labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.
  • gaussian_prior_sigma – The sigma value for a gaussian prior on model weights. Currently, this is supported by megam. For other algorithms, its value is ignored.
  • cutoffs

    Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)

    • max_iter=v: Terminate after v iterations.
    • min_ll=v: Terminate after the negative average log-likelihood drops under v.
    • min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.
class nltk.classify.maxent.TypedMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False)[source]

Bases: nltk.classify.maxent.MaxentFeatureEncodingI

A feature encoding that generates vectors containing integer, float and binary joint-features of the form:

Binary (for string and boolean features):

joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                    {
                    { 0 otherwise

Value (for integer and float features):

joint_feat(fs, l) = { fval if (fs[fname] == type(fval))
                    {         and (l == label)
                    {
                    { not encoded otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method.

For string and boolean features [type(fval) not in (int, float)] this method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

For integer and float features [type(fval) in (int, float)] this method will create one feature for each combination of fname and label that occurs at least once in the training corpus.

For binary features the unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
                    {      and l == label
                    {
                    { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

joint_feat(fs, l) = { 1 if (l == label)
                    {
                    { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.

describe(f_id)[source]
Returns:A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
Return type:str
encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type:list(tuple(int, int))
labels()[source]
Returns:A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
Return type:list
length()[source]
Returns:The size of the fixed-length joint-feature vectors that are generated by this encoding.
Return type:int
classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]

Construct and return new feature encoding, based on a given training corpus train_toks. See the class description TypedMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.

Note: recognized feature value types are int and float; other types are interpreted as regular binary features.

Parameters:
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
  • count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs (with value 1) fewer than count_cutoff times in the training corpus, then it is not included in the generated encoding.
  • labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
  • options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
nltk.classify.maxent.calculate_deltas(train_toks, classifier, unattested, ffreq_empirical, nfmap, nfarray, nftranspose, encoding)[source]

Calculate the update values for the classifier weights for this iteration of IIS. These update weights are the value of delta that solves the equation:

ffreq_empirical[i] = SUM[fs,l] (classifier.prob_classify(fs).prob(l) *
                                feature_vector(fs,l)[i] *
                                exp(delta[i] * nf(feature_vector(fs,l))))
Where:
  • (fs,l) is a (featureset, label) tuple from train_toks
  • feature_vector(fs,l) = encoding.encode(fs,l)
  • nf(vector) = sum([val for (id,val) in vector])

This method uses Newton’s method to solve this equation for delta[i]. In particular, it starts with a guess of delta[i] = 1; and iteratively updates delta with:

delta[i] -= (ffreq_empirical[i] - sum1[i])/(-sum2[i])

until convergence, where sum1 and sum2 are defined as:

sum1[i](delta) = SUM[fs,l] f[i](fs,l,delta)
sum2[i](delta) = SUM[fs,l] (f[i](fs,l,delta) * nf(feature_vector(fs,l)))
f[i](fs,l,delta) = (classifier.prob_classify(fs).prob(l) *
                    feature_vector(fs,l)[i] *
                    exp(delta[i] * nf(feature_vector(fs,l))))

Note that sum1 and sum2 depend on delta; so they need to be re-computed each iteration.

The variables nfmap, nfarray, and nftranspose are used to generate a dense encoding for nf(ltext). This allows _deltas to calculate sum1 and sum2 using matrices, which yields a significant performance improvement.

Parameters:
  • train_toks (list(tuple(dict, str))) – The set of training tokens.
  • classifier (ClassifierI) – The current classifier.
  • ffreq_empirical (sequence of float) – An array containing the empirical frequency for each feature. The ith element of this array is the empirical frequency for feature i.
  • unattested (sequence of int) – An array that is 1 for features that are not attested in the training data, and 0 for features that are attested. In other words, unattested[i]==1 iff ffreq_empirical[i]==0.
  • nfmap (dict(int -> int)) – A map that can be used to compress nf to a dense vector.
  • nfarray (array(float)) – An array that can be used to uncompress nf from a dense vector.
  • nftranspose (array(float)) – The transpose of nfarray
nltk.classify.maxent.calculate_empirical_fcount(train_toks, encoding)[source]
nltk.classify.maxent.calculate_estimated_fcount(classifier, train_toks, encoding)[source]
nltk.classify.maxent.calculate_nfmap(train_toks, encoding)[source]

Construct a map that can be used to compress nf (which is typically sparse).

nf(feature_vector) is the sum of the feature values for feature_vector.

This represents the number of features that are active for a given labeled text. This method finds all values of nf(t) that are attested for at least one token in the given list of training tokens; and constructs a dictionary mapping these attested values to a continuous range 0…N. For example, if the only values of nf() that were attested were 3, 5, and 7, then _nfmap might return the dictionary {3:0, 5:1, 7:2}.

Returns:A map that can be used to compress nf to a dense vector.
Return type:dict(int -> int)
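
A standalone sketch of the mapping step described above (illustrative only; calculate_nfmap() itself is the real implementation):

>>> def build_nfmap(train_toks, encoding):
...     # collect every attested value of nf(), then map those values onto 0...N
...     nfset = set()
...     for (tok, _) in train_toks:
...         for label in encoding.labels():
...             nfset.add(sum(val for (fid, val) in encoding.encode(tok, label)))
...     return dict((nf, i) for (i, nf) in enumerate(nfset))
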
nltk.classify.maxent.demo()[source]
nltk.classify.maxent.train_maxent_classifier_with_gis(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]

Train a new ConditionalExponentialClassifier, using the given training samples, using the Generalized Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks.

See:train_maxent_classifier() for parameter descriptions.
nltk.classify.maxent.train_maxent_classifier_with_iis(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]

Train a new ConditionalExponentialClassifier, using the given training samples, using the Improved Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks.

See:train_maxent_classifier() for parameter descriptions.
nltk.classify.maxent.train_maxent_classifier_with_megam(train_toks, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **kwargs)[source]

Train a new ConditionalExponentialClassifier, using the given training samples, using the external megam library. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks.

See:train_maxent_classifier() for parameter descriptions.
See:nltk.classify.megam

nltk.classify.megam module

A set of functions used to interface with the external megam maxent optimization package. Before megam can be used, you should tell NLTK where it can find the megam binary, using the config_megam() function. Typical usage:

>>> from nltk.classify import megam
>>> megam.config_megam() # pass path to megam if not found in PATH 
[Found megam: ...]

Use with MaxentClassifier, as in the example below; see the MaxentClassifier documentation for details.

nltk.classify.MaxentClassifier.train(corpus, 'megam')
nltk.classify.megam.call_megam(args)[source]

Call the megam binary with the given arguments.

nltk.classify.megam.config_megam(bin=None)[source]

Configure NLTK’s interface to the megam maxent optimization package.

Parameters:bin (str) – The full path to the megam binary. If not specified, then nltk will search the system for a megam binary; and if one is not found, it will raise a LookupError exception.
nltk.classify.megam.parse_megam_weights(s, features_count, explicit=True)[source]

Given the stdout output generated by megam when training a model, return a numpy array containing the corresponding weight vector. This function does not currently handle bias features.

nltk.classify.megam.write_megam_file(train_toks, encoding, stream, bernoulli=True, explicit=True)[source]

Generate an input file for megam based on the given corpus of classified tokens.

Parameters:
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
  • encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. May optionally implement a cost() method in order to assign different costs to different class predictions.
  • stream (stream) – The stream to which the megam input file should be written.
  • bernoulli – If true, then use the ‘bernoulli’ format. I.e., all joint features have binary values, and are listed iff they are true. Otherwise, list feature values explicitly. If bernoulli=False, then you must call megam with the -fvals option.
  • explicit – If true, then use the ‘explicit’ format. I.e., list the features that would fire for any of the possible labels, for each token. If explicit=True, then you must call megam with the -explicit option.

nltk.classify.naivebayes module

A classifier based on the Naive Bayes algorithm. In order to find the probability for a label, this algorithm first uses the Bayes rule to express P(label|features) in terms of P(label) and P(features|label):

                    P(label) * P(features|label)
P(label|features) = ----------------------------
                            P(features)

The algorithm then makes the ‘naive’ assumption that all features are independent, given the label:

                    P(label) * P(f1|label) * ... * P(fn|label)
P(label|features) = ------------------------------------------
                                  P(features)

Rather than computing P(features) explicitly, the algorithm just calculates the numerator for each label, and normalizes them so they sum to one:

                    P(label) * P(f1|label) * ... * P(fn|label)
P(label|features) = ------------------------------------------
                     SUM[l]( P(l) * P(f1|l) * ... * P(fn|l) )
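
A small sketch of this computation using plain dicts of probabilities (illustrative only; the real classifier works with ProbDistI objects and log probabilities):

>>> def naive_bayes_probs(p_label, p_feature_given_label, features):
...     # numerator for each label, then normalize so the values sum to one
...     scores = {}
...     for label in p_label:
...         score = p_label[label]
...         for f in features:
...             score *= p_feature_given_label[label][f]
...         scores[label] = score
...     total = sum(scores.values())
...     return dict((label, s / total) for (label, s) in scores.items())
>>> probs = naive_bayes_probs({'sports': 0.5, 'politics': 0.5},
...                           {'sports': {'ball': 0.8}, 'politics': {'ball': 0.1}},
...                           ['ball'])
>>> round(probs['sports'], 2)
0.89
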
class nltk.classify.naivebayes.NaiveBayesClassifier(label_probdist, feature_probdist)[source]

Bases: nltk.classify.api.ClassifierI

A Naive Bayes classifier. Naive Bayes classifiers are parameterized by two probability distributions:

  • P(label) gives the probability that an input will receive each label, given no information about the input’s features.
  • P(fname=fval|label) gives the probability that a given feature (fname) will receive a given value (fval), given the label (label).

If the classifier encounters an input with a feature that has never been seen with any label, then rather than assigning a probability of 0 to all labels, it will ignore that feature.

The feature value ‘None’ is reserved for unseen feature values; you generally should not use ‘None’ as a feature value for one of your own features.
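
For example (toy data; the unseen feature in the last call is simply ignored, as described above):

>>> from nltk.classify import NaiveBayesClassifier
>>> train = [({'a': 1}, 'x'), ({'b': 1}, 'y')]
>>> clf = NaiveBayesClassifier.train(train)
>>> clf.classify({'a': 1, 'never_seen_feature': 99})
'x'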

classify(featureset)[source]
Returns:the most appropriate label for the given featureset.
Return type:label
labels()[source]
Returns:the list of category labels used by this classifier.
Return type:list of (immutable)
most_informative_features(n=100)[source]

Return a list of the ‘most informative’ features used by this classifier. For the purpose of this function, the informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:

max[ P(fname=fval|label1) / P(fname=fval|label2) ]
prob_classify(featureset)[source]
Returns:a probability distribution over labels for the given featureset.
Return type:ProbDistI
show_most_informative_features(n=10)[source]
classmethod train(labeled_featuresets, estimator=<class 'nltk.probability.ELEProbDist'>)[source]
Parameters:labeled_featuresets – A list of classified featuresets, i.e., a list of tuples (featureset, label).
nltk.classify.naivebayes.demo()[source]

nltk.classify.positivenaivebayes module

A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets. In other words, assume we want to build a classifier that assigns each example to one of two complementary classes (e.g., male names and female names). If we have a training set with labeled examples for both classes, we can use a standard Naive Bayes Classifier. However, consider the case when we only have labeled examples for one of the classes, and other, unlabeled, examples. Then, assuming a prior distribution on the two labels, we can use the unlabeled set to estimate the frequencies of the various features.

Let the two possible labels be 1 and 0, and let’s say we only have examples labeled 1 and unlabeled examples. We are also given an estimate of P(1).

We compute P(feature|1) exactly as in the standard case.

To compute P(feature|0), we first estimate P(feature) from the unlabeled set (we are assuming that the unlabeled examples are drawn according to the given prior distribution) and then express the conditional probability as:

                 P(feature) - P(feature|1) * P(1)
P(feature|0) = ----------------------------------
                              P(0)

Example:

>>> from nltk.classify import PositiveNaiveBayesClassifier

Some sentences about sports:

>>> sports_sentences = [ 'The team dominated the game',
...                      'They lost the ball',
...                      'The game was intense',
...                      'The goalkeeper catched the ball',
...                      'The other team controlled the ball' ]

Mixed topics, including sports:

>>> various_sentences = [ 'The President did not comment',
...                       'I lost the keys',
...                       'The team won the game',
...                       'Sara has two kids',
...                       'The ball went off the court',
...                       'They had the ball for the whole game',
...                       'The show is over' ]

The features of a sentence are simply the words it contains:

>>> def features(sentence):
...     words = sentence.lower().split()
...     return dict(('contains(%s)' % w, True) for w in words)

We use the sports sentences as positive examples and the mixed ones as unlabeled examples:

>>> positive_featuresets = list(map(features, sports_sentences))
>>> unlabeled_featuresets = list(map(features, various_sentences))
>>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
...                                                 unlabeled_featuresets)

Is the following sentence about sports?

>>> classifier.classify(features('The cat is on the table'))
False

What about this one?

>>> classifier.classify(features('My team lost the game'))
True
class nltk.classify.positivenaivebayes.PositiveNaiveBayesClassifier(label_probdist, feature_probdist)[source]

Bases: nltk.classify.naivebayes.NaiveBayesClassifier

static train(positive_featuresets, unlabeled_featuresets, positive_prob_prior=0.5, estimator=<class 'nltk.probability.ELEProbDist'>)[source]
Parameters:
  • positive_featuresets – A list of featuresets that are known as positive examples (i.e., their label is True).
  • unlabeled_featuresets – A list of featuresets whose label is unknown.
  • positive_prob_prior – A prior estimate of the probability of the label True (default 0.5).
nltk.classify.positivenaivebayes.demo()[source]

nltk.classify.rte_classify module

Simple classifier for RTE corpus.

It calculates the overlap in words and named entities between text and hypothesis, and also whether there are words or named entities in the hypothesis which fail to occur in the text, since this is an indicator that the hypothesis is more informative than (i.e., not entailed by) the text.

TO DO: better Named Entity classification
TO DO: add lemmatization

class nltk.classify.rte_classify.RTEFeatureExtractor(rtepair, stop=True, use_lemmatize=False)[source]

Bases: object

This builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.
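
A usage sketch, assuming the RTE corpus data has been downloaded (outputs omitted since they depend on the corpus):

>>> from nltk.corpus import rte
>>> from nltk.classify.rte_classify import RTEFeatureExtractor
>>> rtepair = rte.pairs(['rte3_dev.xml'])[33]
>>> extractor = RTEFeatureExtractor(rtepair)
>>> extractor.overlap('word')      # words shared by text and hypothesis
>>> extractor.hyp_extra('word')    # words in the hypothesis but not the text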

hyp_extra(toktype, debug=True)[source]

Compute the extraneous material in the hypothesis.

Parameters:toktype ('ne' or 'word') – distinguish Named Entities from ordinary words
overlap(toktype, debug=False)[source]

Compute the overlap between text and hypothesis.

Parameters:toktype ('ne' or 'word') – distinguish Named Entities from ordinary words
nltk.classify.rte_classify.rte_classifier(algorithm)[source]
nltk.classify.rte_classify.rte_features(rtepair)[source]
nltk.classify.rte_classify.rte_featurize(rte_pairs)[source]

nltk.classify.scikitlearn module

scikit-learn (http://scikit-learn.org) is a machine learning library for Python. It supports many classification algorithms, including SVMs, Naive Bayes, logistic regression (MaxEnt) and decision trees.

This package implements a wrapper around scikit-learn classifiers. To use this wrapper, construct a scikit-learn estimator object, then use that to construct a SklearnClassifier. E.g., to wrap a linear SVM with default settings:

>>> from sklearn.svm import LinearSVC
>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> classif = SklearnClassifier(LinearSVC())

A scikit-learn classifier may include preprocessing steps when it’s wrapped in a Pipeline object. The following constructs and wraps a Naive Bayes text classifier with tf-idf weighting and chi-square feature selection to get the best 1000 features:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([('tfidf', TfidfTransformer()),
...                      ('chi2', SelectKBest(chi2, k=1000)),
...                      ('nb', MultinomialNB())])
>>> classif = SklearnClassifier(pipeline)
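
The wrapped estimator is then trained and applied through the usual ClassifierI interface, for example (toy numeric featuresets for illustration):

>>> train = [({'a': 4, 'b': 1}, 'x'),
...          ({'a': 5, 'b': 2}, 'x'),
...          ({'a': 0, 'b': 3}, 'y'),
...          ({'a': 1, 'b': 4}, 'y')]
>>> classif = SklearnClassifier(LinearSVC()).train(train)
>>> predictions = classif.classify_many([{'a': 3, 'b': 1}, {'a': 0, 'b': 4}])
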
class nltk.classify.scikitlearn.SklearnClassifier(estimator, dtype=<class 'float'>, sparse=True)[source]

Bases: nltk.classify.api.ClassifierI

Wrapper for scikit-learn classifiers.

classify_many(featuresets)[source]

Classify a batch of samples.

Parameters:featuresets – An iterable over featuresets, each a dict mapping strings to either numbers, booleans or strings.
Returns:The predicted class label for each input sample.
Return type:list
labels()[source]

The class labels used by this classifier.

Return type:list
prob_classify_many(featuresets)[source]

Compute per-class probabilities for a batch of samples.

Parameters:featuresets – An iterable over featuresets, each a dict mapping strings to either numbers, booleans or strings.
Return type:list of ProbDistI
train(labeled_featuresets)[source]

Train (fit) the scikit-learn estimator.

Parameters:labeled_featuresets – A list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings.
unicode_repr()

Return repr(self).

nltk.classify.senna module

A general interface to the SENNA pipeline that supports any of the operations specified in SUPPORTED_OPERATIONS.

Applying multiple operations at once has a speed advantage. For example, Senna will automatically determine POS tags while extracting named entities, so applying both operations costs only the time of extracting the named entities.

The SENNA pipeline has a fixed maximum size for the sentences that it can read. By default it is 1024 tokens per sentence. If you have larger sentences, consider changing the MAX_SENTENCE_SIZE value in SENNA_main.c and rebuilding your system-specific binary; otherwise misalignment errors could be introduced.

The input is:
  • the path to the directory that contains the SENNA executables. If the path is incorrect, Senna will automatically search for the executable file specified in the SENNA environment variable.
  • a list of the operations to be performed.
  • (optionally) the encoding of the input data (default: utf-8).

Note: Unit tests for this module can be found in test/unit/test_senna.py

>>> from __future__ import unicode_literals
>>> from nltk.classify import Senna
>>> pipeline = Senna('/usr/share/senna-v3.0', ['pos', 'chk', 'ner'])
>>> sent = 'Dusseldorf is an international business center'.split()
>>> [(token['word'], token['chk'], token['ner'], token['pos']) for token in pipeline.tag(sent)] 
[('Dusseldorf', 'B-NP', 'B-LOC', 'NNP'), ('is', 'B-VP', 'O', 'VBZ'), ('an', 'B-NP', 'O', 'DT'),
('international', 'I-NP', 'O', 'JJ'), ('business', 'I-NP', 'O', 'NN'), ('center', 'I-NP', 'O', 'NN')]
class nltk.classify.senna.Senna(senna_path, operations, encoding='utf-8')[source]

Bases: nltk.tag.api.TaggerI

SUPPORTED_OPERATIONS = ['pos', 'chk', 'ner']
executable(base_path)[source]

Determines the system-specific binary that should be used in the pipeline. If the system is not known, the default senna binary will be used.

tag(tokens)[source]

Applies the specified operation(s) to a list of tokens.

tag_sents(sentences)[source]

Applies the tag method over a list of sentences. This method will return a list of dictionaries. Every dictionary will contain a word with its calculated annotations/tags.

unicode_repr

Return repr(self).

nltk.classify.senna.setup_module(module)[source]

nltk.classify.svm module

nltk.classify.svm was deprecated. For classification based on support vector machines (SVMs), use nltk.classify.scikitlearn (or scikit-learn directly).

class nltk.classify.svm.SvmClassifier(*args, **kwargs)[source]

Bases: object

nltk.classify.tadm module

nltk.classify.tadm.call_tadm(args)[source]

Call the tadm binary with the given arguments.

nltk.classify.tadm.config_tadm(bin=None)[source]
nltk.classify.tadm.encoding_demo()[source]
nltk.classify.tadm.names_demo()[source]
nltk.classify.tadm.parse_tadm_weights(paramfile)[source]

Given the stdout output generated by tadm when training a model, return a numpy array containing the corresponding weight vector.

nltk.classify.tadm.write_tadm_file(train_toks, encoding, stream)[source]

Generate an input file for tadm based on the given corpus of classified tokens.

Parameters:
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
  • encoding (TadmEventMaxentFeatureEncoding) – A feature encoding, used to convert featuresets into feature vectors.
  • stream (stream) – The stream to which the tadm input file should be written.

nltk.classify.textcat module

A module for language identification using the TextCat algorithm. An implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization”.

The algorithm takes advantage of Zipf’s law: it uses n-gram frequencies to build profiles for the known languages and for the text to be identified, and then compares them using a distance measure.

Language n-grams are provided by the “An Crubadan” project. A corpus reader was created separately to read those files.

For details regarding the algorithm, see: http://www.let.rug.nl/~vannoord/TextCat/textcat.pdf

For details about An Crubadan, see: http://borel.slu.edu/crubadan/index.html
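
A usage sketch, assuming the An Crubadan corpus data (and any tokenizer data TextCat relies on) has been downloaded:

>>> from nltk.classify.textcat import TextCat
>>> tc = TextCat()
>>> lang = tc.guess_language('Wie geht es dir heute?')   # returns an ISO 639-3 code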

class nltk.classify.textcat.TextCat[source]

Bases: object

calc_dist(lang, trigram, text_profile)[source]

Calculate the “out-of-place” measure between the text and language profile for a single trigram

fingerprints = {}
guess_language(text)[source]

Find the language with the min distance to the text and return its ISO 639-3 code

lang_dists(text)[source]

Calculate the “out-of-place” measure between the text and all languages

last_distances = {}
profile(text)[source]

Create FreqDist of trigrams within text

remove_punctuation(text)[source]

Get rid of punctuation except apostrophes

nltk.classify.textcat.demo()[source]

nltk.classify.util module

Utility functions and classes for classifiers.

class nltk.classify.util.CutoffChecker(cutoffs)[source]

Bases: object

A helper class that implements cutoff checks based on number of iterations and log likelihood.

Accuracy cutoffs are also implemented, but they’re almost never a good idea to use.

check(classifier, train_toks)[source]
nltk.classify.util.accuracy(classifier, gold)[source]
nltk.classify.util.apply_features(feature_func, toks, labeled=None)[source]

Use the LazyMap class to construct a lazy list-like object that is analogous to map(feature_func, toks). In particular, if labeled=False, then the returned list-like object’s values are equal to:

[feature_func(tok) for tok in toks]

If labeled=True, then the returned list-like object’s values are equal to:

[(feature_func(tok), label) for (tok, label) in toks]

The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).

Parameters:
  • feature_func – The function that will be applied to each token. It should return a featureset – i.e., a dict mapping feature names to feature values.
  • toks – The list of tokens to which feature_func should be applied. If labeled=False, then the list elements will be passed directly to feature_func(). If labeled=True, then the list elements should be tuples (tok, label), and tok will be passed to feature_func().
  • labeled – If true, then toks contains labeled tokens – i.e., tuples of the form (tok, label). (Default: auto-detect based on types.)
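
For example (toy feature detector and labeled tokens):

>>> from nltk.classify.util import apply_features
>>> def feats(word):
...     return {'last_letter': word[-1]}
>>> toks = [('Alice', 'female'), ('Bob', 'male')]
>>> lazy = apply_features(feats, toks, labeled=True)
>>> lazy[0]
({'last_letter': 'e'}, 'female')
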
nltk.classify.util.attested_labels(tokens)[source]
Returns:A list of all labels that are attested in the given list of tokens.
Return type:list of (immutable)
Parameters:tokens (list) – The list of classified tokens from which to extract labels. A classified token has the form (token, label).
nltk.classify.util.binary_names_demo_features(name)[source]
nltk.classify.util.check_megam_config()[source]

Checks whether the MEGAM binary is configured.

nltk.classify.util.log_likelihood(classifier, gold)[source]
nltk.classify.util.names_demo(trainer, features=<function names_demo_features>)[source]
nltk.classify.util.names_demo_features(name)[source]
nltk.classify.util.partial_names_demo(trainer, features=<function names_demo_features>)[source]
nltk.classify.util.wsd_demo(trainer, word, features, n=1000)[source]

nltk.classify.weka module

Classifiers that make use of the external ‘Weka’ package.

class nltk.classify.weka.ARFF_Formatter(labels, features)[source]

Bases: object

Converts featuresets and labeled featuresets to ARFF-formatted strings, appropriate for input into Weka.

Features and classes can be specified manually in the constructor, or may be determined from data using from_train.
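
A usage sketch (toy featuresets; the exact ARFF output is omitted):

>>> from nltk.classify.weka import ARFF_Formatter
>>> train = [({'height': 5.0, 'name': 'ann'}, 'yes'),
...          ({'height': 6.1, 'name': 'bob'}, 'no')]
>>> formatter = ARFF_Formatter.from_train(train)
>>> arff_string = formatter.format(train)   # header section followed by data section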

data_section(tokens, labeled=None)[source]

Returns the ARFF data section for the given data.

Parameters:
  • tokens – a list of featuresets (dicts) or labelled featuresets which are tuples (featureset, label).
  • labeled – Indicates whether the given tokens are labeled or not. If None, then the tokens will be assumed to be labeled if the first token’s value is a tuple or list.
format(tokens)[source]

Returns a string representation of ARFF output for the given data.

static from_train(tokens)[source]

Constructs an ARFF_Formatter instance with class labels and feature types determined from the given data. Handles boolean, numeric and string (note: not nominal) types.

header_section()[source]

Returns an ARFF header as a string.

labels()[source]

Returns the list of classes.

write(outfile, tokens)[source]

Writes ARFF data to a file for the given data.

class nltk.classify.weka.WekaClassifier(formatter, model_filename)[source]

Bases: nltk.classify.api.ClassifierI

classify_many(featuresets)[source]

Apply self.classify() to each element of featuresets. I.e.:

return [self.classify(fs) for fs in featuresets]
Return type:list(label)
parse_weka_distribution(s)[source]
parse_weka_output(lines)[source]
prob_classify_many(featuresets)[source]

Apply self.prob_classify() to each element of featuresets. I.e.:

return [self.prob_classify(fs) for fs in featuresets]
Return type:list(ProbDistI)
classmethod train(model_filename, featuresets, classifier='naivebayes', options=[], quiet=True)[source]
nltk.classify.weka.config_weka(classpath=None)[source]

Module contents

Classes and interfaces for labeling tokens with category labels (or “class labels”). Typically, labels are represented with strings (such as 'health' or 'sports'). Classifiers can be used to perform a wide range of classification tasks. For example, classifiers can be used…

  • to classify documents by topic
  • to classify ambiguous words by which word sense is intended
  • to classify acoustic signals by which phoneme they represent
  • to classify sentences by their author

Features

In order to decide which category label is appropriate for a given token, classifiers examine one or more ‘features’ of the token. These “features” are typically chosen by hand, and indicate which aspects of the token are relevant to the classification decision. For example, a document classifier might use a separate feature for each word, recording how often that word occurred in the document.

Featuresets

The features describing a token are encoded using a “featureset”, which is a dictionary that maps from “feature names” to “feature values”. Feature names are unique strings that indicate what aspect of the token is encoded by the feature. Examples include 'prevword', for a feature whose value is the previous word; and 'contains-word(library)' for a feature that is true when a document contains the word 'library'. Feature values are typically booleans, numbers, or strings, depending on which feature they describe.

Featuresets are typically constructed using a “feature detector” (also known as a “feature extractor”). A feature detector is a function that takes a token (and sometimes information about its context) as its input, and returns a featureset describing that token. For example, the following feature detector converts a document (stored as a list of words) to a featureset describing the set of words included in the document:

>>> # Define a feature detector function.
>>> def document_features(document):
...     return dict([('contains-word(%s)' % w, True) for w in document])

Feature detectors are typically applied to each token before it is fed to the classifier:

>>> # Classify each Gutenberg document.
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids(): 
...     doc = gutenberg.words(fileid) 
...     print(fileid, classifier.classify(document_features(doc)))

The parameters that a feature detector expects will vary, depending on the task and the needs of the feature detector. For example, a feature detector for word sense disambiguation (WSD) might take as its input a sentence, and the index of a word that should be classified, and return a featureset for that word. The following feature detector for WSD includes features describing the left and right contexts of the target word:

>>> def wsd_features(sentence, index):
...     featureset = {}
...     for i in range(max(0, index-3), index):
...         featureset['left-context(%s)' % sentence[i]] = True
...     for i in range(index, min(index+3, len(sentence))):
...         featureset['right-context(%s)' % sentence[i]] = True
...     return featureset

Training Classifiers

Most classifiers are built by training them on a list of hand-labeled examples, known as the “training set”. Training sets are represented as lists of (featuredict, label) tuples.
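
For example, a tiny training set and a classifier trained from it might look like this (toy data for illustration; any of the classifiers above accept this format):

>>> train_set = [
...     ({'contains(goal)': True, 'contains(score)': True}, 'sports'),
...     ({'contains(senate)': True, 'contains(vote)': True}, 'politics'),
... ]
>>> from nltk.classify import NaiveBayesClassifier
>>> classifier = NaiveBayesClassifier.train(train_set)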