nltk.classify package¶
Submodules¶
nltk.classify.api module¶
Interfaces for labeling tokens with category labels (or “class labels”).
ClassifierI is a standard interface for “single-category classification”, in which the set of categories is known, the number of categories is finite, and each text belongs to exactly one category.
MultiClassifierI is a standard interface for “multi-category classification”, which is like single-category classification except that each text belongs to zero or more categories.
-
class
nltk.classify.api.
ClassifierI
[source]¶ Bases:
object
A processing interface for labeling tokens with a single category label (or “class”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the classifier chooses from must be fixed and finite.
- Subclasses must define:
  - labels()
  - either classify() or classify_many() (or both)
- Subclasses may define:
  - either prob_classify() or prob_classify_many() (or both)
-
classify
(featureset)[source]¶ Returns: the most appropriate label for the given featureset. Return type: label
-
classify_many
(featuresets)[source]¶ Apply self.classify() to each element of featuresets, i.e.: return [self.classify(fs) for fs in featuresets]. Return type: list(label)
-
labels
()[source]¶ Returns: the list of category labels used by this classifier. Return type: list of (immutable)
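For concreteness, here is a minimal sketch of a custom ClassifierI subclass (the MajorityLabelClassifier below is invented for illustration and is not part of NLTK): it defines labels() and classify(), and inherits a working classify_many() from the interface.
>>> from nltk.classify.api import ClassifierI
>>> class MajorityLabelClassifier(ClassifierI):
...     """Toy classifier that returns the same label for every featureset."""
...     def __init__(self, label, all_labels):
...         self._label = label
...         self._labels = all_labels
...     def labels(self):
...         return self._labels
...     def classify(self, featureset):
...         return self._label
>>> clf = MajorityLabelClassifier('pos', ['pos', 'neg'])
>>> clf.classify({'contains(good)': True})
'pos'
>>> clf.classify_many([{'a': 1}, {'b': 2}])   # inherited default: calls classify() on each element
['pos', 'pos']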
-
class
nltk.classify.api.
MultiClassifierI
[source]¶ Bases:
object
A processing interface for labeling tokens with zero or more category labels (or “labels”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the multi-classifier chooses from must be fixed and finite.
- Subclasses must define:
  - labels()
  - either classify() or classify_many() (or both)
- Subclasses may define:
  - either prob_classify() or prob_classify_many() (or both)
-
classify
(featureset)[source]¶ Returns: the most appropriate set of labels for the given featureset. Return type: set(label)
-
classify_many
(featuresets)[source]¶ Apply self.classify() to each element of featuresets, i.e.: return [self.classify(fs) for fs in featuresets]. Return type: list(set(label))
-
labels
()[source]¶ Returns: the list of category labels used by this classifier. Return type: list of (immutable)
nltk.classify.decisiontree module¶
A classifier model that decides which label to assign to a token on the basis of a tree structure, where branches correspond to conditions on feature values, and leaves correspond to label assignments.
-
class
nltk.classify.decisiontree.
DecisionTreeClassifier
(label, feature_name=None, decisions=None, default=None)[source]¶ Bases:
nltk.classify.api.ClassifierI
-
static
best_binary_stump
(feature_names, labeled_featuresets, feature_values, verbose=False)[source]¶
-
classify
(featureset)[source]¶ Returns: the most appropriate label for the given featureset. Return type: label
-
labels
()[source]¶ Returns: the list of category labels used by this classifier. Return type: list of (immutable)
-
pretty_format
(width=70, prefix='', depth=4)[source]¶ Return a string containing a pretty-printed version of this decision tree. Each line in this string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the decision tree.
-
pseudocode
(prefix='', depth=4)[source]¶ Return a string representation of this decision tree that expresses the decisions it makes as a nested set of pseudocode if statements.
-
refine
(labeled_featuresets, entropy_cutoff, depth_cutoff, support_cutoff, binary=False, feature_values=None, verbose=False)[source]¶
-
static
train
(labeled_featuresets, entropy_cutoff=0.05, depth_cutoff=100, support_cutoff=10, binary=False, feature_values=None, verbose=False)[source]¶ Parameters: binary – If true, then treat all feature/value pairs as individual binary features, rather than using a single n-way branch for each feature.
-
unicode_repr
¶ Return repr(self).
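A small, illustrative training sketch (the toy weather featuresets below are invented): train() builds the tree from (featureset, label) pairs and classify() follows the branches. The loosened entropy_cutoff and support_cutoff are only needed because the toy corpus is so small.
>>> from nltk.classify.decisiontree import DecisionTreeClassifier
>>> train_data = [({'outlook': 'sunny', 'windy': False}, 'no'),
...               ({'outlook': 'sunny', 'windy': True}, 'no'),
...               ({'outlook': 'overcast', 'windy': False}, 'yes'),
...               ({'outlook': 'rainy', 'windy': False}, 'yes'),
...               ({'outlook': 'rainy', 'windy': True}, 'no')]
>>> tree = DecisionTreeClassifier.train(train_data, entropy_cutoff=0.0, support_cutoff=0)
>>> tree.classify({'outlook': 'overcast', 'windy': False})
'yes'
>>> text = tree.pretty_format()   # indented string, one line per tree node or leaf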
nltk.classify.maxent module¶
A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy. A probability distribution is “empirically consistent” with a set of training data if the estimated frequency with which a class and a feature vector value co-occur is equal to the actual frequency in the data.
Terminology: ‘feature’¶
The term feature is usually used to refer to some property of an
unlabeled token. For example, when performing word sense
disambiguation, we might define a 'prevword'
feature whose value is
the word preceding the target word. However, in the context of
maxent modeling, the term feature is typically used to refer to a
property of a “labeled” token. In order to prevent confusion, we
will introduce two distinct terms to disambiguate these two different
concepts:
- An “input-feature” is a property of an unlabeled token.
- A “joint-feature” is a property of a labeled token.
In the rest of the nltk.classify
module, the term “features” is
used to refer to what we will call “input-features” in this module.
In literature that describes and discusses maximum entropy models, input-features are typically called “contexts”, and joint-features are simply referred to as “features”.
Converting Input-Features to Joint-Features¶
In maximum entropy models, joint-features are required to have numeric values. Typically, each input-feature input_feat is mapped to a set of joint-features of the form:

joint_feat(token, label) = { 1 if input_feat(token) == feat_val and (label == some_label)
                           { 0 otherwise

for all values of feat_val and some_label. This mapping is performed by classes that implement the MaxentFeatureEncodingI interface.
-
class
nltk.classify.maxent.
BinaryMaxentFeatureEncoding
(labels, mapping, unseen_features=False, alwayson_features=False)[source]¶ Bases:
nltk.classify.maxent.MaxentFeatureEncodingI
A feature encoding that generates vectors containing binary joint-features of the form:

joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                    { 0 otherwise

where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method. This method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

The unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname]) and (l == label)
                    { 0 otherwise

where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname] == fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

joint_feat(fs, l) = { 1 if (l == label)
                    { 0 otherwise
These always-on features allow the maxent model to directly model the prior probabilities of each label.
-
describe
(f_id)[source]¶ Returns: A string describing the value of the joint-feature whose index in the generated feature vectors is fid
.Return type: str
-
encode
(featureset, label)[source]¶ Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of
(index, value)
tuples, specifying the value of each non-zero joint-feature.Return type: list(tuple(int, int))
-
labels
()[source]¶ Returns: A list of the “known labels” – i.e., all labels l
such thatself.encode(fs,l)
can be a nonzero joint-feature vector for some value offs
.Return type: list
-
length
()[source]¶ Returns: The size of the fixed-length joint-feature vectors that are generated by this encoding. Return type: int
-
classmethod
train
(train_toks, count_cutoff=0, labels=None, **options)[source]¶ Construct and return new feature encoding, based on a given training corpus
train_toks
. See the class descriptionBinaryMaxentFeatureEncoding
for a description of the joint-features that will be included in this encoding.Parameters: - train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
- count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs (i.e., takes the value 1) fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding. - labels (list) – A list of labels that should be used by the
classifier. If not specified, then the set of labels
attested in
train_toks
will be used. - options – Extra parameters for the constructor, such as
unseen_features
andalwayson_features
.
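A brief, hypothetical sketch of building and using an encoding (the two-token corpus is invented): train() collects the attested (fname, fval, label) combinations, and encode() maps a (featureset, label) pair to a sparse list of (index, value) joint-features.
>>> from nltk.classify.maxent import BinaryMaxentFeatureEncoding
>>> toks = [({'outlook': 'sunny'}, 'no'), ({'outlook': 'rainy'}, 'yes')]
>>> enc = BinaryMaxentFeatureEncoding.train(toks, alwayson_features=True)
>>> enc.length()   # 2 attested (fname, fval, label) combinations plus 2 always-on features
4
>>> vec = enc.encode({'outlook': 'sunny'}, 'no')   # e.g. [(0, 1), (2, 1)]; exact indices depend on insertion order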
-
-
nltk.classify.maxent.
ConditionalExponentialClassifier
¶ Alias for MaxentClassifier.
-
class
nltk.classify.maxent.
FunctionBackedMaxentFeatureEncoding
(func, length, labels)[source]¶ Bases:
nltk.classify.maxent.MaxentFeatureEncodingI
A feature encoding that calls a user-supplied function to map a given featureset/label pair to a sparse joint-feature vector.
-
describe
(fid)[source]¶ Returns: A string describing the value of the joint-feature whose index in the generated feature vectors is fid
.Return type: str
-
encode
(featureset, label)[source]¶ Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of
(index, value)
tuples, specifying the value of each non-zero joint-feature.Return type: list(tuple(int, int))
-
-
class
nltk.classify.maxent.
GISEncoding
(labels, mapping, unseen_features=False, alwayson_features=False, C=None)[source]¶ Bases:
nltk.classify.maxent.BinaryMaxentFeatureEncoding
A binary feature encoding which adds one new joint-feature to the joint-features defined by
BinaryMaxentFeatureEncoding
: a correction feature, whose value is chosen to ensure that the sparse vector always sums to a constant non-negative number. This new feature is used to ensure two preconditions for the GIS training algorithm:- At least one feature vector index must be nonzero for every token.
- The feature vector must sum to a constant non-negative number for every token.
-
C
¶ The non-negative constant that all encoded feature vectors will sum to.
-
describe
(f_id)[source]¶ Returns: A string describing the value of the joint-feature whose index in the generated feature vectors is fid
.Return type: str
-
class
nltk.classify.maxent.
MaxentClassifier
(encoding, weights, logarithmic=True)[source]¶ Bases:
nltk.classify.api.ClassifierI
A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each
(featureset, label)
pair to a vector. The probability of each label is then computed using the following equation:

prob(label|fs) = dotprod(weights, encode(fs,label)) / sum(dotprod(weights, encode(fs,l)) for l in labels)

where dotprod is the dot product: dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
-
ALGORITHMS
= ['GIS', 'IIS', 'MEGAM', 'TADM']¶ A list of the algorithm names that are accepted for the
train()
method’s algorithm
parameter.
-
classify
(featureset)[source]¶ Returns: the most appropriate label for the given featureset. Return type: label
-
explain
(featureset, columns=4)[source]¶ Print a table showing the effect of each of the features in the given feature set, and how they combine to determine the probabilities of each label for that featureset.
-
labels
()[source]¶ Returns: the list of category labels used by this classifier. Return type: list of (immutable)
-
most_informative_features
(n=10)[source]¶ Generates the ranked list of informative features from most to least.
-
prob_classify
(featureset)[source]¶ Returns: a probability distribution over labels for the given featureset. Return type: ProbDistI
-
set_weights
(new_weights)[source]¶ Set the feature weight vector for this classifier. :param new_weights: The new feature weight vector. :type new_weights: list of float
-
show_most_informative_features
(n=10, show='all')[source]¶ Parameters: - show (str) – all, neg, or pos (for negative-only or positive-only)
- n (int) – The number of top features to show
-
classmethod
train
(train_toks, algorithm=None, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **cutoffs)[source]¶ Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.
Returns: The new maxent classifier. Return type: MaxentClassifier
Parameters: - train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.
- algorithm (str) –
A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available.
- Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')
- External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')
The default algorithm is 'IIS'.
- trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.
- encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets
into feature vectors. If none is specified, then a
BinaryMaxentFeatureEncoding
will be built based on the features that are attested in the training corpus. - labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.
- gaussian_prior_sigma – The sigma value for a gaussian
prior on model weights. Currently, this is supported by
megam
. For other algorithms, its value is ignored. - cutoffs –
Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)
max_iter=v: Terminate after v iterations.
min_ll=v: Terminate after the negative average log-likelihood drops under v.
min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.
-
unicode_repr
()¶ Return repr(self).
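A brief, illustrative training sketch (the four-token train_toks corpus below is invented; real corpora are far larger): train() builds an encoding from the attested features and fits the weights with the chosen algorithm; max_iter is one of the cutoff keywords described above.
>>> from nltk.classify.maxent import MaxentClassifier
>>> train_toks = [({'contains(win)': True}, 'sports'),
...               ({'contains(game)': True}, 'sports'),
...               ({'contains(vote)': True}, 'politics'),
...               ({'contains(party)': True}, 'politics')]
>>> clf = MaxentClassifier.train(train_toks, algorithm='iis', trace=0, max_iter=10)
>>> clf.classify({'contains(win)': True})
'sports'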
-
-
class
nltk.classify.maxent.
MaxentFeatureEncodingI
[source]¶ Bases:
object
A mapping that converts a set of input-feature values to a vector of joint-feature values, given a label. This conversion is necessary to translate featuresets into a format that can be used by maximum entropy models.
The set of joint-features used by a given encoding is fixed, and each index in the generated joint-feature vectors corresponds to a single joint-feature. The length of the generated joint-feature vectors is therefore constant (for a given encoding).
Because the joint-feature vectors generated by
MaxentFeatureEncodingI
are typically very sparse, they are represented as a list of(index, value)
tuples, specifying the value of each non-zero joint-feature.Feature encodings are generally created using the
train()
method, which generates an appropriate encoding based on the input-feature values and labels that are present in a given corpus.-
describe
(fid)[source]¶ Returns: A string describing the value of the joint-feature whose index in the generated feature vectors is fid
.Return type: str
-
encode
(featureset, label)[source]¶ Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of
(index, value)
tuples, specifying the value of each non-zero joint-feature.Return type: list(tuple(int, int))
-
labels
()[source]¶ Returns: A list of the “known labels” – i.e., all labels l
such thatself.encode(fs,l)
can be a nonzero joint-feature vector for some value offs
.Return type: list
-
length
()[source]¶ Returns: The size of the fixed-length joint-feature vectors that are generated by this encoding. Return type: int
-
train
(train_toks)[source]¶ Construct and return new feature encoding, based on a given training corpus
train_toks
.Parameters: train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
-
-
class
nltk.classify.maxent.
TadmEventMaxentFeatureEncoding
(labels, mapping, unseen_features=False, alwayson_features=False)[source]¶ Bases:
nltk.classify.maxent.BinaryMaxentFeatureEncoding
-
describe
(fid)[source]¶ Returns: A string describing the value of the joint-feature whose index in the generated feature vectors is fid
.Return type: str
-
encode
(featureset, label)[source]¶ Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of
(index, value)
tuples, specifying the value of each non-zero joint-feature.Return type: list(tuple(int, int))
-
labels
()[source]¶ Returns: A list of the “known labels” – i.e., all labels l
such thatself.encode(fs,l)
can be a nonzero joint-feature vector for some value offs
.Return type: list
-
length
()[source]¶ Returns: The size of the fixed-length joint-feature vectors that are generated by this encoding. Return type: int
-
classmethod
train
(train_toks, count_cutoff=0, labels=None, **options)[source]¶ Construct and return new feature encoding, based on a given training corpus
train_toks
. See the class descriptionBinaryMaxentFeatureEncoding
for a description of the joint-features that will be included in this encoding.Parameters: - train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
- count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs (i.e., takes the value 1) fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding. - labels (list) – A list of labels that should be used by the
classifier. If not specified, then the set of labels
attested in
train_toks
will be used. - options – Extra parameters for the constructor, such as
unseen_features
andalwayson_features
.
-
-
class
nltk.classify.maxent.
TadmMaxentClassifier
(encoding, weights, logarithmic=True)[source]¶ Bases:
nltk.classify.maxent.MaxentClassifier
-
classmethod
train
(train_toks, **kwargs)[source]¶ Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.
Returns: The new maxent classifier. Return type: MaxentClassifier
Parameters: - train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.
- algorithm (str) –
A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available.
- Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')
- External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')
The default algorithm is 'IIS'.
- trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.
- encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets
into feature vectors. If none is specified, then a
BinaryMaxentFeatureEncoding
will be built based on the features that are attested in the training corpus. - labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.
- gaussian_prior_sigma – The sigma value for a gaussian
prior on model weights. Currently, this is supported by
megam
. For other algorithms, its value is ignored. - cutoffs –
Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)
max_iter=v: Terminate after v iterations.
min_ll=v: Terminate after the negative average log-likelihood drops under v.
min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.
-
class
nltk.classify.maxent.
TypedMaxentFeatureEncoding
(labels, mapping, unseen_features=False, alwayson_features=False)[source]¶ Bases:
nltk.classify.maxent.MaxentFeatureEncodingI
A feature encoding that generates vectors containing integer, float and binary joint-features of the form:

Binary (for string and boolean features):

joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                    { 0 otherwise

Value (for integer and float features):

joint_feat(fs, l) = { fval        if type(fs[fname]) in (int, float) and (l == label)
                    { not encoded otherwise

where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method.

For string and boolean features [type(fval) not in (int, float)] this method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

For integer and float features [type(fval) in (int, float)] this method will create one feature for each combination of fname and label that occurs at least once in the training corpus.

For binary features the unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname]) and (l == label)
                    { 0 otherwise

where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname] == fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

joint_feat(fs, l) = { 1 if (l == label)
                    { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.
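A brief, hypothetical sketch (the featuresets are invented) showing how numeric and string input-features are encoded differently:
>>> from nltk.classify.maxent import TypedMaxentFeatureEncoding
>>> toks = [({'word_len': 7, 'suffix': 'ing'}, 'verb'),
...         ({'word_len': 3, 'suffix': 'dog'}, 'noun')]
>>> enc = TypedMaxentFeatureEncoding.train(toks)
>>> vec = enc.encode({'word_len': 7, 'suffix': 'ing'}, 'verb')
>>> # 'word_len' is numeric, so its joint-feature carries the value 7;
>>> # 'suffix' is a string, so it becomes a binary joint-feature with value 1.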
-
describe
(f_id)[source]¶ Returns: A string describing the value of the joint-feature whose index in the generated feature vectors is fid
.Return type: str
-
encode
(featureset, label)[source]¶ Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of
(index, value)
tuples, specifying the value of each non-zero joint-feature.Return type: list(tuple(int, int))
-
labels
()[source]¶ Returns: A list of the “known labels” – i.e., all labels l
such thatself.encode(fs,l)
can be a nonzero joint-feature vector for some value offs
.Return type: list
-
length
()[source]¶ Returns: The size of the fixed-length joint-feature vectors that are generated by this encoding. Return type: int
-
classmethod
train
(train_toks, count_cutoff=0, labels=None, **options)[source]¶ Construct and return new feature encoding, based on a given training corpus
train_toks
. See the class descriptionTypedMaxentFeatureEncoding
for a description of the joint-features that will be included in this encoding. Note: recognized feature value types are int and float; other types are interpreted as regular binary features.
Parameters: - train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
- count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs (i.e., takes the value 1) fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding. - labels (list) – A list of labels that should be used by the
classifier. If not specified, then the set of labels
attested in
train_toks
will be used. - options – Extra parameters for the constructor, such as
unseen_features
andalwayson_features
.
-
-
nltk.classify.maxent.
calculate_deltas
(train_toks, classifier, unattested, ffreq_empirical, nfmap, nfarray, nftranspose, encoding)[source]¶ Calculate the update values for the classifier weights for this iteration of IIS. These update weights are the value of
delta
that solves the equation:ffreq_empirical[i] = SUM[fs,l] (classifier.prob_classify(fs).prob(l) * feature_vector(fs,l)[i] * exp(delta[i] * nf(feature_vector(fs,l))))
Where:
- (fs,l) is a (featureset, label) tuple from train_toks
- feature_vector(fs,l) = encoding.encode(fs,l)
- nf(vector) = sum([val for (id,val) in vector])
This method uses Newton’s method to solve this equation for delta[i]. In particular, it starts with a guess of delta[i] = 1 and iteratively updates delta with:

delta[i] -= (ffreq_empirical[i] - sum1[i]) / (-sum2[i])

until convergence, where sum1 and sum2 are defined as:

sum1[i](delta) = SUM[fs,l] f[i](fs,l,delta)
sum2[i](delta) = SUM[fs,l] (f[i](fs,l,delta) * nf(feature_vector(fs,l)))
f[i](fs,l,delta) = classifier.prob_classify(fs).prob(l) * feature_vector(fs,l)[i] * exp(delta[i] * nf(feature_vector(fs,l)))

Note that sum1 and sum2 depend on delta, so they need to be re-computed each iteration.

The variables nfmap, nfarray, and nftranspose are used to generate a dense encoding for nf(ltext). This allows calculate_deltas to calculate sum1 and sum2 using matrices, which yields a significant performance improvement.

Parameters: - train_toks (list(tuple(dict, str))) – The set of training tokens.
- classifier (ClassifierI) – The current classifier.
- ffreq_empirical (sequence of float) – An array containing the empirical frequency for each feature. The ith element of this array is the empirical frequency for feature i.
- unattested (sequence of int) – An array that is 1 for features that are not attested in the training data, and 0 for features that are attested. In other words, unattested[i]==1 iff ffreq_empirical[i]==0
. - nfmap (dict(int -> int)) – A map that can be used to compress
nf
to a dense vector. - nfarray (array(float)) – An array that can be used to uncompress
nf
from a dense vector. - nftranspose (array(float)) – The transpose of
nfarray
-
nltk.classify.maxent.
calculate_nfmap
(train_toks, encoding)[source]¶ Construct a map that can be used to compress
nf
(which is typically sparse).nf(feature_vector) is the sum of the feature values for feature_vector.
This represents the number of features that are active for a given labeled text. This method finds all values of nf(t) that are attested for at least one token in the given list of training tokens; and constructs a dictionary mapping these attested values to a continuous range 0…N. For example, if the only values of nf() that were attested were 3, 5, and 7, then
_nfmap
might return the dictionary{3:0, 5:1, 7:2}
.Returns: A map that can be used to compress nf
to a dense vector.Return type: dict(int -> int)
-
nltk.classify.maxent.
train_maxent_classifier_with_gis
(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]¶ Train a new ConditionalExponentialClassifier, using the given training samples and the Generalized Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks. See train_maxent_classifier() for parameter descriptions.
-
nltk.classify.maxent.
train_maxent_classifier_with_iis
(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]¶ Train a new ConditionalExponentialClassifier, using the given training samples and the Improved Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks. See train_maxent_classifier() for parameter descriptions.
-
nltk.classify.maxent.
train_maxent_classifier_with_megam
(train_toks, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **kwargs)[source]¶ Train a new ConditionalExponentialClassifier, using the given training samples and the external megam library. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks. See train_maxent_classifier() for parameter descriptions. See also: nltk.classify.megam.
nltk.classify.megam module¶
A set of functions used to interface with the external megam maxent
optimization package. Before megam can be used, you should tell NLTK where it
can find the megam binary, using the config_megam()
function. Typical
usage:
>>> from nltk.classify import megam
>>> megam.config_megam() # pass path to megam if not found in PATH
[Found megam: ...]
Use with MaxentClassifier. For example (see the MaxentClassifier documentation for details):
nltk.classify.MaxentClassifier.train(corpus, 'megam')
-
nltk.classify.megam.
config_megam
(bin=None)[source]¶ Configure NLTK’s interface to the
megam
maxent optimization package.Parameters: bin (str) – The full path to the megam
binary. If not specified, then nltk will search the system for amegam
binary; and if one is not found, it will raise aLookupError
exception.
-
nltk.classify.megam.
parse_megam_weights
(s, features_count, explicit=True)[source]¶ Given the stdout output generated by
megam
when training a model, return anumpy
array containing the corresponding weight vector. This function does not currently handle bias features.
-
nltk.classify.megam.
write_megam_file
(train_toks, encoding, stream, bernoulli=True, explicit=True)[source]¶ Generate an input file for
megam
based on the given corpus of classified tokens.Parameters: - train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
- encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. May optionally implement a cost() method in order to assign different costs to different class predictions.
- stream (stream) – The stream to which the megam input file should be written.
- bernoulli – If true, then use the ‘bernoulli’ format. I.e.,
all joint features have binary values, and are listed iff they
are true. Otherwise, list feature values explicitly. If
bernoulli=False
, then you must callmegam
with the-fvals
option. - explicit – If true, then use the ‘explicit’ format. I.e.,
list the features that would fire for any of the possible
labels, for each token. If
explicit=True
, then you must callmegam
with the-explicit
option.
nltk.classify.naivebayes module¶
A classifier based on the Naive Bayes algorithm. In order to find the probability for a label, this algorithm first uses the Bayes rule to express P(label|features) in terms of P(label) and P(features|label):

P(label|features) = P(features|label) * P(label) / P(features)

The algorithm then makes the ‘naive’ assumption that all features are independent, given the label:

P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / P(features)

Rather than computing P(features) explicitly, the algorithm just calculates the numerator for each label, and normalizes them so they sum to one:

P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / SUM[l]( P(l) * P(f1|l) * ... * P(fn|l) )
-
class
nltk.classify.naivebayes.
NaiveBayesClassifier
(label_probdist, feature_probdist)[source]¶ Bases:
nltk.classify.api.ClassifierI
A Naive Bayes classifier. Naive Bayes classifiers are parameterized by two probability distributions:
- P(label) gives the probability that an input will receive each label, given no information about the input’s features.
- P(fname=fval|label) gives the probability that a given feature (fname) will receive a given value (fval), given the label (label).
If the classifier encounters an input with a feature that has never been seen with any label, then rather than assigning a probability of 0 to all labels, it will ignore that feature.
The feature value ‘None’ is reserved for unseen feature values; you generally should not use ‘None’ as a feature value for one of your own features.
-
classify
(featureset)[source]¶ Returns: the most appropriate label for the given featureset. Return type: label
-
labels
()[source]¶ Returns: the list of category labels used by this classifier. Return type: list of (immutable)
-
most_informative_features
(n=100)[source]¶ Return a list of the ‘most informative’ features used by this classifier. For the purpose of this function, the informativeness of a feature
(fname,fval)
is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:max[ P(fname=fval|label1) / P(fname=fval|label2) ]
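A minimal usage sketch (the toy last-letter gender featuresets below are invented):
>>> from nltk.classify import NaiveBayesClassifier
>>> train_set = [({'last_letter': 'a'}, 'female'), ({'last_letter': 'k'}, 'male'),
...              ({'last_letter': 'a'}, 'female'), ({'last_letter': 'o'}, 'male')]
>>> classifier = NaiveBayesClassifier.train(train_set)
>>> classifier.classify({'last_letter': 'a'})
'female'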
nltk.classify.positivenaivebayes module¶
A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets. In other words, assume we want to build a classifier that assigns each example to one of two complementary classes (e.g., male names and female names). If we have a training set with labeled examples for both classes, we can use a standard Naive Bayes Classifier. However, consider the case when we only have labeled examples for one of the classes, and other, unlabeled, examples. Then, assuming a prior distribution on the two labels, we can use the unlabeled set to estimate the frequencies of the various features.
Let the two possible labels be 1 and 0, and let’s say we only have examples labeled 1 and unlabeled examples. We are also given an estimate of P(1).
We compute P(feature|1) exactly as in the standard case.
To compute P(feature|0), we first estimate P(feature) from the unlabeled set (we are assuming that the unlabeled examples are drawn according to the given prior distribution) and then express the conditional probability as:

P(feature|0) = (P(feature) - P(feature|1) * P(1)) / P(0)
Example:
>>> from nltk.classify import PositiveNaiveBayesClassifier
Some sentences about sports:
>>> sports_sentences = [ 'The team dominated the game',
... 'They lost the ball',
... 'The game was intense',
... 'The goalkeeper catched the ball',
... 'The other team controlled the ball' ]
Mixed topics, including sports:
>>> various_sentences = [ 'The President did not comment',
... 'I lost the keys',
... 'The team won the game',
... 'Sara has two kids',
... 'The ball went off the court',
... 'They had the ball for the whole game',
... 'The show is over' ]
The features of a sentence are simply the words it contains:
>>> def features(sentence):
... words = sentence.lower().split()
... return dict(('contains(%s)' % w, True) for w in words)
We use the sports sentences as positive examples, the mixed ones as unlabeled examples:
>>> positive_featuresets = list(map(features, sports_sentences))
>>> unlabeled_featuresets = list(map(features, various_sentences))
>>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
... unlabeled_featuresets)
Is the following sentence about sports?
>>> classifier.classify(features('The cat is on the table'))
False
What about this one?
>>> classifier.classify(features('My team lost the game'))
True
-
class
nltk.classify.positivenaivebayes.
PositiveNaiveBayesClassifier
(label_probdist, feature_probdist)[source]¶ Bases:
nltk.classify.naivebayes.NaiveBayesClassifier
-
static
train
(positive_featuresets, unlabeled_featuresets, positive_prob_prior=0.5, estimator=<class 'nltk.probability.ELEProbDist'>)[source]¶ Parameters: - positive_featuresets – A list of featuresets that are known as positive
examples (i.e., their label is
True
). - unlabeled_featuresets – A list of featuresets whose label is unknown.
- positive_prob_prior – A prior estimate of the probability of the label
True
(default 0.5).
nltk.classify.rte_classify module¶
Simple classifier for RTE corpus.
It calculates the overlap in words and named entities between text and hypothesis, and also whether there are words / named entities in the hypothesis which fail to occur in the text, since this is an indicator that the hypothesis is more informative than (i.e., not entailed by) the text.
TO DO: better named entity classification. TO DO: add lemmatization.
-
class
nltk.classify.rte_classify.
RTEFeatureExtractor
(rtepair, stop=True, use_lemmatize=False)[source]¶ Bases:
object
This builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.
nltk.classify.scikitlearn module¶
scikit-learn (http://scikit-learn.org) is a machine learning library for Python. It supports many classification algorithms, including SVMs, Naive Bayes, logistic regression (MaxEnt) and decision trees.
This package implements a wrapper around scikit-learn classifiers. To use this wrapper, construct a scikit-learn estimator object, then use that to construct a SklearnClassifier. E.g., to wrap a linear SVM with default settings:
>>> from sklearn.svm import LinearSVC
>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> classif = SklearnClassifier(LinearSVC())
A scikit-learn classifier may include preprocessing steps when it’s wrapped in a Pipeline object. The following constructs and wraps a Naive Bayes text classifier with tf-idf weighting and chi-square feature selection to get the best 1000 features:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([('tfidf', TfidfTransformer()),
... ('chi2', SelectKBest(chi2, k=1000)),
... ('nb', MultinomialNB())])
>>> classif = SklearnClassifier(pipeline)
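A short training sketch under invented data (the two train_data featuresets below are made up): train() fits the wrapped estimator on NLTK-style (featureset, label) pairs, after which classify_many() accepts plain featuresets. BernoulliNB is used here only to keep the toy example self-contained.
>>> from sklearn.naive_bayes import BernoulliNB
>>> train_data = [({'contains(ball)': True, 'contains(game)': True}, 'sports'),
...               ({'contains(vote)': True, 'contains(party)': True}, 'politics')]
>>> nb = SklearnClassifier(BernoulliNB())
>>> _ = nb.train(train_data)       # train() fits the wrapped estimator and returns the classifier
>>> nb.classify_many([{'contains(ball)': True}, {'contains(vote)': True}])
['sports', 'politics']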
-
class
nltk.classify.scikitlearn.
SklearnClassifier
(estimator, dtype=<class 'float'>, sparse=True)[source]¶ Bases:
nltk.classify.api.ClassifierI
Wrapper for scikit-learn classifiers.
-
classify_many
(featuresets)[source]¶ Classify a batch of samples.
Parameters: featuresets – An iterable over featuresets, each a dict mapping strings to either numbers, booleans or strings. Returns: The predicted class label for each input sample. Return type: list
-
prob_classify_many
(featuresets)[source]¶ Compute per-class probabilities for a batch of samples.
Parameters: featuresets – An iterable over featuresets, each a dict mapping strings to either numbers, booleans or strings. Return type: list of ProbDistI
-
train
(labeled_featuresets)[source]¶ Train (fit) the scikit-learn estimator.
Parameters: labeled_featuresets – A list of (featureset, label)
where eachfeatureset
is a dict mapping strings to either numbers, booleans or strings.
-
unicode_repr
()¶ Return repr(self).
-
nltk.classify.senna module¶
A general interface to the SENNA pipeline that supports any of the operations specified in SUPPORTED_OPERATIONS.
Applying multiple operations at once has a speed advantage. For example, Senna will automatically determine POS tags if you are extracting named entities, so applying both operations costs only the time of extracting the named entities.
The SENNA pipeline has a fixed maximum size for the sentences it can read. By default it is 1024 tokens per sentence. If you have larger sentences, consider changing the MAX_SENTENCE_SIZE value in SENNA_main.c and rebuilding your system-specific binary; otherwise misalignment errors could be introduced.
The input is:
- the path to the directory that contains the SENNA executables. If the path is incorrect, Senna will automatically search for the executable file specified in the SENNA environment variable.
- the list of operations to be performed.
- (optionally) the encoding of the input data (default: utf-8).
Note: Unit tests for this module can be found in test/unit/test_senna.py
>>> from __future__ import unicode_literals
>>> from nltk.classify import Senna
>>> pipeline = Senna('/usr/share/senna-v3.0', ['pos', 'chk', 'ner'])
>>> sent = 'Dusseldorf is an international business center'.split()
>>> [(token['word'], token['chk'], token['ner'], token['pos']) for token in pipeline.tag(sent)]
[('Dusseldorf', 'B-NP', 'B-LOC', 'NNP'), ('is', 'B-VP', 'O', 'VBZ'), ('an', 'B-NP', 'O', 'DT'),
('international', 'I-NP', 'O', 'JJ'), ('business', 'I-NP', 'O', 'NN'), ('center', 'I-NP', 'O', 'NN')]
-
class
nltk.classify.senna.
Senna
(senna_path, operations, encoding='utf-8')[source]¶ Bases:
nltk.tag.api.TaggerI
-
SUPPORTED_OPERATIONS
= ['pos', 'chk', 'ner']¶
-
executable
(base_path)[source]¶ The function that determines the system-specific binary that should be used in the pipeline. If the system is not known, the default senna binary will be used.
-
tag_sents
(sentences)[source]¶ Applies the tag method over a list of sentences. This method will return a list of dictionaries. Every dictionary will contain a word with its calculated annotations/tags.
-
unicode_repr
¶ Return repr(self).
-
nltk.classify.svm module¶
nltk.classify.svm was deprecated. For classification based on support vector machines (SVMs), use nltk.classify.scikitlearn (or scikit-learn directly).
nltk.classify.tadm module¶
-
nltk.classify.tadm.
parse_tadm_weights
(paramfile)[source]¶ Given the stdout output generated by
tadm
when training a model, return anumpy
array containing the corresponding weight vector.
-
nltk.classify.tadm.
write_tadm_file
(train_toks, encoding, stream)[source]¶ Generate an input file for
tadm
based on the given corpus of classified tokens.Parameters: - train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
- encoding (TadmEventMaxentFeatureEncoding) – A feature encoding, used to convert featuresets into feature vectors.
- stream (stream) – The stream to which the
tadm
input file should be written.
nltk.classify.textcat module¶
A module for language identification using the TextCat algorithm. An implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization”.
The algorithm takes advantage of Zipf’s law and uses n-gram frequencies to profile languages and the text to be identified, then compares them using a distance measure.
Language n-grams are provided by the “An Crubadan” project. A corpus reader was created separately to read those files.
For details regarding the algorithm, see: http://www.let.rug.nl/~vannoord/TextCat/textcat.pdf
For details about An Crubadan, see: http://borel.slu.edu/crubadan/index.html
-
class
nltk.classify.textcat.
TextCat
[source]¶ Bases:
object
-
calc_dist
(lang, trigram, text_profile)[source]¶ Calculate the “out-of-place” measure between the text and language profile for a single trigram
-
fingerprints
= {}¶
-
guess_language
(text)[source]¶ Find the language with the min distance to the text and return its ISO 639-3 code
-
last_distances
= {}¶
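A brief, hypothetical usage sketch; it assumes the An Crubadan language profiles (the ‘crubadan’ corpus) and the required tokenizer data have been downloaded, and the exact ISO 639-3 code returned depends on that data.
>>> from nltk.classify.textcat import TextCat
>>> tc = TextCat()
>>> lang = tc.guess_language('This is a short sample of English text.')   # e.g. 'eng'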
-
nltk.classify.util module¶
Utility functions and classes for classifiers.
-
class
nltk.classify.util.
CutoffChecker
(cutoffs)[source]¶ Bases:
object
A helper class that implements cutoff checks based on number of iterations and log likelihood.
Accuracy cutoffs are also implemented, but they’re almost never a good idea to use.
-
nltk.classify.util.
apply_features
(feature_func, toks, labeled=None)[source]¶ Use the LazyMap class to construct a lazy list-like object that is analogous to map(feature_func, toks). In particular, if labeled=False, then the returned list-like object’s values are equal to:
[feature_func(tok) for tok in toks]
If labeled=True, then the returned list-like object’s values are equal to:
[(feature_func(tok), label) for (tok, label) in toks]
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
Parameters: - feature_func – The function that will be applied to each token. It should return a featureset – i.e., a dict mapping feature names to feature values.
- toks – The list of tokens to which feature_func should be applied. If labeled=False, then the list elements will be passed directly to feature_func(). If labeled=True, then the list elements should be tuples (tok, label), and tok will be passed to feature_func().
- labeled – If true, then toks contains labeled tokens, i.e., tuples of the form (tok, label). (Default: auto-detect based on types.)
nltk.classify.weka module¶
Classifiers that make use of the external ‘Weka’ package.
-
class
nltk.classify.weka.
ARFF_Formatter
(labels, features)[source]¶ Bases:
object
Converts featuresets and labeled featuresets to ARFF-formatted strings, appropriate for input into Weka.
Features and classes can be specified manually in the constructor, or may be determined from data using
from_train
.-
data_section
(tokens, labeled=None)[source]¶ Returns the ARFF data section for the given data.
Parameters: - tokens – a list of featuresets (dicts) or labelled featuresets which are tuples (featureset, label).
- labeled – Indicates whether the given tokens are labeled or not. If None, then the tokens will be assumed to be labeled if the first token’s value is a tuple or list.
-
-
class
nltk.classify.weka.
WekaClassifier
(formatter, model_filename)[source]¶ Bases:
nltk.classify.api.ClassifierI
-
classify_many
(featuresets)[source]¶ Apply self.classify() to each element of featuresets, i.e.: return [self.classify(fs) for fs in featuresets]. Return type: list(label)
-
Module contents¶
Classes and interfaces for labeling tokens with category labels (or
“class labels”). Typically, labels are represented with strings
(such as 'health'
or 'sports'
). Classifiers can be used to
perform a wide range of classification tasks. For example,
classifiers can be used…
- to classify documents by topic
- to classify ambiguous words by which word sense is intended
- to classify acoustic signals by which phoneme they represent
- to classify sentences by their author
Features¶
In order to decide which category label is appropriate for a given token, classifiers examine one or more ‘features’ of the token. These “features” are typically chosen by hand, and indicate which aspects of the token are relevant to the classification decision. For example, a document classifier might use a separate feature for each word, recording how often that word occurred in the document.
Featuresets¶
The features describing a token are encoded using a “featureset”,
which is a dictionary that maps from “feature names” to “feature
values”. Feature names are unique strings that indicate what aspect
of the token is encoded by the feature. Examples include
'prevword'
, for a feature whose value is the previous word; and
'contains-word(library)'
for a feature that is true when a document
contains the word 'library'
. Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.
Featuresets are typically constructed using a “feature detector” (also known as a “feature extractor”). A feature detector is a function that takes a token (and sometimes information about its context) as its input, and returns a featureset describing that token. For example, the following feature detector converts a document (stored as a list of words) to a featureset describing the set of words included in the document:
>>> # Define a feature detector function.
>>> def document_features(document):
... return dict([('contains-word(%s)' % w, True) for w in document])
Feature detectors are typically applied to each token before it is fed to the classifier:
>>> # Classify each Gutenberg document.
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
... doc = gutenberg.words(fileid)
... print(fileid, classifier.classify(document_features(doc)))
The parameters that a feature detector expects will vary, depending on the task and the needs of the feature detector. For example, a feature detector for word sense disambiguation (WSD) might take as its input a sentence, and the index of a word that should be classified, and return a featureset for that word. The following feature detector for WSD includes features describing the left and right contexts of the target word:
>>> def wsd_features(sentence, index):
... featureset = {}
... for i in range(max(0, index-3), index):
... featureset['left-context(%s)' % sentence[i]] = True
... for i in range(index, min(index+3, len(sentence))):
... featureset['right-context(%s)' % sentence[i]] = True
... return featureset
Training Classifiers¶
Most classifiers are built by training them on a list of hand-labeled
examples, known as the “training set”. Training sets are represented
as lists of (featuredict, label)
tuples.
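A minimal end-to-end sketch under invented data, tying the pieces together: a training set of (featureset, label) tuples, a trained classifier, and evaluation with nltk.classify.util.accuracy():
>>> from nltk.classify import NaiveBayesClassifier
>>> from nltk.classify.util import accuracy
>>> train_set = [({'contains-word(ball)': True, 'contains-word(game)': True}, 'sports'),
...              ({'contains-word(vote)': True, 'contains-word(party)': True}, 'politics')]
>>> classifier = NaiveBayesClassifier.train(train_set)
>>> test_set = [({'contains-word(ball)': True}, 'sports'),
...             ({'contains-word(vote)': True}, 'politics')]
>>> accuracy(classifier, test_set)
1.0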