A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy. A probability distribution is empirically consistent with a set of training data if its estimated frequency with which a class and a feature vector value co-occur is equal to the actual frequency in the data.
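In symbols (a standard formulation, given here for reference; the notation is not taken from this module's code): let \tilde{p} denote the empirical distribution of the training data and f_i the joint-features defined below. The classifier chooses

  p^* = \arg\max_{p \in C} \; -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)

where C is the set of empirically consistent models:

  C = \{\, p : \textstyle\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y) \ \text{for all } i \,\}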
The term feature is usually used to refer to some property of an unlabeled token. For example, when performing word sense disambiguation, we might define a 'prevword' feature whose value is the word preceding the target word. However, in the context of maxent modeling, the term feature is typically used to refer to a property of a labeled token. In order to prevent confusion, we will introduce two distinct terms to disambiguate these two concepts:

  - An input-feature is a property of an unlabeled token.
  - A joint-feature is a property of a labeled token.
In the rest of the nltk.classify module, the term features is used to refer to what we call input-features in this module.

In the literature that describes and discusses maximum entropy models, input-features are typically called contexts, and joint-features are simply referred to as features.
In maximum entropy models, joint-features are required to have numeric values. Typically, each input-feature input_feat is mapped to a set of joint-features of the form:

  joint_feat(token, label) = { 1 if input_feat(token) == feat_val
                             {      and label == some_label
                             {
                             { 0 otherwise

for all values of feat_val and some_label. This mapping is performed by classes that implement the MaxentFeatureEncodingI interface.
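Concretely, the mapping can be pictured as follows. This is a minimal sketch, not the module's actual encoding code; the prevword_feat input-feature, the dict-shaped tokens, and the labels are made up for illustration:

    # A hand-rolled binary joint-feature of the form shown above. In the
    # real module this mapping is done by MaxentFeatureEncodingI
    # implementations such as BinaryMaxentFeatureEncoding.
    def make_joint_feat(input_feat, feat_val, some_label):
        def joint_feat(token, label):
            return 1 if input_feat(token) == feat_val and label == some_label else 0
        return joint_feat

    # Example input-feature: the word preceding the target word.
    def prevword_feat(token):
        return token.get('prevword')

    jf = make_joint_feat(prevword_feat, 'bank', 'FINANCE')
    print(jf({'prevword': 'bank'}, 'FINANCE'))   # -> 1
    print(jf({'prevword': 'river'}, 'FINANCE'))  # -> 0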
Classifier Model

  - MaxentClassifier: A maximum entropy classifier (also known as a conditional exponential classifier).
  - ConditionalExponentialClassifier: Alias for MaxentClassifier.

Feature Encodings

  - MaxentFeatureEncodingI: A mapping that converts a set of input-feature values to a vector of joint-feature values, given a label.
  - FunctionBackedMaxentFeatureEncoding: A feature encoding that calls a user-supplied function to map a given featureset/label pair to a sparse joint-feature vector.
  - BinaryMaxentFeatureEncoding: A feature encoding that generates vectors containing binary joint-features of the form shown above.
  - GISEncoding: A binary feature encoding which adds one new joint-feature to the joint-features defined by BinaryMaxentFeatureEncoding: a correction feature, whose value is chosen to ensure that the sparse vector always sums to a constant non-negative number.
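Before turning to the trainers, here is a minimal usage sketch, not taken from this page. It assumes NLTK is installed; the two featuresets, the 'prevword' input-feature, and the labels are made up for illustration:

    # Train a MaxentClassifier on two toy featuresets using IIS.
    from nltk.classify.maxent import MaxentClassifier

    train_toks = [
        ({'prevword': 'the'}, 'NOUN'),
        ({'prevword': 'to'}, 'VERB'),
    ]
    classifier = MaxentClassifier.train(train_toks, algorithm='iis',
                                        trace=0, max_iter=10)
    print(classifier.classify({'prevword': 'the'}))  # expected: 'NOUN'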
Classifier Trainer: Generalized Iterative Scaling

Classifier Trainer: Improved Iterative Scaling
  (includes helper functions; one builds a dictionary from int to int, described below)

Classifier Trainer: scipy algorithms (CG, LBFGSB, etc.)

Classifier Trainer: megam

Demo

Each trainer is described below.
Train a new maxent classifier using the Generalized Iterative Scaling (GIS) algorithm. See Also: train_maxent_classifier() for parameter descriptions.
Train a new maxent classifier using the Improved Iterative Scaling (IIS) algorithm. See Also: train_maxent_classifier() for parameter descriptions.
Construct a map that can be used to compress nf, which is typically sparse. Here nf(feature_vector) is the sum of the feature values for feature_vector; it represents the number of features that are active for a given labeled text. This method finds all values of nf(t) that are attested for at least one token in the given list of training tokens, and constructs a dictionary mapping these attested values to a continuous range 0...N. For example, if the only values of nf() that were attested were 3, 5, and 7, then the resulting map would be {3: 0, 5: 1, 7: 2}.
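The construction is simple enough to sketch directly. The function name below is illustrative, not this module's actual API:

    # Map each attested nf value to an index in a continuous range 0...N.
    def build_nfmap(nf_values):
        return {nf: i for i, nf in enumerate(sorted(set(nf_values)))}

    print(build_nfmap([3, 7, 5, 3]))  # -> {3: 0, 5: 1, 7: 2}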
Calculate the update values for the classifier weights for this iteration of IIS. These update weights are the value of delta that solves the equation:

  ffreq_empirical[i]
      = SUM[fs,l] (classifier.prob_classify(fs).prob(l) *
                   feature_vector(fs,l)[i] *
                   exp(delta[i] * nf(feature_vector(fs,l))))

where:

  - (fs, l) is a (featureset, label) pair from the training tokens
  - feature_vector(fs, l) is the joint-feature vector produced by the feature encoding for that pair
  - nf(vector) is the sum of the feature values in vector

This method uses Newton's method to solve this equation for delta[i]. In particular, it starts with a guess of delta[i] = 1 and iteratively updates delta with:

  delta[i] -= (ffreq_empirical[i] - sum1[i]) / (-sum2[i])

until convergence, where sum1 and sum2 are defined as:

  sum1[i](delta) = SUM[fs,l] f[i](fs,l,delta)
  sum2[i](delta) = SUM[fs,l] (f[i](fs,l,delta) * nf(feature_vector(fs,l)))
  f[i](fs,l,delta) = (classifier.prob_classify(fs).prob(l) *
                      feature_vector(fs,l)[i] *
                      exp(delta[i] * nf(feature_vector(fs,l))))

Note that sum1 and sum2 depend on delta, so they must be re-computed on each iteration. The variables nfmap, nfarray, and nftranspose are used to generate a dense encoding for nf(feature_vector); this allows sum1 and sum2 to be computed using matrix operations, which yields a significant performance improvement.
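To make the update concrete, here is a minimal sketch of the Newton iteration for a single joint-feature i. All names are illustrative, not this module's actual API; each entry of points packages the per-(fs, l) quantities from the equation above:

    import math

    # points: list of (p, fv, nf) tuples standing in for
    # (classifier.prob_classify(fs).prob(l), feature_vector(fs,l)[i],
    #  nf(feature_vector(fs,l))) over all (fs, l) in the training data.
    def solve_delta_i(ffreq_empirical_i, points, max_newton=20, tol=1e-8):
        delta_i = 1.0  # initial guess, as described above
        for _ in range(max_newton):
            sum1 = sum(p * fv * math.exp(delta_i * nf) for p, fv, nf in points)
            sum2 = sum(p * fv * nf * math.exp(delta_i * nf) for p, fv, nf in points)
            if sum2 == 0:
                break  # avoid division by zero on degenerate input
            step = (ffreq_empirical_i - sum1) / (-sum2)
            delta_i -= step
            if abs(step) < tol:  # converged
                break
        return delta_i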
Train a new maxent classifier using one of the scipy optimization algorithms. See Also: train_maxent_classifier() for parameter descriptions. Requires: the scipy package must be installed.
Train a new maxent classifier using the external megam optimizer. See Also: train_maxent_classifier() for parameter descriptions.