Classifiers

Classifiers label tokens with category labels (or class labels). Typically, labels are represented with strings (such as "health" or "sports"). In NLTK, classifiers are defined using classes that implement the ClassifierI interface:

 
>>> import nltk
>>> nltk.usage(nltk.ClassifierI)
ClassifierI supports the following operations:
  - self.batch_classify(featuresets)
  - self.batch_prob_classify(featuresets)
  - self.classify(featureset)
  - self.labels()
  - self.prob_classify(featureset)
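
As a sketch of what implementing this interface involves, here is a trivial classifier that ignores its input. The class name is hypothetical, and we rely on ClassifierI deriving the batch methods from classify(), as the listing above suggests:

 
>>> class FixedClassifier(nltk.ClassifierI):
...     '''Toy classifier that assigns the label 'x' to everything.'''
...     def labels(self):
...         return ['x', 'y']
...     def classify(self, featureset):
...         return 'x'
>>> FixedClassifier().batch_classify([dict(a=1), dict(a=0)])
['x', 'x']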

Currently, NLTK defines several classifier classes, including NaiveBayesClassifier, DecisionTreeClassifier, and MaxentClassifier; the regression tests below exercise each of these.

Classifiers are typically created by training them on a training corpus.
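
Each item in a training corpus pairs a featureset (a dictionary mapping feature names to feature values) with its label. As a purely illustrative sketch, a feature extractor for single words might look like this:

 
>>> def word_features(word):
...     '''Illustrative extractor: map a word to a featureset.'''
...     return {'last_letter': word[-1], 'length': len(word)}
>>> sorted(word_features('sports').items())
[('last_letter', 's'), ('length', 6)]

Pairing such featuresets with labels, e.g. (word_features('sports'), 'sports'), gives training data in the form that each classifier's train() method expects.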

Regression Tests

We define a very simple training corpus with 3 binary features: ['a', 'b', 'c'], and two labels: ['x', 'y']. We use a simple feature set so that the correct answers can be calculated analytically (although we haven't done this yet for all tests).

 
>>> train = [
...     (dict(a=1,b=1,c=1), 'y'),
...     (dict(a=1,b=1,c=1), 'x'),
...     (dict(a=1,b=1,c=0), 'y'),
...     (dict(a=0,b=1,c=1), 'x'),
...     (dict(a=0,b=1,c=1), 'y'),
...     (dict(a=0,b=0,c=1), 'y'),
...     (dict(a=0,b=1,c=0), 'x'),
...     (dict(a=0,b=0,c=0), 'x'),
...     (dict(a=0,b=1,c=1), 'y'),
...     ]
>>> test = [
...     (dict(a=1,b=0,c=1)), # unseen
...     (dict(a=1,b=0,c=0)), # unseen
...     (dict(a=0,b=1,c=1)), # seen 3 times, labels=y,y,x
...     (dict(a=0,b=1,c=0)), # seen 1 time, label=x
...     ]

Test the Naive Bayes classifier:

 
>>> classifier = nltk.NaiveBayesClassifier.train(train)
>>> sorted(classifier.labels())
['x', 'y']
>>> classifier.batch_classify(test)
['y', 'x', 'y', 'x']
>>> for pdist in classifier.batch_prob_classify(test):
...     print '%.4f %.4f' % (pdist.prob('x'), pdist.prob('y'))
0.3104 0.6896
0.5746 0.4254
0.3685 0.6315
0.6365 0.3635
>>> classifier.show_most_informative_features()
Most Informative Features
                       c = 0                    x : y     =      2.0 : 1.0
                       c = 1                    y : x     =      1.5 : 1.0
                       a = 1                    y : x     =      1.4 : 1.0
                       a = 0                    x : y     =      1.2 : 1.0
                       b = 0                    x : y     =      1.2 : 1.0
                       b = 1                    y : x     =      1.1 : 1.0
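
Because the corpus is so small, these numbers can be checked by hand. The arithmetic below assumes the trainer's default smoothing is ELE (add-0.5) estimation, with one extra bin reserved for unseen (None) feature values, so that each conditional probability is (count + 0.5) / (N + 1.5); this is an assumption about the implementation's defaults, but it reproduces the first row above:

 
>>> # Priors: y occurs 5/9 times, x 4/9 (ELE over the 2 label bins).
>>> # Conditionals for test[0] = dict(a=1, b=0, c=1):
>>> p_y = (5.5/10) * (2.5/6.5) * (1.5/6.5) * (4.5/6.5)
>>> p_x = (4.5/10) * (1.5/5.5) * (1.5/5.5) * (2.5/5.5)
>>> print '%.4f %.4f' % (p_x/(p_x+p_y), p_y/(p_x+p_y))
0.3104 0.6896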

Test the decision tree classifier:

 
>>> classifier = nltk.DecisionTreeClassifier.train(
...     train, entropy_cutoff=0, support_cutoff=0)
best stump for    9 toks uses                    c err=0.3333
best stump for    3 toks uses                    a err=0.0000
best stump for    6 toks uses                 None err=0.3333
>>> sorted(classifier.labels())
['x', 'y']
>>> print classifier
c=0? .................................................. x
  a=0? ................................................ x
  a=1? ................................................ y
c=1? .................................................. y

>>> classifier.batch_classify(test)
['y', 'y', 'y', 'x']
>>> for pdist in classifier.batch_prob_classify(test):
...     print '%.4f %.4f' % (pdist.prob('x'), pdist.prob('y'))
Traceback (most recent call last):
  ...
NotImplementedError

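The printed tree is read from the root down: a test case first follows the branch matching its c value, and within the c=0 branch it is split again on a. For example, test[1] = dict(a=1, b=0, c=0) takes the c=0 branch and then the a=1 leaf:

 
>>> classifier.classify(dict(a=1, b=0, c=0))
'y'
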
Test the maxent classifier training algorithms; they should all generate the same results.

 
>>> def test_maxent(algorithms):
...     classifiers = {}
...     for algorithm in algorithms:
...         # megam relies on an external binary; skip it if the
...         # binary cannot be located.
...         if algorithm.lower() == 'megam':
...             try: nltk.classify.megam.config_megam()
...             except LookupError: continue
...         classifiers[algorithm] = nltk.MaxentClassifier.train(
...             train, algorithm, trace=0, max_iter=1000)
...     print ' '*11+''.join(['      test[%s]  ' % i
...                           for i in range(len(test))])
...     print ' '*11+'     p(x)  p(y)'*len(test)
...     print '-'*(11+15*len(test))
...     for algorithm, classifier in classifiers.items():
...         print '%11s' % algorithm,
...         for featureset in test:
...             pdist = classifier.prob_classify(featureset)
...             print '%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')),
...         print
>>> test_maxent(nltk.classify.MaxentClassifier.ALGORITHMS)
[Found megam: ...]
                 test[0]        test[1]        test[2]        test[3]
                p(x)  p(y)     p(x)  p(y)     p(x)  p(y)     p(x)  p(y)
-----------------------------------------------------------------------
     LBFGSB     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
        GIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
        IIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
Nelder-Mead     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
         CG     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
       BFGS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
      MEGAM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
     Powell     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
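
Since the algorithms converge to the same model, any single one can be used on its own; for example, training with just IIS and classifying the first (unseen) test case yields 'y', consistent with the p(y) = 0.84 column above:

 
>>> classifier = nltk.MaxentClassifier.train(train, 'IIS',
...                                          trace=0, max_iter=1000)
>>> classifier.classify(test[0])
'y'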