8 """
9 Classes and interfaces for labeling tokens with category labels (or
10 X{class labels}). Typically, labels are represented with strings
11 (such as C{'health'} or C{'sports'}). Classifiers can be used to
12 perform a wide range of classification tasks. For example,
13 classifiers can be used...
14
15 - to classify documents by topic.
16 - to classify ambiguous words by which word sense is intended.
17 - to classify acoustic signals by which phoneme they represent.
18 - to classify sentences by their author.
19
20 Features
21 ========
22 In order to decide which category label is appropriate for a given
23 token, classifiers examine one or more 'features' of the token. These
24 X{features} are typically chosen by hand, and indicate which aspects
25 of the token are relevant to the classification decision. For
26 example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.
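
For instance, a simple detector for such word-count features might
look like the following sketch (the C{'count(...)'} feature names
here are purely illustrative):

>>> def word_count_features(document):
...     features = {}
...     for w in document:
...         features['count(%s)' % w] = features.get('count(%s)' % w, 0) + 1
...     return features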

Featuresets
===========
The features describing a token are encoded using a X{featureset},
which is a dictionary that maps from X{feature names} to X{feature
values}. Feature names are unique strings that indicate what aspect
of the token is encoded by the feature. Examples include
C{'prevword'}, for a feature whose value is the previous word; and
C{'contains-word(library)'} for a feature that is true when a document
contains the word C{'library'}. Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.
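
For example, a featureset for a word-classification task might look
like the following (the feature values shown here are illustrative):

>>> featureset = {'prevword': 'the', 'contains-word(library)': True}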

Featuresets are typically constructed using a X{feature detector}
(also known as a X{feature extractor}). A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that token.
For example, the following feature detector converts a document
(stored as a list of words) to a featureset describing the set of
words included in the document:

>>> # Define a feature detector function.
>>> def document_features(document):
...     return dict([('contains-word(%s)' % w, True) for w in document])

Feature detectors are typically applied to each token before it is fed
to the classifier:

>>> # Classify each Gutenberg document.
>>> for file in gutenberg.files():
...     doc = gutenberg.tokenized(file)
...     print file, classifier.classify(document_features(doc))

The parameters that a feature detector expects will vary, depending on
the task and the needs of the feature detector. For example, a
feature detector for word sense disambiguation (WSD) might take as its
input a sentence, and the index of a word that should be classified,
and return a featureset for that word. The following feature detector
for WSD includes features describing the left and right contexts of
the target word:

>>> def wsd_features(sentence, index):
...     featureset = {}
...     for i in range(max(0, index-3), index):
...         featureset['left-context(%s)' % sentence[i]] = True
...     for i in range(index+1, min(index+4, len(sentence))):
...         featureset['right-context(%s)' % sentence[i]] = True
...     return featureset

Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the X{training set}. Training sets are represented
as lists of C{(featureset, label)} tuples.
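
For example, a naive Bayes classifier can be trained and applied as
follows. This is just a sketch: it assumes that C{labeled_documents}
is a list of C{(document, label)} pairs, that C{new_document} is an
unlabeled document, and that the C{'sports'} output is illustrative.

>>> train_set = [(document_features(doc), label)
...              for (doc, label) in labeled_documents]
>>> classifier = NaiveBayesClassifier.train(train_set)
>>> classifier.classify(document_features(new_document))
'sports'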
82 """
83
84 from nltk.internals import deprecated, Deprecated
85
86 from api import *
87 from util import *
88 from naivebayes import *
89 from decisiontree import *
90 from weka import *
91 from megam import *
92
__all__ = [
    # Classifier interfaces
    'ClassifierI', 'MultiClassifierI',

    # Classifiers
    'NaiveBayesClassifier', 'DecisionTreeClassifier', 'WekaClassifier',

    # Configuration helpers for external classifier packages
    'config_weka', 'config_megam',
    ]


try:
    import numpy
    from maxent import *
    __all__ += ['MaxentClassifier', 'ConditionalExponentialClassifier',
                'train_maxent_classifier']
except ImportError:
    pass


######################################################################
#{ Deprecated
######################################################################

class ClassifyI(ClassifierI, Deprecated):
    """Use nltk.ClassifierI instead."""

@deprecated("Use nltk.classify.accuracy() instead.")
def classifier_accuracy(classifier, gold):
    return accuracy(classifier, gold)

@deprecated("Use nltk.classify.log_likelihood() instead.")
def classifier_log_likelihood(classifier, gold):
    return log_likelihood(classifier, gold)