Source Code for Package nltk.classify

# Natural Language Toolkit: Classifiers
#
# Copyright (C) 2001-2008 NLTK Project
# Author: Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://nltk.org>
# For license information, see LICENSE.TXT

"""
Classes and interfaces for labeling tokens with category labels (or
X{class labels}).  Typically, labels are represented with strings
(such as C{'health'} or C{'sports'}).  Classifiers can be used to
perform a wide range of classification tasks.  For example,
classifiers can be used...

  - to classify documents by topic.
  - to classify ambiguous words by which word sense is intended.
  - to classify acoustic signals by which phoneme they represent.
  - to classify sentences by their author.

Features
========
In order to decide which category label is appropriate for a given
token, classifiers examine one or more 'features' of the token.  These
X{features} are typically chosen by hand, and indicate which aspects
of the token are relevant to the classification decision.  For
example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.

Featuresets
===========
The features describing a token are encoded using a X{featureset},
which is a dictionary that maps from X{feature names} to X{feature
values}.  Feature names are unique strings that indicate what aspect
of the token is encoded by the feature.  Examples include
C{'prevword'}, for a feature whose value is the previous word; and
C{'contains-word(library)'} for a feature that is true when a document
contains the word C{'library'}.  Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.

Featuresets are typically constructed using a X{feature detector}
(also known as a X{feature extractor}).  A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that token.
For example, the following feature detector converts a document
(stored as a list of words) to a featureset describing the set of
words included in the document:

    >>> # Define a feature detector function.
    >>> def document_features(document):
    ...     return dict([('contains-word(%s)' % w, True) for w in document])

Feature detectors are typically applied to each token before it is fed
to the classifier:

    >>> # Classify each Gutenberg document.
    >>> for file in gutenberg.files():
    ...     doc = gutenberg.tokenized(file)
    ...     print file, classifier.classify(document_features(doc))

The parameters that a feature detector expects will vary, depending on
the task and the needs of the feature detector.  For example, a
feature detector for word sense disambiguation (WSD) might take as its
input a sentence, and the index of a word that should be classified,
and return a featureset for that word.  The following feature detector
for WSD includes features describing the left and right contexts of
the target word (up to three words on each side):

    >>> def wsd_features(sentence, index):
    ...     featureset = {}
    ...     for i in range(max(0, index-3), index):
    ...         featureset['left-context(%s)' % sentence[i]] = True
    ...     for i in range(index+1, min(index+4, len(sentence))):
    ...         featureset['right-context(%s)' % sentence[i]] = True
    ...     return featureset

Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the X{training set}.  Training sets are represented
as lists of C{(featureset, label)} tuples.
"""

from nltk.internals import deprecated, Deprecated

from api import *
from util import *
from naivebayes import *
from decisiontree import *
from weka import *
from megam import *

__all__ = [
    # Classifier Interfaces
    'ClassifierI', 'MultiClassifierI',

    # Classifiers
    'NaiveBayesClassifier', 'DecisionTreeClassifier', 'WekaClassifier',

    # Utility functions.  Note that accuracy() is intentionally
    # omitted -- it should be accessed as nltk.classify.accuracy();
    # similarly for log_likelihood() and attested_labels().
    'config_weka', 'config_megam',

    # Demos -- not included.
    ]


try:
    import numpy
    from maxent import *
    __all__ += ['MaxentClassifier', 'ConditionalExponentialClassifier',
                'train_maxent_classifier']
except ImportError:
    pass

######################################################################
#{ Deprecated
######################################################################
class ClassifyI(ClassifierI, Deprecated):
    """Use nltk.ClassifierI instead."""

@deprecated("Use nltk.classify.accuracy() instead.")
def classifier_accuracy(classifier, gold):
    return accuracy(classifier, gold)
@deprecated("Use nltk.classify.log_likelihood() instead.")
def classifier_log_likelihood(classifier, gold):
    return log_likelihood(classifier, gold)
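The module docstring describes a training set as a list of labeled featuresets. As a rough illustration of that data shape, the sketch below builds such a list with the docstring's `document_features` detector and feeds it to a deliberately simple vote-counting classifier. It assumes Python 3 and uses only the standard library. The example documents and labels are invented, and `train_feature_votes` and `classify` are hypothetical helpers written for this sketch; they are not part of `nltk.classify`.

```python
from collections import Counter, defaultdict


def document_features(document):
    """Map a document (a list of words) to a featureset (a dict)."""
    return {'contains-word(%s)' % w: True for w in document}


# A training set is a list of (featureset, label) pairs.
# These documents and labels are made up for illustration.
train_set = [
    (document_features(['the', 'library', 'opens', 'early']), 'books'),
    (document_features(['the', 'match', 'ended', 'early']), 'sports'),
    (document_features(['a', 'library', 'card']), 'books'),
]


def train_feature_votes(labeled_featuresets):
    """Hypothetical toy trainer: count how often each
    (feature name, feature value) pair occurs with each label."""
    votes = defaultdict(Counter)
    for featureset, label in labeled_featuresets:
        for name, value in featureset.items():
            votes[(name, value)][label] += 1
    return votes


def classify(votes, featureset):
    """Return the label with the most votes across the featureset's
    features, or None if none of its features were seen in training."""
    totals = Counter()
    for item in featureset.items():
        totals.update(votes.get(item, {}))
    return totals.most_common(1)[0][0] if totals else None


votes = train_feature_votes(train_set)
prediction = classify(votes, document_features(['library', 'hours']))
```

The same `train_set` shape is what a trainable classifier in this package would be expected to consume; the vote counter here only stands in for one so that the example stays self-contained.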