Package nltk :: Package tag :: Module crf :: Class MalletCRF

Class MalletCRF

       object --+        
                |        
      api.TaggerI --+    
                    |    
api.FeaturesetTaggerI --+
                        |
                       MalletCRF

A conditional random field tagger, which is trained and run by making external calls to Mallet. Tokens are converted to featuresets using a feature detector function:

   feature_detector(tokens, index) -> featureset

These featuresets are then encoded into feature vectors by converting each feature (name, value) pair to a unique binary feature.
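As a minimal sketch of the two steps described above, the following shows a feature detector and one way each (name, value) pair could be mapped to a unique binary feature. The detector's features and the encode_binary helper are hypothetical illustrations, not part of NLTK; the "name=value" naming is an assumption.

```python
def feature_detector(tokens, index):
    """Hypothetical detector: return a featureset (a dict) for tokens[index]."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
    }


def encode_binary(featureset):
    """Map each (name, value) pair to the name of one binary feature.

    Assumption: a binary feature is identified by the string "name=value".
    """
    return sorted("%s=%s" % (name, value) for name, value in featureset.items())


tokens = ["John", "lives", "in", "Paris"]
featureset = feature_detector(tokens, 3)
binary_features = encode_binary(featureset)
print(binary_features)
```

Each distinct (name, value) combination yields a distinct binary feature, so "word=paris" and "word=london" are treated as unrelated features.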

Each MalletCRF object is backed by a crf model file. This model file is actually a zip file, and it contains one file for the serialized model (crf-model.ser) and one file for information about the structure of the CRF (crf-info.xml).
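Because the model file is an ordinary zip archive, its members can be listed with Python's standard zipfile module. The sketch below builds a stand-in archive in memory so that it runs without Mallet; the member contents are placeholders, and list_model_members is a hypothetical helper, not part of NLTK.

```python
import io
import zipfile


def list_model_members(model_file):
    """Return the sorted member names stored in a crf model archive."""
    with zipfile.ZipFile(model_file) as archive:
        return sorted(archive.namelist())


# Build a stand-in archive in memory (placeholder contents, not real model data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("crf-model.ser", b"placeholder for the serialized model")
    archive.writestr("crf-info.xml", "<placeholder/>")  # not the real schema
buffer.seek(0)

members = list_model_members(buffer)
print(members)
```

With a real tagger, the same helper could presumably be pointed at the path given by the filename property.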

Instance Methods
 
__init__(self, filename, feature_detector=None)
Create a new MalletCRF.
 
_get_filename(self)
 
_get_feature_detector(self)
 
batch_tag(self, sentences)
Apply self.tag() to each element of sentences.
 
write_training_corpus(self, corpus, stream, close_stream=True)
Write a given training corpus to a given stream, in a format that can be read by the Java class org.nltk.mallet.TrainCRF.
 
write_test_corpus(self, corpus, stream, close_stream=True)
Write a given test corpus to a given stream, in a format that can be read by the Java class org.nltk.mallet.TestCRF.
 
parse_mallet_output(self, s)
Parse the output that is generated by the Java class org.nltk.mallet.TestCRF, and convert it to a labeled corpus.
 
__repr__(self)
repr(x)

Inherited from api.TaggerI: tag

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Class Methods
 
train(cls, feature_detector, corpus, filename=None, weight_groups=None, gaussian_variance=1, default_label='O', transduction_type='VITERBI', max_iterations=500, add_start_state=True, add_end_state=True, trace=1)
Train a new linear chain CRF tagger based on the given corpus of training sequences.
Static Methods
 
_build_crf_info(corpus, gaussian_variance, default_label, max_iterations, transduction_type, weight_groups, add_start_state, add_end_state, model_filename, feature_detector)
Construct a CRFInfo object describing a CRF with a given set of configuration parameters, and based on the contents of a given corpus.
 
_filter_training_output(p, trace)
Filter the (very verbose) output that is generated by Mallet, and only display the interesting lines.
 
_escape_sub(m)
 
_format_feature(fname, fval)
Return a string name for a given feature (name, value) pair, appropriate for consumption by Mallet.
Class Variables
  _RUN_CRF = 'org.nltk.mallet.RunCRF'
The name of the Java class used to run MalletCRFs.
  _TRAIN_CRF = 'org.nltk.mallet.TrainCRF'
The name of the Java class used to train MalletCRFs.
  _FILTER_TRAINING_OUTPUT = [(1, 'DEBUG:.*'), (1, 'Number of wei...
A table used to filter the output that Mallet generates during training.
  _ESCAPE_RE = re.compile(r'[^a-zA-Z0-9]')
Instance Variables
  crf_info
A CRFInfo object describing this CRF.
Properties
  filename
The filename of the crf model file that backs this MalletCRF.
  feature_detector
The feature detector function that is used to convert tokens to featuresets.

Inherited from object: __class__

Method Details

__init__(self, filename, feature_detector=None)
(Constructor)

Create a new MalletCRF.

Parameters:
  • filename - The filename of the model file that backs this CRF.
  • feature_detector - The feature detector function that is used to convert tokens to featuresets. This parameter only needs to be given if the model file does not contain a pickled pointer to the feature detector (e.g., if the feature detector was a lambda function).
Overrides: object.__init__

batch_tag(self, sentences)

Apply self.tag() to each element of sentences, i.e., return:

   [self.tag(tokens) for tokens in sentences]
Overrides: api.TaggerI.batch_tag
(inherited documentation)

train(cls, feature_detector, corpus, filename=None, weight_groups=None, gaussian_variance=1, default_label='O', transduction_type='VITERBI', max_iterations=500, add_start_state=True, add_end_state=True, trace=1)
Class Method

Train a new linear chain CRF tagger based on the given corpus of training sequences. This tagger will be backed by a crf model file, containing both a serialized Mallet model and information about the CRF's structure. This crf model file is not deleted automatically; to remove it, you must delete it yourself. For a MalletCRF object crf, the filename of the model file is available as crf.filename.

Parameters:
  • feature_detector - The feature detector function that is used to convert tokens to featuresets.
  • corpus (list of tuple) - Training data, represented as a list of sentences, where each sentence is a list of (token, tag) tuples.
  • filename (str) - The filename that should be used for the crf model file that backs the new MalletCRF. If no filename is given, then a new filename will be chosen automatically.
  • weight_groups (list of CRFInfo.WeightGroup) - Specifies how input-features should be mapped to joint-features. See CRFInfo.WeightGroup for more information.
  • gaussian_variance (float) - The variance of the Gaussian prior that should be used to train the new CRF.
  • default_label (str) - The "label for initial context and uninteresting tokens" (from Mallet's SimpleTagger.java). It's unclear whether this currently has any effect.
  • transduction_type (str) - The type of transduction used by the CRF. Can be VITERBI, VITERBI_FBEAM, VITERBI_BBEAM, VITERBI_FBBEAM, or VITERBI_FBEAMKL.
  • max_iterations (int) - The maximum number of iterations that should be used for training the CRF.
  • add_start_state (bool) - If true, then NLTK will add a special start state, named '__start__'. The initial cost for the start state will be set to 0; and the initial cost for all other states will be set to +inf.
  • add_end_state (bool) - If true, then NLTK will add a special end state, named '__end__'. The final cost for the end state will be set to 0; and the final cost for all other states will be set to +inf.
  • trace (int) - Controls the verbosity of trace output generated while training the CRF. Higher numbers generate more verbose output.
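The sketch below shows how the arguments documented above might be assembled before calling train. The corpus contents, the feature detector, and the check_corpus helper are hypothetical; the actual call is left as a comment because it requires NLTK, Java, and Mallet to be installed, and the import path is inferred from this page's module name.

```python
# Training data: a list of sentences, each a list of (token, tag) tuples.
corpus = [
    [("John", "PER"), ("lives", "O"), ("in", "O"), ("Paris", "LOC")],
    [("Mary", "PER"), ("left", "O")],
]


def feature_detector(tokens, index):
    """Hypothetical detector returning a small featureset for tokens[index]."""
    return {"word": tokens[index].lower()}


def check_corpus(corpus):
    """Return True if every sentence element is a (token, tag) pair."""
    return all(
        isinstance(pair, tuple) and len(pair) == 2
        for sentence in corpus
        for pair in sentence
    )


# Keyword arguments mirroring the defaults in the documented signature.
train_kwargs = dict(
    gaussian_variance=1,
    default_label="O",
    transduction_type="VITERBI",
    max_iterations=500,
    add_start_state=True,
    add_end_state=True,
    trace=1,
)

assert check_corpus(corpus)

# With NLTK, Java, and Mallet available, the call would presumably be:
#   from nltk.tag.crf import MalletCRF
#   crf = MalletCRF.train(feature_detector, corpus, **train_kwargs)
#   print(crf.filename)  # the model file is not deleted automatically
```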

_filter_training_output(p, trace)
Static Method

Filter the (very verbose) output that is generated by Mallet, and only display the interesting lines. The lines that are selected for display are determined by _FILTER_TRAINING_OUTPUT.

_format_feature(fname, fval)
Static Method

Return a string name for a given feature (name, value) pair, appropriate for consumption by Mallet. To be conservative, every character in fname or fval that is not a letter or a digit is escaped.
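One way such escaping could work is sketched below, using the same pattern as _ESCAPE_RE. The replacement scheme (a percent sign followed by the character's two-digit hex code) and the "name=value" joining are assumptions made for illustration; this page does not state what _escape_sub actually returns.

```python
import re

# Same pattern as the documented _ESCAPE_RE class variable.
ESCAPE_RE = re.compile(r"[^a-zA-Z0-9]")


def escape_sub(match):
    """Assumed scheme: replace the matched character with '%' plus its hex code."""
    return "%" + ("%02x" % ord(match.group()))


def format_feature(fname, fval):
    """Escape the name and the value, then join them (assumed 'name=value' form)."""
    return "%s=%s" % (
        ESCAPE_RE.sub(escape_sub, str(fname)),
        ESCAPE_RE.sub(escape_sub, str(fval)),
    )


formatted = format_feature("prev word", "new-york")
print(formatted)
```

Escaping everything outside letters and digits keeps whitespace and punctuation in feature names from being misread as separators in the corpus file.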

__repr__(self)
(Representation operator)

repr(x)

Overrides: object.__repr__
(inherited documentation)

Class Variable Details

_FILTER_TRAINING_OUTPUT

A table used to filter the output that Mallet generates during training. By default, Mallet generates very verbose output. This table is used to select which lines of output are worth displaying to the user, based on the level of the trace parameter. Each entry of this table is a tuple (min_trace_level, regexp). A line is displayed only if, for at least one table entry, trace >= min_trace_level and the line matches regexp.

Value:
[(1, 'DEBUG:.*'),
 (1, 'Number of weights.*'),
 (1, 'CRF about to train.*'),
 (1, 'CRF finished.*'),
 (1, 'CRF training has converged.*'),
 (2, 'CRF weights.*'),
 (2, 'getValue\\(\\) \\(loglikelihood\\) .*')]
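The selection rule described above can be sketched as follows. The table is copied from the value shown; should_display is a hypothetical helper, and the use of re.match (anchored at the start of the line) is an assumption about how lines are compared against each regexp.

```python
import re

# Copied from the documented value of _FILTER_TRAINING_OUTPUT.
FILTER_TABLE = [
    (1, 'DEBUG:.*'),
    (1, 'Number of weights.*'),
    (1, 'CRF about to train.*'),
    (1, 'CRF finished.*'),
    (1, 'CRF training has converged.*'),
    (2, 'CRF weights.*'),
    (2, 'getValue\\(\\) \\(loglikelihood\\) .*'),
]


def should_display(line, trace):
    """True if some entry has min_trace_level <= trace and its regexp matches."""
    return any(
        trace >= min_trace_level and re.match(regexp, line) is not None
        for min_trace_level, regexp in FILTER_TABLE
    )


lines = [
    "CRF about to train with 3 iterations",
    "getValue() (loglikelihood) = -12.5",
    "INFO: unrelated chatter",
]
shown_at_1 = [line for line in lines if should_display(line, 1)]
shown_at_2 = [line for line in lines if should_display(line, 2)]
print(shown_at_1)
print(shown_at_2)
```

Under this rule, raising trace from 1 to 2 only adds lines; a line that matches no entry is never shown, whatever the trace level.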

Property Details

filename

The filename of the crf model file that backs this MalletCRF. The crf model file is actually a zip file, and it contains one file for the serialized model (crf-model.ser) and one file for information about the structure of the CRF (crf-info.xml).

Get Method:
_get_filename(self)

feature_detector

The feature detector function that is used to convert tokens to featuresets. This function has the signature:

   feature_detector(tokens, index) -> featureset

Get Method:
_get_feature_detector(self)