Class MalletCRF

       object --+        
                |        
      api.TaggerI --+    
                    |    
api.FeaturesetTaggerI --+
                        |
                       MalletCRF

A conditional random field tagger, which is trained and run by making external calls to Mallet. Tokens are converted to featuresets using a feature detector function:

   feature_detector(tokens, index) -> featureset

These featuresets are then encoded into feature vectors by converting each feature (name, value) pair to a unique binary feature.
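
As a minimal sketch of these two steps: the feature names (`word`, `is_capitalized`, `prev_word`) and the helper `encode_binary` below are hypothetical and chosen only for illustration. They are not part of NLTK, and the exact string form NLTK uses for each binary feature may differ.

```python
def feature_detector(tokens, index):
    """Hypothetical detector: map the token at `index` to a featureset (a dict)."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
    }


def encode_binary(featureset):
    """Illustrative encoding: each (name, value) pair becomes one binary feature."""
    return sorted("%s=%r" % (name, value) for name, value in featureset.items())


tokens = ["Alice", "visited", "Paris"]
featuresets = [feature_detector(tokens, i) for i in range(len(tokens))]
binary_features = [encode_binary(fs) for fs in featuresets]
```

A binary feature such as `word='paris'` is either present or absent for a given token, which is what lets arbitrary (name, value) pairs be turned into a feature vector.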

Each MalletCRF object is backed by a crf model file. This model file is actually a zip file, and it contains one file for the serialized model (crf-model.ser) and one file for information about the structure of the CRF (crf-info.xml).
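
Because the model file is an ordinary zip archive, it can be inspected with Python's standard `zipfile` module. The helper names below are hypothetical; only the member names crf-model.ser and crf-info.xml come from the description above.

```python
import zipfile


def list_model_members(model_filename):
    """Return the names of the files stored inside a crf model (zip) file."""
    with zipfile.ZipFile(model_filename) as archive:
        return sorted(archive.namelist())


def read_crf_info_xml(model_filename):
    """Return the raw bytes of crf-info.xml from a crf model file."""
    with zipfile.ZipFile(model_filename) as archive:
        return archive.read("crf-info.xml")
```

For a model file produced by training, `list_model_members(crf.filename)` would be expected to list the two members named above.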

Instance Methods

__init__(self, filename, feature_detector=None)
Create a new MalletCRF.

_get_filename(self)

_get_feature_detector(self)

batch_tag(self, sentences)
Apply self.tag() to each element of sentences.

write_training_corpus(self, corpus, stream, close_stream=True)
Write a given training corpus to a given stream, in a format that can be read by the Java class org.nltk.mallet.TrainCRF.

write_test_corpus(self, corpus, stream, close_stream=True)
Write a given test corpus to a given stream, in a format that can be read by the Java class org.nltk.mallet.TestCRF.

parse_mallet_output(self, s)
Parse the output that is generated by the Java class org.nltk.mallet.TestCRF, and convert it to a labeled corpus.

__repr__(self)
repr(x)

Inherited from api.TaggerI: tag

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Class Methods

train(cls, feature_detector, corpus, filename=None, weight_groups=None, gaussian_variance=1, default_label='O', transduction_type='VITERBI', max_iterations=500, add_start_state=True, add_end_state=True, trace=1)
Train a new linear chain CRF tagger based on the given corpus of training sequences.

Static Methods

_build_crf_info(corpus, gaussian_variance, default_label, max_iterations, transduction_type, weight_groups, add_start_state, add_end_state, model_filename, feature_detector)
Construct a CRFInfo object describing a CRF with a given set of configuration parameters, and based on the contents of a given corpus.

_filter_training_output(p, trace)
Filter the (very verbose) output that is generated by mallet, and only display the interesting lines.

_escape_sub(m)

_format_feature(fname, fval)
Return a string name for a given feature (name, value) pair, appropriate for consumption by mallet.

Class Variables

_RUN_CRF = 'org.nltk.mallet.RunCRF'
The name of the Java class used to run MalletCRFs.

_TRAIN_CRF = 'org.nltk.mallet.TrainCRF'
The name of the Java class used to train MalletCRFs.

_FILTER_TRAINING_OUTPUT = [(1, 'DEBUG:.*'), (1, 'Number of wei...
A table used to filter the output that mallet generates during training.

_ESCAPE_RE = re.compile(r'[^a-zA-Z0-9]')

Instance Variables

crf_info
A CRFInfo object describing this CRF.

Properties

filename
The filename of the crf model file that backs this MalletCRF.

feature_detector
The feature detector function that is used to convert tokens to featuresets.

Inherited from object: __class__

Method Details

__init__(self, filename, feature_detector=None)
(Constructor)

Create a new MalletCRF.

Parameters:

filename - The filename of the model file that backs this CRF.
feature_detector - The feature detector function that is used to convert tokens to featuresets. This parameter only needs to be given if the model file does not contain a pickled pointer to the feature detector (e.g., if the feature detector was a lambda function).

Overrides: object.__init__
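
The lambda case mentioned for feature_detector follows from how Python's `pickle` module works: functions are pickled by reference to an importable name, and a lambda has none. The sketch below demonstrates only that general Python behaviour; `can_be_pickled` is a hypothetical helper, and how MalletCRF actually stores the pointer inside the model file is not shown here.

```python
import pickle


def can_be_pickled(obj):
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A named, importable function (here the builtin `len`) pickles by reference.
named_ok = can_be_pickled(len)

# A lambda has no importable name, so pickling it fails.
lambda_ok = can_be_pickled(lambda tokens, index: {"word": tokens[index]})
```

When the detector used for training was a lambda, the same function therefore has to be passed again as feature_detector when the model file is reopened.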

batch_tag(self, sentences)

Apply self.tag() to each element of sentences. I.e.:

   return [self.tag(tokens) for tokens in sentences]

Overrides: api.TaggerI.batch_tag: (inherited documentation)

train(cls, feature_detector, corpus, filename=None, weight_groups=None, gaussian_variance=1, default_label='O', transduction_type='VITERBI', max_iterations=500, add_start_state=True, add_end_state=True, trace=1)
Class Method

Train a new linear chain CRF tagger based on the given corpus of training sequences. This tagger will be backed by a crf model file, containing both a serialized Mallet model and information about the CRF's structure. This crf model file will not be automatically deleted -- if you wish to delete it, you must delete it manually. The filename of the model file for a MalletCRF crf is available as crf.filename.

Parameters:

corpus (list of tuple) - Training data, represented as a list of sentences, where each sentence is a list of (token, tag) tuples.
filename (str) - The filename that should be used for the crf model file that backs the new MalletCRF. If no filename is given, then a new filename will be chosen automatically.
weight_groups (list of CRFInfo.WeightGroup) - Specifies how input-features should be mapped to joint-features. See CRFInfo.WeightGroup for more information.
gaussian_variance (float) - The gaussian variance of the prior that should be used to train the new CRF.
default_label (str) - The "label for initial context and uninteresting tokens" (from Mallet's SimpleTagger.java). It's unclear whether this currently has any effect.
transduction_type (str) - The type of transduction used by the CRF. Can be VITERBI, VITERBI_FBEAM, VITERBI_BBEAM, VITERBI_FBBEAM, or VITERBI_FBEAMKL.
max_iterations (int) - The maximum number of iterations that should be used for training the CRF.
add_start_state (bool) - If true, then NLTK will add a special start state, named '__start__'. The initial cost for the start state will be set to 0; and the initial cost for all other states will be set to +inf.
add_end_state (bool) - If true, then NLTK will add a special end state, named '__end__'. The final cost for the end state will be set to 0; and the final cost for all other states will be set to +inf.
trace (int) - Controls the verbosity of trace output generated while training the CRF. Higher numbers generate more verbose output.
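
The parameter descriptions above can be made concrete with a small sketch of the expected inputs. The tags (`PER`, `LOC`), the helper `check_corpus`, and the constant `TRANSDUCTION_TYPES` are hypothetical names introduced for illustration; the option values are the defaults from the signature. The actual call to train() is left as a comment because it needs an NLTK release that ships MalletCRF and a working Mallet installation.

```python
# Training data: a list of sentences, each a list of (token, tag) tuples.
corpus = [
    [("Alice", "PER"), ("visited", "O"), ("Paris", "LOC")],
    [("Bob", "PER"), ("slept", "O")],
]

# Transduction types listed in the parameter description above.
TRANSDUCTION_TYPES = (
    "VITERBI", "VITERBI_FBEAM", "VITERBI_BBEAM",
    "VITERBI_FBBEAM", "VITERBI_FBEAMKL",
)


def check_corpus(corpus):
    """Hypothetical helper: verify the documented corpus shape."""
    for sentence in corpus:
        for item in sentence:
            if not (isinstance(item, tuple) and len(item) == 2):
                raise ValueError("expected (token, tag) tuple, got %r" % (item,))
    return True


# Keyword options, set to the defaults from the train() signature.
train_options = dict(
    gaussian_variance=1, default_label="O", transduction_type="VITERBI",
    max_iterations=500, add_start_state=True, add_end_state=True, trace=1,
)
assert train_options["transduction_type"] in TRANSDUCTION_TYPES
corpus_ok = check_corpus(corpus)

# Not executed here; shown only to illustrate the call shape:
#     crf = MalletCRF.train(feature_detector, corpus, **train_options)
#     print(crf.filename)   # the model file is not deleted automatically
```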

_filter_training_output(p, trace)
Static Method

Filter the (very verbose) output that is generated by mallet, and only display the interesting lines. The lines that are selected for display are determined by _FILTER_TRAINING_OUTPUT.

_format_feature(fname, fval)
Static Method

Return a string name for a given feature (name, value) pair, appropriate for consumption by mallet. We escape every character in fname or fval that's not a letter or a number, just to be conservative.
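
A minimal sketch of this kind of conservative escaping is given below. The regular expression matches the _ESCAPE_RE class variable; the percent-plus-hex replacement scheme and the `name=value` layout are assumptions made for illustration, since this page does not show what _escape_sub returns or how the two parts are joined.

```python
import re

# Same pattern as the _ESCAPE_RE class variable: anything that is not
# an ASCII letter or digit.
ESCAPE_RE = re.compile(r"[^a-zA-Z0-9]")


def escape_sub(match):
    """Assumed escape scheme: '%' followed by the character's hex code point."""
    return "%%%02x" % ord(match.group())


def format_feature(fname, fval):
    """Sketch: build an escaped 'name=value' string from a (name, value) pair."""
    name = ESCAPE_RE.sub(escape_sub, fname)
    value = ESCAPE_RE.sub(escape_sub, str(fval))
    return name + "=" + value
```

Escaping everything outside letters and digits means that spaces, punctuation, and separators in feature names or values cannot be confused with the delimiters of the corpus format passed to Mallet.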

__repr__(self)
(Representation operator)

repr(x)

Overrides: object.__repr__: (inherited documentation)

Class Variable Details

_FILTER_TRAINING_OUTPUT

A table used to filter the output that mallet generates during training. By default, mallet generates very verbose output. This table is used to select which lines of output are actually worth displaying to the user, based on the level of the trace parameter. Each entry of this table is a tuple (min_trace_level, regexp). A line will be displayed only if trace>=min_trace_level and the line matches regexp for at least one table entry.

Value:

[(1, 'DEBUG:.*'),
 (1, 'Number of weights.*'),
 (1, 'CRF about to train.*'),
 (1, 'CRF finished.*'),
 (1, 'CRF training has converged.*'),
 (2, 'CRF weights.*'),
 (2, 'getValue\\(\\) \\(loglikelihood\\) .*')]
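
The selection rule described above can be sketched as follows. The table is copied from the value shown; `filter_lines` is a hypothetical helper, the sample lines are invented, and the use of `re.match` (anchored at the start of the line) is an assumption about how "matches" is interpreted.

```python
import re

FILTER_TRAINING_OUTPUT = [
    (1, r"DEBUG:.*"),
    (1, r"Number of weights.*"),
    (1, r"CRF about to train.*"),
    (1, r"CRF finished.*"),
    (1, r"CRF training has converged.*"),
    (2, r"CRF weights.*"),
    (2, r"getValue\(\) \(loglikelihood\) .*"),
]


def filter_lines(lines, trace):
    """Return the lines matching an entry whose min_trace_level <= trace."""
    shown = []
    for line in lines:
        for min_trace_level, regexp in FILTER_TRAINING_OUTPUT:
            if trace >= min_trace_level and re.match(regexp, line):
                shown.append(line)
                break  # one matching entry is enough
    return shown


# Invented sample output, for illustration only.
sample = [
    "Number of weights = 42",
    "some other chatter",
    "getValue() (loglikelihood) = -123.4",
    "CRF finished one iteration",
]
```

With this sample, `filter_lines(sample, 1)` keeps the first and last lines, while `filter_lines(sample, 2)` also keeps the loglikelihood line.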

Property Details

filename

The filename of the crf model file that backs this MalletCRF. The crf model file is actually a zip file, and it contains one file for the serialized model (crf-model.ser) and one file for information about the structure of the CRF (crf-info.xml).

Get Method: _get_filename(self)

feature_detector

The feature detector function that is used to convert tokens to featuresets. This function has the signature:

   feature_detector(tokens, index) -> featureset

Get Method: _get_feature_detector(self)
