train(cls,
feature_detector,
corpus,
filename=None,
weight_groups=None,
gaussian_variance=1,
default_label='O',
transduction_type='VITERBI',
max_iterations=500,
add_start_state=True,
add_end_state=True,
trace=1)
Class Method
Train a new linear-chain CRF tagger on the given corpus of
training sequences. The tagger is backed by a CRF model
file, which contains both a serialized Mallet model and information about
the CRF's structure. This model file is not deleted
automatically; if you want to remove it, you must delete it
manually. The filename of the model file for a MalletCRF
crf is available as crf.filename.
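Because the model file persists after training, cleanup is left to the caller. A minimal sketch, assuming only that the tagger object exposes the filename attribute described above; the helper name delete_model_file is hypothetical and not part of NLTK:

```python
import os

def delete_model_file(crf):
    """Delete the model file backing `crf`; return True if a file was removed.

    Hypothetical helper: relies only on the documented `crf.filename` attribute.
    """
    path = crf.filename
    if path and os.path.exists(path):
        os.remove(path)
        return True
    return False
```

Calling the helper a second time on the same tagger returns False, since the file is already gone.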
- Parameters:
corpus (list of list of tuple) - Training data, represented as a list of sentences, where each
sentence is a list of (token, tag) tuples.
filename (str) - The filename to use for the CRF model file that
backs the new MalletCRF. If no filename is given,
a new filename is chosen automatically.
weight_groups (list of CRFInfo.WeightGroup) - Specifies how input features are mapped to joint features.
See CRFInfo.WeightGroup for more information.
gaussian_variance (float) - The variance of the Gaussian prior used to train
the new CRF.
default_label (str) - The "label for initial context and uninteresting
tokens" (from Mallet's SimpleTagger.java). It is unclear
whether this currently has any effect.
transduction_type (str) - The type of transduction used by the CRF. Can be VITERBI,
VITERBI_FBEAM, VITERBI_BBEAM, VITERBI_FBBEAM, or VITERBI_FBEAMKL.
max_iterations (int) - The maximum number of iterations to use when training
the CRF.
add_start_state (bool) - If true, NLTK adds a special start state named
'__start__'. The initial cost for the start state
is set to 0, and the initial cost for all other states
is set to +inf.
add_end_state (bool) - If true, NLTK adds a special end state named
'__end__'. The final cost for the end state is
set to 0, and the final cost for all other states is set to
+inf.
trace (int) - Controls the verbosity of trace output generated while training
the CRF. Higher numbers generate more verbose output.
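
The inputs described above can be sketched as follows. The corpus layout, keyword names, default values, and transduction types come from the signature and parameter list; the feature detector signature (tokens, index) -> dict is an assumption, since this section does not document feature_detector, and the names word_features, TRANSDUCTION_TYPES, and train_options are illustrative only.

```python
# Training data: a list of sentences, each a list of (token, tag) tuples.
corpus = [
    [("John", "PER"), ("lives", "O"), ("in", "O"), ("Paris", "LOC")],
    [("Mary", "PER"), ("sleeps", "O")],
]

# Values documented for transduction_type.
TRANSDUCTION_TYPES = {
    "VITERBI", "VITERBI_FBEAM", "VITERBI_BBEAM",
    "VITERBI_FBBEAM", "VITERBI_FBEAMKL",
}

def word_features(tokens, index):
    """Assumed detector signature: return a feature dict for tokens[index]."""
    word = tokens[index]
    return {
        "lower": word.lower(),
        "capitalized": word[:1].isupper(),
        "first": index == 0,
    }

def train_options(transduction_type="VITERBI", max_iterations=500, trace=1):
    """Build keyword arguments for train(), rejecting undocumented types."""
    if transduction_type not in TRANSDUCTION_TYPES:
        raise ValueError("unknown transduction type: %r" % (transduction_type,))
    return {
        "gaussian_variance": 1.0,
        "default_label": "O",
        "transduction_type": transduction_type,
        "max_iterations": max_iterations,
        "add_start_state": True,
        "add_end_state": True,
        "trace": trace,
    }

# With a MalletCRF class available, the call would look like:
#   crf = MalletCRF.train(word_features, corpus, **train_options(max_iterations=100))
```

Validating transduction_type before the call is a design choice of this sketch, intended to surface a misspelled value before training starts.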