train(cls, feature_detector, corpus, filename=None, weight_groups=None, gaussian_variance=1, default_label='O', transduction_type='VITERBI', max_iterations=500, add_start_state=True, add_end_state=True, trace=1)
Class Method
Train a new linear-chain CRF tagger based on the given corpus of
training sequences. The tagger is backed by a CRF model file that
contains both a serialized Mallet model and information about the
CRF's structure. This model file is not deleted automatically; to
remove it, you must delete it yourself. For a MalletCRF named crf,
the name of the model file is available as crf.filename.
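The following is a minimal sketch of how a call to this method might look, not a definitive recipe. It assumes an NLTK release that still ships nltk.tag.crf.MalletCRF and a working Mallet installation. The feature detector and its (tokens, index) signature, the sample corpus, and the helper names train_tagger and delete_model_file are all illustrative assumptions.

```python
import os

# Training data: a list of sentences, each a list of (token, tag) tuples.
corpus = [
    [("John", "NNP"), ("runs", "VBZ")],
    [("Mary", "NNP"), ("sleeps", "VBZ")],
]


def feature_detector(tokens, index):
    """Hypothetical feature detector.

    Assumption: the detector is called with the token list and the index
    of the token to describe, and returns a dict of features.
    """
    word = tokens[index]
    return {
        "lower": word.lower(),
        "is_capitalized": word[0].isupper(),
        "suffix2": word[-2:],
    }


def train_tagger(corpus, feature_detector, filename=None):
    """Train a tagger (hypothetical helper; not called in this sketch).

    Assumption: nltk.tag.crf.MalletCRF is importable and Mallet is
    installed and configured.
    """
    from nltk.tag.crf import MalletCRF

    return MalletCRF.train(
        feature_detector,
        corpus,
        filename=filename,
        max_iterations=100,
        trace=0,
    )


def delete_model_file(crf):
    """Remove the model file backing a tagger; it is never removed automatically."""
    os.remove(crf.filename)
```

Once the tagger is no longer needed, calling delete_model_file(crf) removes the file named by crf.filename.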
- Parameters:
corpus (list of tuple) - Training data, represented as a list of sentences, where each
sentence is a list of (token, tag) tuples.
filename (str) - The filename that should be used for the crf model file that
backs the new MalletCRF. If no filename is given,
then a new filename will be chosen automatically.
weight_groups (list of CRFInfo.WeightGroup) - Specifies how input-features should be mapped to joint-features.
See CRFInfo.WeightGroup for more information.
gaussian_variance (float) - The variance of the Gaussian prior that should be used to train
the new CRF.
default_label (str) - The "label for initial context and uninteresting
tokens" (from Mallet's SimpleTagger.java). It is unclear
whether this currently has any effect.
transduction_type (str) - The type of transduction used by the CRF. Can be VITERBI,
VITERBI_FBEAM, VITERBI_BBEAM, VITERBI_FBBEAM, or VITERBI_FBEAMKL.
max_iterations (int) - The maximum number of iterations that should be used for training
the CRF.
add_start_state (bool) - If true, then NLTK will add a special start state, named
'__start__'. The initial cost for the start state
will be set to 0, and the initial cost for all other states will
be set to +inf.
add_end_state (bool) - If true, then NLTK will add a special end state, named
'__end__'. The final cost for the end state will be
set to 0, and the final cost for all other states will be set to
+inf.
trace (int) - Controls the verbosity of trace output generated while training
the CRF. Higher numbers generate more verbose output.
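To illustrate the effect of the 0 and +inf costs described for add_start_state and add_end_state, here is a toy minimum-cost Viterbi search over state sequences. It is a simplified sketch, not NLTK's or Mallet's implementation: it has no emission costs, and the states "A" and "B", the transition costs, and the function name viterbi_min_cost are invented for the example.

```python
import math

INF = math.inf


def viterbi_min_cost(n_steps, states, initial_cost, final_cost, transition_cost):
    """Return (cost, path) for the cheapest state sequence of length n_steps."""
    # best[s] holds the cheapest (cost, path) among paths ending in state s.
    best = {s: (initial_cost[s], [s]) for s in states}
    for _ in range(n_steps - 1):
        new_best = {}
        for s in states:
            cost, path = min(
                ((best[p][0] + transition_cost[p][s], best[p][1]) for p in states),
                key=lambda cp: cp[0],
            )
            new_best[s] = (cost, path + [s])
        best = new_best
    # Add the final cost of the last state before picking the winner.
    return min(
        ((c + final_cost[s], p) for s, (c, p) in best.items()),
        key=lambda cp: cp[0],
    )


states = ["__start__", "A", "B", "__end__"]

# Start state costs 0 to begin in; every other state costs +inf.
initial_cost = {s: (0.0 if s == "__start__" else INF) for s in states}
# End state costs 0 to finish in; every other state costs +inf.
final_cost = {s: (0.0 if s == "__end__" else INF) for s in states}
# Invented transition costs: moving into "A" is cheaper than any other move.
transition_cost = {
    p: {s: (0.5 if s == "A" else 1.0) for s in states} for p in states
}

cost, path = viterbi_min_cost(4, states, initial_cost, final_cost, transition_cost)
print(cost, path)
```

Because every path that begins outside '__start__' or finishes outside '__end__' accumulates an infinite cost, the cheapest path is forced to begin in the start state and finish in the end state.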