nltk.tbl package

Submodules

nltk.tbl.api module

nltk.tbl.demo module

nltk.tbl.demo.corpus_size(seqs)[source]
nltk.tbl.demo.demo()[source]

Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.
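A minimal invocation sketch (assuming the required NLTK corpora are installed; the call trains a demo tagger with default settings and prints diagnostics):

>>> from nltk.tbl import demo
>>> demo.demo()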

nltk.tbl.demo.demo_error_analysis()[source]

Writes a file with context for each erroneous word after tagging testing data

nltk.tbl.demo.demo_generated_templates()[source]

Template.expand and Feature.expand are class methods facilitating generating large amounts of templates. See their documentation for details.

Note: training with 500 templates can easily fill all available memory, even on relatively small corpora
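A sketch of such mass generation, using the concrete Word and Pos features from nltk.tag.brill (the expand() calls mirror the Feature.expand examples further down this page):

>>> from nltk.tag.brill import Word, Pos
>>> from nltk.tbl.template import Template
>>> wordtpls = Word.expand([-2,-1,0,1], [1,2], excludezero=False)
>>> postpls = Pos.expand([-3,-2,-1,0,1,2], [1,2,3], excludezero=True)
>>> templates = list(Template.expand([wordtpls, postpls]))  # templates with one or two features each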

nltk.tbl.demo.demo_high_accuracy_rules()[source]

Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules which are more interesting for a human to read.

nltk.tbl.demo.demo_learning_curve()[source]

Plot a learning curve – the contribution of the individual rules to tagging accuracy. Note: requires matplotlib

nltk.tbl.demo.demo_multifeature_template()[source]

Templates can have more than a single feature.

nltk.tbl.demo.demo_multiposition_feature()[source]

Each feature of a template takes a list of positions relative to the current word, at which the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.

For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is the same as Pos([-3, -2, -1]).
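A sketch of the two equivalent spellings (Pos is a concrete subclass from nltk.tag.brill):

>>> from nltk.tag.brill import Pos
>>> Pos(-3, -1)           # 2-arg form, inclusive end points
Pos([-3, -2, -1])
>>> Pos([-3, -2, -1])     # explicit list form
Pos([-3, -2, -1])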

nltk.tbl.demo.demo_repr_rule_format()[source]

Exemplify repr(Rule) (see also str(Rule) and Rule.format(“verbose”))

nltk.tbl.demo.demo_serialize_tagger()[source]

Serializes the learned tagger to a file in pickle format; reloads it and validates the process.
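A sketch of such a round-trip with the standard-library pickle module (here tagger stands for any trained tagger object, e.g. one produced by postag() below; the filename is illustrative):

>>> import pickle
>>> with open("tagger.pickle", "wb") as f:   # serialize
...     pickle.dump(tagger, f)
>>> with open("tagger.pickle", "rb") as f:   # reload
...     tagger2 = pickle.load(f)
>>> tagger2.tag("This is a test".split()) == tagger.tag("This is a test".split())
True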

nltk.tbl.demo.demo_str_rule_format()[source]

Exemplify str(Rule) (see also repr(Rule) and Rule.format(“verbose”))

nltk.tbl.demo.demo_template_statistics()[source]

Show aggregate statistics per template. Little used templates are candidates for deletion, much used templates may possibly be refined.

Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor).

nltk.tbl.demo.demo_verbose_rule_format()[source]

Exemplify Rule.format(“verbose”)

nltk.tbl.demo.postag(templates=None, tagged_data=None, num_sents=1000, max_rules=300, min_score=3, min_acc=None, train=0.8, trace=3, randomize=False, ruleformat='str', incremental_stats=False, template_stats=False, error_output=None, serialize_output=None, learning_curve_output=None, learning_curve_take=300, baseline_backoff_tagger=None, separate_baseline_data=False, cache_baseline_tagger=None)[source]

Brill Tagger Demonstration

Parameters:
  • templates (list of Template) – templates to be used in training
  • tagged_data (list(list(tuple(str, str)))) – the corpus of tagged sentences; if None, a default corpus is used
  • num_sents (C{int}) – how many sentences of training and testing data to use
  • max_rules (C{int}) – maximum number of rule instances to create
  • min_score (C{int}) – the minimum score for a rule in order for it to be considered
  • min_acc (C{float}) – the minimum accuracy for a rule in order for it to be considered
  • train (C{float}) – the fraction of the corpus to be used for training (1=all)
  • trace (C{int}) – the level of diagnostic tracing output to produce (0-4)
  • randomize (C{bool}) – whether the training data should be a random subset of the corpus
  • ruleformat (C{str}) – rule output format, one of “str”, “repr”, “verbose”
  • incremental_stats (C{bool}) – if true, will tag incrementally and collect stats for each rule (rather slow)
  • template_stats (C{bool}) – if true, will print per-template statistics collected in training and (optionally) testing
  • error_output (C{string}) – the file where errors will be saved
  • serialize_output (C{string}) – the file where the learned tbl tagger will be saved
  • learning_curve_output (C{string}) – filename of plot of learning curve(s) (train and also test, if available)
  • learning_curve_take (C{int}) – how many rules plotted
  • baseline_backoff_tagger (tagger) – the backoff tagger used by the baseline tagger
  • separate_baseline_data (C{bool}) – use a fraction of the training data exclusively for training baseline
  • cache_baseline_tagger (C{string}) – cache baseline tagger to this file (only interesting as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions)

Note on separate_baseline_data: if False, the training data is reused both for the baseline and the rule learner. This is fast and fine for a demo, but is likely to generalize worse on unseen data. It also cannot be sensibly used for learning curves on training data (the baseline will be artificially high).
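A sketch of a typical call (parameter values are illustrative; with tagged_data=None a default corpus is loaded):

>>> from nltk.tbl.demo import postag
>>> postag(num_sents=500, max_rules=50, min_score=3, trace=1)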

nltk.tbl.erroranalysis module

nltk.tbl.erroranalysis.error_list(train_sents, test_sents)[source]

Returns a list of human-readable strings indicating the errors in the given tagging of the corpus.

Parameters:
  • train_sents (list(tuple)) – The correct tagging of the corpus
  • test_sents (list(tuple)) – The tagged corpus
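A sketch of typical use, assuming both arguments are parallel lists of tagged sentences (output lines are human-readable error descriptions):

>>> from nltk.tbl.erroranalysis import error_list
>>> gold = [[('the', 'DT'), ('dog', 'NN')]]
>>> tagged = [[('the', 'DT'), ('dog', 'VB')]]
>>> for line in error_list(gold, tagged):
...     print(line)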

nltk.tbl.feature module

class nltk.tbl.feature.Feature(positions, end=None)[source]

Bases: object

An abstract base class for Features. A Feature is a combination of a specific property-computing method and a list of relative positions to apply that method to.

The property-computing method, extract_property(tokens, index), must be implemented by every subclass. It extracts or computes a specific property for the token at the current index. Typical extract_property() methods return features such as the token text or tag; but more involved methods may consider the entire sequence of tokens and, for instance, compute the length of the sentence the token belongs to.

In addition, the subclass may have a PROPERTY_NAME, which is how it will be printed (in Rules and Templates, etc). If not given, defaults to the classname.
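A minimal sketch of a concrete subclass (WordLength is hypothetical; the real examples are nltk.tag.brill.Word and Pos):

>>> from nltk.tbl.feature import Feature
>>> class WordLength(Feature):
...     # hypothetical property: the length of the token text
...     PROPERTY_NAME = 'wordlen'
...     @staticmethod
...     def extract_property(tokens, index):
...         return len(tokens[index][0])   # tokens are (word, tag) pairs
>>> WordLength([-1, 0])
WordLength([-1, 0])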

PROPERTY_NAME = None
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
classmethod expand(starts, winlens, excludezero=False)[source]

Return a list of features, one for each start point in starts and for each window length in winlens. If excludezero is True, no Features containing 0 in their positions will be generated (many tbl trainers have a special representation for the target feature at [0])

For instance, importing a concrete subclass (Feature is abstract):

>>> from nltk.tag.brill import Word

First argument gives the possible start positions, second the possible window lengths:

>>> Word.expand([-3,-2,-1], [1])
[Word([-3]), Word([-2]), Word([-1])]
>>> Word.expand([-2,-1], [1])
[Word([-2]), Word([-1])]
>>> Word.expand([-3,-2,-1], [1,2])
[Word([-3]), Word([-2]), Word([-1]), Word([-3, -2]), Word([-2, -1])]

A third optional argument excludes all Features whose positions contain zero:

>>> Word.expand([-2,-1,0], [1,2], excludezero=False)
[Word([-2]), Word([-1]), Word([0]), Word([-2, -1]), Word([-1, 0])]
>>> Word.expand([-2,-1,0], [1,2], excludezero=True)
[Word([-2]), Word([-1]), Word([-2, -1])]

All window lengths must be positive:

>>> Word.expand([-2,-1], [0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "nltk/tag/tbl/template.py", line 371, in expand
ValueError: non-positive window length in [0]

Parameters:
  • starts (list of ints) – where to start looking for Feature
  • winlens (list of ints) – window lengths at which to look for Feature
  • excludezero (bool) – do not output any Feature with 0 in any of its positions
Returns:

list of Features

Raises:

ValueError – for non-positive window lengths

static extract_property(tokens, index)[source]

Any subclass of Feature must define static method extract_property(tokens, index)

Parameters:
  • tokens (list of tokens) – the sequence of tokens
  • index (int) – the current index
Returns:

feature value

Return type:

any (but usually scalar)

intersects(other)[source]

Return True if the positions of this Feature intersect with those of other

More precisely, return True if this feature refers to the same property as other; and there is some overlap in the positions they look at.

For instance, importing a concrete subclass (Feature is abstract):

>>> from nltk.tag.brill import Word, Pos

>>> Word([-3,-2,-1]).intersects(Word([-3,-2]))
True
>>> Word([-3,-2,-1]).intersects(Word([-3,-2, 0]))
True
>>> Word([-3,-2,-1]).intersects(Word([0]))
False

Feature subclasses must agree:

>>> Word([-3,-2,-1]).intersects(Pos([-3,-2]))
False

Parameters:
  • other ((subclass of) Feature) – feature with which to compare
Returns:

True if feature classes agree and there is some overlap in the positions they look at

Return type:

bool
issuperset(other)[source]

Return True if this Feature always returns True when other does

More precisely, return True if this feature refers to the same property as other; and this Feature looks at all positions that other does (and possibly other positions in addition).

For instance, importing a concrete subclass (Feature is abstract):

>>> from nltk.tag.brill import Word, Pos

>>> Word([-3,-2,-1]).issuperset(Word([-3,-2]))
True
>>> Word([-3,-2,-1]).issuperset(Word([-3,-2, 0]))
False

Feature subclasses must agree:

>>> Word([-3,-2,-1]).issuperset(Pos([-3,-2]))
False

Parameters:
  • other ((subclass of) Feature) – feature with which to compare
Returns:

True if this feature is a superset, otherwise False

Return type:

bool
json_tag = 'nltk.tbl.Feature'

nltk.tbl.rule module

class nltk.tbl.rule.Rule(templateid, original_tag, replacement_tag, conditions)[source]

Bases: nltk.tbl.rule.TagRule

A Rule checks the current corpus position for a certain set of conditions; if they are all fulfilled, the Rule is triggered, meaning that it will change tag A to tag B. Tokens with tags other than A are left unchanged.

The conditions are parameters to the Rule instance. Each condition is a feature-value pair, with a set of positions to check for the value of the corresponding feature. Conceptually, the positions are joined by logical OR, and the feature set by logical AND.

More formally, the Rule is then applicable to the nth token iff:

  • the nth token is tagged with the Rule’s original tag; and

  • for each (Feature(positions), value) tuple, the value of Feature of at least one token in {n+p for p in positions} is value.
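A sketch of these semantics (Word and Pos are concrete features from nltk.tag.brill; the rule reads: change NN to VB if the preceding tag is TO and the current word is "run"):

>>> from nltk.tbl.rule import Rule
>>> from nltk.tag.brill import Pos, Word
>>> r = Rule("1", "NN", "VB", [(Pos([-1]), "TO"), (Word([0]), "run")])
>>> sent = [("to", "TO"), ("run", "NN")]
>>> r.applies(sent, 1)   # both conditions hold and tokens[1] is tagged NN
True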

applies(tokens, index)[source]
Parameters:
  • tokens (list(tuple(str, str))) – a tagged sentence
  • index (int) – the index to check
Returns:

True if the rule would change the tag of tokens[index], False otherwise

Return type:

bool
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
format(fmt)[source]

Return a string representation of this rule.

>>> from nltk.tbl.rule import Rule
>>> from nltk.tag.brill import Pos
>>> r = Rule("23", "VB", "NN", [(Pos([-2,-1]), 'DT')])

>>> r.format("str") == str(r)
True
>>> r.format("str")
'VB->NN if Pos:DT@[-2,-1]'

>>> r.format("repr") == repr(r)
True
>>> r.format("repr")
"Rule('23', 'VB', 'NN', [(Pos([-2, -1]),'DT')])"

>>> r.format("verbose")
'VB -> NN if the Pos of words i-2...i-1 is "DT"'
>>> r.format("not_found")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "nltk/tbl/rule.py", line 256, in format
    raise ValueError("unknown rule format spec: {0}".format(fmt))
ValueError: unknown rule format spec: not_found
Parameters:
  • fmt (str) – format specification
Returns:

string representation

Return type:

str
json_tag = 'nltk.tbl.Rule'
unicode_repr()

Return repr(self).

class nltk.tbl.rule.TagRule(original_tag, replacement_tag)[source]

Bases: object

An interface for tag transformations on a tagged corpus, as performed by tbl taggers. Each transformation finds all tokens in the corpus that are tagged with a specific original tag and satisfy a specific condition, and replaces their tags with a replacement tag. For any given transformation, the original tag, replacement tag, and condition are fixed. Conditions may depend on the token under consideration, as well as any other tokens in the corpus.

Tag rules must be comparable and hashable.

applies(tokens, index)[source]
Parameters:
  • tokens (list(tuple(str, str))) – a tagged sentence
  • index (int) – the index to check
Returns:

True if the rule would change the tag of tokens[index], False otherwise

Return type:

bool
apply(tokens, positions=None)[source]

Apply this rule at every position in positions where it applies to the given sentence. I.e., for each position p in positions, if tokens[p] is tagged with this rule’s original tag, and satisfies this rule’s condition, then set its tag to be this rule’s replacement tag.

Parameters:
  • tokens (list(tuple(str, str))) – The tagged sentence
  • positions (list(int)) – The positions where the transformation is to be tried. If not specified, try it at all positions.
Returns:

The indices of tokens whose tags were changed by this rule.

Return type:

list(int)
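A sketch of in-place application, using the concrete Rule subclass from this module (the return value lists the changed indices):

>>> from nltk.tbl.rule import Rule
>>> from nltk.tag.brill import Pos, Word
>>> r = Rule("1", "NN", "VB", [(Pos([-1]), "TO"), (Word([0]), "run")])
>>> sent = [("to", "TO"), ("run", "NN")]
>>> r.apply(sent)
[1]
>>> sent   # the tag of "run" has been rewritten in place
[('to', 'TO'), ('run', 'VB')]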

original_tag = None

The tag which this TagRule may cause to be replaced.

replacement_tag = None

The tag with which this TagRule may replace another tag.

nltk.tbl.template module

class nltk.tbl.template.BrillTemplateI[source]

Bases: object

An interface for generating lists of transformational rules that apply at given sentence positions. BrillTemplateI is used by Brill training algorithms to generate candidate rules.

applicable_rules(tokens, i, correctTag)[source]

Return a list of the transformational rules that would correct the ith subtoken’s tag in the given token. In particular, return a list of zero or more rules that would change tokens[i][1] to correctTag, if applied to token[i].

If the ith token already has the correct tag (i.e., if tagged_tokens[i][1] == correctTag), then applicable_rules() should return the empty list.

Parameters:
  • tokens (list(tuple)) – The tagged tokens being tagged.
  • i (int) – The index of the token whose tag should be corrected.
  • correctTag (any) – The correct tag for the ith token.
Return type:

list(BrillRule)

get_neighborhood(token, index)[source]

Returns the set of indices i such that applicable_rules(token, i, ...) depends on the value of the index-th token of token.

This method is used by the “fast” Brill tagger trainer.

Parameters:
  • token (list(tuple)) – The tokens being tagged.
  • index (int) – The index whose neighborhood should be returned.
Return type:

set

class nltk.tbl.template.Template(*features)[source]

Bases: nltk.tbl.template.BrillTemplateI

A tbl Template that generates a list of Rules that apply at a given sentence position. In particular, each Template is parameterized by a list of independent features (each a combination of a specific property to extract and a list L of relative positions at which to extract it) and generates all Rules that (see the construction sketch after the list):

  • use the given features, each at its own independent position; and
  • are applicable to the given token.
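A construction sketch, using the concrete Word and Pos features from nltk.tag.brill (this template generates rules conditioned on the previous tag and the current word):

>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Pos, Word
>>> Template(Pos([-1]), Word([0]))
Template(Pos([-1]),Word([0]))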
ALLTEMPLATES = []
applicable_rules(tokens, index, correct_tag)[source]

Return a list of the transformational rules that would correct the tag of the token at index in the given tokens. In particular, return a list of zero or more rules that would change tokens[index][1] to correct_tag, if applied to tokens[index].

If the token at index already has the correct tag (i.e., if tokens[index][1] == correct_tag), then applicable_rules() should return the empty list.

Parameters:
  • tokens (list(tuple)) – The tagged tokens being tagged.
  • index (int) – The index of the token whose tag should be corrected.
  • correct_tag (any) – The correct tag for the token at index.
Return type:

list(BrillRule)

classmethod expand(featurelists, combinations=None, skipintersecting=True)[source]

Factory method to mass generate Templates from a list L of lists of Features.

With combinations=(k1, k2), the function will in all possible ways choose k1 … k2 of the sublists in L; it will output all Templates formed by the Cartesian product of this selection, with duplicates and other semantically equivalent forms removed. Default for combinations is (1, len(L)).

The feature lists may have been specified manually, or generated from Feature.expand(). For instance,

>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Word, Pos

Creating some features:

>>> (wd_0, wd_01) = (Word([0]), Word([0,1]))
>>> (pos_m2, pos_m33) = (Pos([-2]), Pos([-3,-2,-1,0,1,2,3]))

>>> list(Template.expand([[wd_0], [pos_m2]]))
[Template(Word([0])), Template(Pos([-2])), Template(Pos([-2]),Word([0]))]
>>> list(Template.expand([[wd_0, wd_01], [pos_m2]]))
[Template(Word([0])), Template(Word([0, 1])), Template(Pos([-2])), Template(Pos([-2]),Word([0])), Template(Pos([-2]),Word([0, 1]))]

Note: with Feature.expand(), it is very easy to generate more templates than your system can handle – for instance:

>>> wordtpls = Word.expand([-2,-1,0,1], [1,2], excludezero=False)
>>> len(wordtpls)
7

>>> postpls = Pos.expand([-3,-2,-1,0,1,2], [1,2,3], excludezero=True)
>>> len(postpls)
9

And now the Cartesian product of all non-empty combinations of two wordtpls and two postpls, with semantic equivalents removed:

>>> templates = list(Template.expand([wordtpls, wordtpls, postpls, postpls]))
>>> len(templates)
713

By way of comparison, expanding two two-element feature lists, such as [[wd_0, wd_01], [Pos([-2]), Pos([-1])]], will return a list of just eight templates: [Template(Word([0])), Template(Word([0, 1])), Template(Pos([-2])), Template(Pos([-1])), Template(Pos([-2]),Word([0])), Template(Pos([-1]),Word([0])), Template(Pos([-2]),Word([0, 1])), Template(Pos([-1]),Word([0, 1]))]

Templates where one feature is a subset of another, such as Template(Word([0,1]), Word([1])), will not appear in the output. By default, this non-subset constraint is tightened to disjointness: Templates of the type Template(Word([0,1]), Word([1,2])) will also be filtered out. With skipintersecting=False, such Templates are allowed.

WARNING: this method makes it very easy to fill all your memory when training generated templates on any real-world corpus

Parameters:
  • featurelists (list of (list of Features)) – lists of Features, whose Cartesian product will return a set of Templates
  • combinations (None, int, or (int, int)) – given n featurelists: if combinations=k, all generated Templates will have k features; if combinations=(k1,k2) they will have k1..k2 features; if None, defaults to 1..n
  • skipintersecting (bool) – if True, do not output intersecting Templates (non-disjoint positions for some feature)
Returns:

generator of Templates

get_neighborhood(tokens, index)[source]

Returns the set of indices i such that applicable_rules(tokens, i, ...) depends on the value of the index-th token of tokens.

This method is used by the “fast” Brill tagger trainer.

Parameters:
  • tokens (list(tuple)) – The tokens being tagged.
  • index (int) – The index whose neighborhood should be returned.
Return type:

set

Module contents

Transformation Based Learning

A general purpose package for Transformation Based Learning, currently used by nltk.tag.BrillTagger.
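A minimal end-to-end sketch of how the pieces fit together (assuming the treebank corpus is installed; fntbl37 is one of the ready-made template sets in nltk.tag.brill):

>>> from nltk.corpus import treebank
>>> from nltk.tag import UnigramTagger
>>> from nltk.tag.brill import fntbl37
>>> from nltk.tag.brill_trainer import BrillTaggerTrainer
>>> train_sents = treebank.tagged_sents()[:500]
>>> baseline = UnigramTagger(train_sents)                # initial tagger to be corrected
>>> trainer = BrillTaggerTrainer(baseline, fntbl37(), trace=0)
>>> tagger = trainer.train(train_sents, max_rules=10)    # learns up to 10 rules
>>> tagger.rules()[:2]                                   # inspect the first learned rules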