nltk.tbl package¶
Submodules¶
nltk.tbl.api module¶
nltk.tbl.demo module¶
-
nltk.tbl.demo.
demo
()[source]¶ Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.
-
nltk.tbl.demo.
demo_error_analysis
()[source]¶ Writes a file with context for each erroneous word after tagging testing data
-
nltk.tbl.demo.
demo_generated_templates
()[source]¶ Template.expand and Feature.expand are class methods that facilitate generating large numbers of templates. See their documentation for details.
Note: training with 500 templates can easily fill all available memory, even on relatively small corpora.
-
nltk.tbl.demo.
demo_high_accuracy_rules
()[source]¶ Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules that are more interesting for a human to read.
-
nltk.tbl.demo.
demo_learning_curve
()[source]¶ Plot a learning curve – the contribution of the individual rules to tagging accuracy. Note: requires matplotlib.
-
nltk.tbl.demo.
demo_multiposition_feature
()[source]¶ The feature(s) of a template take a list of positions relative to the current word where the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.
For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is the same as Pos([-3, -2, -1]).
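The OR semantics over positions can be sketched in plain Python. This is a standalone illustration, not NLTK's implementation: `expand_positions` and `holds` are hypothetical helper names, and positions that fall outside the sentence are assumed to be skipped.

```python
def expand_positions(start, end=None):
    """Return explicit relative positions.

    With one argument, `start` is already a list of positions.
    With two arguments, they are inclusive end points of a contiguous range.
    """
    if end is None:
        return sorted(set(start))
    return list(range(start, end + 1))


def holds(tags, index, positions, value):
    """True if `value` occurs at ANY of the relative positions (logical OR)."""
    return any(
        0 <= index + p < len(tags) and tags[index + p] == value
        for p in positions
    )


tags = ["DT", "NN", "VB", "DT", "NN"]
print(expand_positions(-3, -1))       # [-3, -2, -1]
print(holds(tags, 2, [-1, 1], "DT"))  # True: "DT" is one step to the right
print(holds(tags, 2, [-1, 1], "VB"))  # False: "VB" only occurs at position 0
```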
-
nltk.tbl.demo.
demo_repr_rule_format
()[source]¶ Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose"))
-
nltk.tbl.demo.
demo_serialize_tagger
()[source]¶ Serializes the learned tagger to a file in pickle format; reloads it and validates the process.
-
nltk.tbl.demo.
demo_str_rule_format
()[source]¶ Exemplify str(Rule) (see also repr(Rule) and Rule.format("verbose"))
-
nltk.tbl.demo.
demo_template_statistics
()[source]¶ Show aggregate statistics per template. Little-used templates are candidates for deletion; much-used templates may possibly be refined.
Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor).
-
nltk.tbl.demo.
postag
(templates=None, tagged_data=None, num_sents=1000, max_rules=300, min_score=3, min_acc=None, train=0.8, trace=3, randomize=False, ruleformat='str', incremental_stats=False, template_stats=False, error_output=None, serialize_output=None, learning_curve_output=None, learning_curve_take=300, baseline_backoff_tagger=None, separate_baseline_data=False, cache_baseline_tagger=None)[source]¶ Brill Tagger Demonstration
Parameters: - templates (list of Template) – the templates to be used in rule learning
- tagged_data (list of tagged sentences) – the tagged corpus from which training and testing data are drawn
- num_sents (int) – how many sentences of training and testing data to use
- max_rules (int) – maximum number of rule instances to create
- min_score (int) – the minimum score for a rule in order for it to be considered
- min_acc (float) – the minimum accuracy for a rule in order for it to be considered
- train (float) – the fraction of the corpus to be used for training (1=all)
- trace (int) – the level of diagnostic tracing output to produce (0-4)
- randomize (bool) – whether the training data should be a random subset of the corpus
- ruleformat (str) – rule output format, one of "str", "repr", "verbose"
- incremental_stats (bool) – if true, will tag incrementally and collect stats for each rule (rather slow)
- template_stats (bool) – if true, will print per-template statistics collected in training and (optionally) testing
- error_output (str) – the file where errors will be saved
- serialize_output (str) – the file where the learned tbl tagger will be saved
- learning_curve_output (str) – filename of plot of learning curve(s) (train and also test, if available)
- learning_curve_take (int) – how many rules to plot
- baseline_backoff_tagger (tagger) – the backoff tagger to be used by the baseline tagger
- separate_baseline_data (bool) – use a fraction of the training data exclusively for training baseline
- cache_baseline_tagger (str) – cache baseline tagger to this file (only interesting as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions)
Note on separate_baseline_data: if False, reuse training data both for baseline and rule learner. This is fast and fine for a demo, but is likely to generalize worse on unseen data. Also, it cannot be sensibly used for learning curves on training data (the baseline will be artificially high).
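The interplay of train and separate_baseline_data can be sketched as a simple data split. This is a standalone sketch under stated assumptions, not the demo's actual code: `split_data` is a hypothetical helper, and reserving the first third of the training portion for the baseline is an assumed proportion chosen for illustration.

```python
def split_data(sents, train=0.8, separate_baseline_data=False):
    """Split sentences into (baseline, training, test) portions."""
    cutoff = int(len(sents) * train)
    training, test = sents[:cutoff], sents[cutoff:]
    if separate_baseline_data:
        # Assumption: the first third of the training portion is reserved
        # exclusively for the baseline tagger.
        bl_cutoff = len(training) // 3
        baseline, training = training[:bl_cutoff], training[bl_cutoff:]
    else:
        # Baseline tagger and rule learner see the same sentences.
        baseline = training
    return baseline, training, test


sents = list(range(40))  # stand-ins for tagged sentences

shared = split_data(sents, train=0.75)
print([len(part) for part in shared])    # [30, 30, 10]

separate = split_data(sents, train=0.75, separate_baseline_data=True)
print([len(part) for part in separate])  # [10, 20, 10]
```

With shared data the baseline has already seen every training sentence, which is why its accuracy on the training data is artificially high.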
nltk.tbl.erroranalysis module¶
nltk.tbl.feature module¶
-
class
nltk.tbl.feature.
Feature
(positions, end=None)[source]¶ Bases:
object
An abstract base class for Features. A Feature is a combination of a specific property-computing method and a list of relative positions to apply that method to.
The property-computing method, extract_property(tokens, index), must be implemented by every subclass. It extracts or computes a specific property for the token at the current index. Typical extract_property() methods return features such as the token text or tag; but more involved methods may consider the entire sequence tokens and for instance compute the length of the sentence the token belongs to.
In addition, the subclass may have a PROPERTY_NAME, which is how it will be printed (in Rules and Templates, etc). If not given, defaults to the classname.
-
PROPERTY_NAME
= None¶
-
classmethod
expand
(starts, winlens, excludezero=False)[source]¶ Return a list of features, one for each start point in starts and for each window length in winlens. If excludezero is True, no Features containing 0 in their positions will be generated (many tbl trainers have a special representation for the target feature at [0]).
For instance, importing a concrete subclass (Feature is abstract):
>>> from nltk.tag.brill import Word
The first argument gives the possible start positions, the second the possible window lengths:
>>> Word.expand([-3,-2,-1], [1])
[Word([-3]), Word([-2]), Word([-1])]
>>> Word.expand([-2,-1], [1])
[Word([-2]), Word([-1])]
>>> Word.expand([-3,-2,-1], [1,2])
[Word([-3]), Word([-2]), Word([-1]), Word([-3, -2]), Word([-2, -1])]
A third optional argument excludes all Features whose positions contain zero:
>>> Word.expand([-2,-1,0], [1,2], excludezero=False)
[Word([-2]), Word([-1]), Word([0]), Word([-2, -1]), Word([-1, 0])]
>>> Word.expand([-2,-1,0], [1,2], excludezero=True)
[Word([-2]), Word([-1]), Word([-2, -1])]
All window lengths must be positive:
>>> Word.expand([-2,-1], [0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "nltk/tag/tbl/template.py", line 371, in expand
ValueError: non-positive window length in [0]
Parameters: - starts (list of ints) – where to start looking for Feature
- winlens – window lengths where to look for Feature
- excludezero (bool) – do not output any Feature with 0 in any of its positions.
Returns: list of Features
Raises: ValueError – for non-positive window lengths
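The windowing behaviour shown in the examples can be reproduced with a few lines of plain Python. This is a standalone sketch that returns position lists rather than Feature objects; `expand_windows` is a hypothetical name, and the contiguous-slice logic is inferred from the documented outputs.

```python
def expand_windows(starts, winlens, excludezero=False):
    """Return every contiguous window of `starts` for each window length."""
    if not all(w > 0 for w in winlens):
        raise ValueError(f"non-positive window length in {winlens}")
    windows = [
        starts[i:i + w]
        for w in winlens
        for i in range(len(starts) - w + 1)
    ]
    # Optionally drop windows that look at the target position 0.
    return [win for win in windows if not (excludezero and 0 in win)]


print(expand_windows([-3, -2, -1], [1, 2]))
# [[-3], [-2], [-1], [-3, -2], [-2, -1]]
print(expand_windows([-2, -1, 0], [1, 2], excludezero=True))
# [[-2], [-1], [-2, -1]]
```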
-
static
extract_property
(tokens, index)[source]¶ Any subclass of Feature must define static method extract_property(tokens, index)
Parameters: - tokens (list of tokens) – the sequence of tokens
- index (int) – the current index
Returns: feature value
Return type: any (but usually scalar)
-
intersects
(other)[source]¶ Return True if the positions of this Feature intersect with those of other.
More precisely, return True if this feature refers to the same property as other; and there is some overlap in the positions they look at.
For instance, importing a concrete subclass (Feature is abstract):
>>> from nltk.tag.brill import Word, Pos
>>> Word([-3,-2,-1]).intersects(Word([-3,-2]))
True
>>> Word([-3,-2,-1]).intersects(Word([-3,-2, 0]))
True
>>> Word([-3,-2,-1]).intersects(Word([0]))
False
Feature subclasses must agree:
>>> Word([-3,-2,-1]).intersects(Pos([-3,-2]))
False
Parameters: - other ((subclass of) Feature) – feature with which to compare
Returns: True if feature classes agree and there is some overlap in the positions they look at
Return type: bool
-
issuperset
(other)[source]¶ Return True if this Feature always returns True when other does
More precisely, return True if this feature refers to the same property as other; and this Feature looks at all positions that other does (and possibly other positions in addition).
For instance, importing a concrete subclass (Feature is abstract):
>>> from nltk.tag.brill import Word, Pos
>>> Word([-3,-2,-1]).issuperset(Word([-3,-2]))
True
>>> Word([-3,-2,-1]).issuperset(Word([-3,-2, 0]))
False
Feature subclasses must agree:
>>> Word([-3,-2,-1]).issuperset(Pos([-3,-2]))
False
Parameters: - other ((subclass of) Feature) – feature with which to compare
Returns: True if this feature is superset, otherwise False
Return type: bool
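The relationship between a subclass, its extract_property() method, and the two set-like comparisons can be sketched without NLTK. The classes below (`ToyFeature`, `ToyWord`, `ToyPos`) are hypothetical stand-ins written for illustration; they mirror the documented behaviour but are not the library's implementation.

```python
class ToyFeature:
    """Minimal stand-in for a Feature: a property plus relative positions."""

    def __init__(self, positions):
        self.positions = tuple(sorted(set(positions)))

    def intersects(self, other):
        # Same property (class) and at least one shared position.
        return type(self) is type(other) and bool(
            set(self.positions) & set(other.positions)
        )

    def issuperset(self, other):
        # Same property (class) and every position of `other` is covered.
        return type(self) is type(other) and set(self.positions) >= set(
            other.positions
        )


class ToyWord(ToyFeature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]  # the token text


class ToyPos(ToyFeature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]  # the token tag


sent = [("the", "DT"), ("dog", "NN")]
print(ToyWord.extract_property(sent, 1))                        # dog
print(ToyPos.extract_property(sent, 1))                         # NN
print(ToyWord([-3, -2, -1]).intersects(ToyWord([-3, -2, 0])))   # True
print(ToyWord([-3, -2, -1]).issuperset(ToyWord([-3, -2, 0])))   # False
print(ToyWord([-3, -2, -1]).intersects(ToyPos([-3, -2])))       # False
```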
-
json_tag
= 'nltk.tbl.Feature'¶
-
nltk.tbl.rule module¶
-
class
nltk.tbl.rule.
Rule
(templateid, original_tag, replacement_tag, conditions)[source]¶ Bases:
nltk.tbl.rule.TagRule
A Rule checks the current corpus position for a certain set of conditions; if they are all fulfilled, the Rule is triggered, meaning that it will change tag A to tag B. For tags other than A, nothing happens.
The conditions are parameters to the Rule instance. Each condition is a feature-value pair, with a set of positions to check for the value of the corresponding feature. Conceptually, the positions are joined by logical OR, and the feature set by logical AND.
More formally, the Rule is then applicable to the n-th token iff:
- The n-th token is tagged with the Rule’s original tag; and
- For each (Feature(positions), value) tuple, the value of Feature of at least one token in {n+p for p in positions} is value.
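The applicability test above (AND over conditions, OR over positions) can be sketched as a standalone function. `rule_applies` and `pos_of` are hypothetical names, conditions are represented as (extractor, positions, value) triples for simplicity, and positions outside the sentence are assumed never to match.

```python
def rule_applies(tokens, index, original_tag, conditions):
    """True if a rule with these conditions would fire at tokens[index]."""
    # The token must carry the rule's original tag.
    if tokens[index][1] != original_tag:
        return False
    # Every condition must hold (logical AND) ...
    for extract, positions, value in conditions:
        # ... at one or more of its relative positions (logical OR).
        matched = any(
            0 <= index + p < len(tokens) and extract(tokens, index + p) == value
            for p in positions
        )
        if not matched:
            return False
    return True


def pos_of(tokens, i):
    """Extract the tag of the i-th (word, tag) pair."""
    return tokens[i][1]


sent = [("the", "DT"), ("race", "VB"), ("ended", "VBD")]
conditions = [(pos_of, [-2, -1], "DT")]  # a "DT" one or two steps to the left

print(rule_applies(sent, 1, "VB", conditions))  # True
print(rule_applies(sent, 2, "VB", conditions))  # False: tag is "VBD", not "VB"
```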
-
applies
(tokens, index)[source]¶ Parameters: - tokens (list(tuple(str, str))) – A tagged sentence
- index (int) – The index to check
Returns: True if the rule would change the tag of tokens[index], False otherwise
Return type: bool
-
format
(fmt)[source]¶ Return a string representation of this rule.
>>> from nltk.tbl.rule import Rule
>>> from nltk.tag.brill import Pos
>>> r = Rule("23", "VB", "NN", [(Pos([-2,-1]), 'DT')])
>>> r.format("str") == str(r)
True
>>> r.format("str")
'VB->NN if Pos:DT@[-2,-1]'
>>> r.format("repr") == repr(r)
True
>>> r.format("repr")
"Rule('23', 'VB', 'NN', [(Pos([-2, -1]),'DT')])"
>>> r.format("verbose")
'VB -> NN if the Pos of words i-2...i-1 is "DT"'
>>> r.format("not_found")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "nltk/tbl/rule.py", line 256, in format
    raise ValueError("unknown rule format spec: {0}".format(fmt))
ValueError: unknown rule format spec: not_found
Parameters: - fmt (str) – format specification
Returns: string representation
Return type: str
-
json_tag
= 'nltk.tbl.Rule'¶
-
unicode_repr
()¶ Return repr(self).
-
class
nltk.tbl.rule.
TagRule
(original_tag, replacement_tag)[source]¶ Bases:
object
An interface for tag transformations on a tagged corpus, as performed by tbl taggers. Each transformation finds all tokens in the corpus that are tagged with a specific original tag and satisfy a specific condition, and replaces their tags with a replacement tag. For any given transformation, the original tag, replacement tag, and condition are fixed. Conditions may depend on the token under consideration, as well as any other tokens in the corpus.
Tag rules must be comparable and hashable.
-
applies
(tokens, index)[source]¶ Parameters: - tokens (list(tuple(str, str))) – A tagged sentence
- index (int) – The index to check
Returns: True if the rule would change the tag of tokens[index], False otherwise
Return type: bool
-
apply
(tokens, positions=None)[source]¶ Apply this rule at every position in positions where it applies to the given sentence. I.e., for each position p in positions, if tokens[p] is tagged with this rule’s original tag, and satisfies this rule’s condition, then set its tag to be this rule’s replacement tag.
Parameters: - tokens (list(tuple(str, str))) – The tagged sentence
- positions (list(int)) – The positions where the transformation is to be tried. If not specified, try it at all positions.
Returns: The indices of tokens whose tags were changed by this rule.
Return type: list(int)
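A minimal sketch of this apply step, assuming the rule's condition is supplied as a callable. `apply_rule` and `vb_after_dt` are hypothetical names; the sketch collects all applicable positions before changing any tag, so that a change made at one position cannot influence whether the rule fires at another position in the same pass.

```python
def apply_rule(tokens, applies, replacement_tag, positions=None):
    """Retag tokens in place; return the indices whose tags were changed."""
    if positions is None:
        positions = range(len(tokens))
    # Step 1: find where the rule applies, before touching any tag.
    changed = [i for i in positions if applies(tokens, i)]
    # Step 2: make the changes.
    for i in changed:
        tokens[i] = (tokens[i][0], replacement_tag)
    return changed


def vb_after_dt(tokens, i):
    """Hypothetical condition: a 'VB' directly preceded by a 'DT'."""
    return tokens[i][1] == "VB" and i > 0 and tokens[i - 1][1] == "DT"


sent = [("the", "DT"), ("race", "VB"), ("to", "TO"), ("race", "VB")]
print(apply_rule(sent, vb_after_dt, "NN"))  # [1]
print(sent[1])                              # ('race', 'NN')
```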
-
original_tag
= None¶ The tag which this TagRule may cause to be replaced.
-
replacement_tag
= None¶ The tag with which this TagRule may replace another tag.
-
nltk.tbl.template module¶
-
class
nltk.tbl.template.
BrillTemplateI
[source]¶ Bases:
object
An interface for generating lists of transformational rules that apply at given sentence positions.
BrillTemplateI is used by Brill training algorithms to generate candidate rules.
-
applicable_rules
(tokens, i, correctTag)[source]¶ Return a list of the transformational rules that would correct the i-th subtoken’s tag in the given token. In particular, return a list of zero or more rules that would change tokens[i][1] to correctTag, if applied to tokens[i].
If the i-th token already has the correct tag (i.e., if tokens[i][1] == correctTag), then applicable_rules() should return the empty list.
Parameters: - tokens (list(tuple)) – The tagged tokens being tagged.
- i (int) – The index of the token whose tag should be corrected.
- correctTag (any) – The correct tag for the i-th token.
Return type: list(BrillRule)
-
get_neighborhood
(token, index)[source]¶ Returns the set of indices i such that applicable_rules(token, i, ...) depends on the value of the index-th token of token.
This method is used by the “fast” Brill tagger trainer.
Parameters: - token (list(tuple)) – The tokens being tagged.
- index (int) – The index whose neighborhood should be returned.
Return type: set
-
-
class
nltk.tbl.template.
Template
(*features)[source]¶ Bases:
nltk.tbl.template.BrillTemplateI
A tbl Template that generates a list of Rules that apply at a given sentence position. In particular, each Template is parameterized by a list of independent features (a combination of a specific property to extract and a list L of relative positions at which to extract it) and generates all Rules that:
- use the given features, each at its own independent position; and
- are applicable to the given token.
-
ALLTEMPLATES
= []¶
-
applicable_rules
(tokens, index, correct_tag)[source]¶ Return a list of the transformational rules that would correct the index-th subtoken’s tag in the given token. In particular, return a list of zero or more rules that would change tokens[index][1] to correct_tag, if applied to tokens[index].
If the index-th token already has the correct tag (i.e., if tokens[index][1] == correct_tag), then applicable_rules() should return the empty list.
Parameters: - tokens (list(tuple)) – The tagged tokens being tagged.
- index (int) – The index of the token whose tag should be corrected.
- correct_tag (any) – The correct tag for the index-th token.
Return type: list(BrillRule)
-
classmethod
expand
(featurelists, combinations=None, skipintersecting=True)[source]¶ Factory method to mass generate Templates from a list L of lists of Features.
With combinations=(k1, k2), the function will in all possible ways choose k1 … k2 of the sublists in L; it will output all Templates formed by the Cartesian product of this selection, with duplicates and other semantically equivalent forms removed. Default for combinations is (1, len(L)).
The feature lists may have been specified manually, or generated from Feature.expand(). For instance,
>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Word, Pos
Creating some features:
>>> (wd_0, wd_01) = (Word([0]), Word([0,1]))
>>> (pos_m2, pos_m33) = (Pos([-2]), Pos([-3,-2,-1,0,1,2,3]))
>>> list(Template.expand([[wd_0], [pos_m2]]))
[Template(Word([0])), Template(Pos([-2])), Template(Pos([-2]),Word([0]))]
>>> list(Template.expand([[wd_0, wd_01], [pos_m2]]))
[Template(Word([0])), Template(Word([0, 1])), Template(Pos([-2])), Template(Pos([-2]),Word([0])), Template(Pos([-2]),Word([0, 1]))]
Note: with Feature.expand(), it is very easy to generate more templates than your system can handle. For instance:
>>> wordtpls = Word.expand([-2,-1,0,1], [1,2], excludezero=False)
>>> len(wordtpls)
7
>>> postpls = Pos.expand([-3,-2,-1,0,1,2], [1,2,3], excludezero=True)
>>> len(postpls)
9
And now the Cartesian product of all non-empty combinations of two wordtpls and two postpls, with semantic equivalents removed:
>>> templates = list(Template.expand([wordtpls, wordtpls, postpls, postpls]))
>>> len(templates)
713
As a smaller example, expanding the two feature lists [Word([0]), Word([0, 1])] and [Pos([-2]), Pos([-1])] will return a list of eight templates:
[Template(Word([0])), Template(Word([0, 1])), Template(Pos([-2])), Template(Pos([-1])), Template(Pos([-2]),Word([0])), Template(Pos([-1]),Word([0])), Template(Pos([-2]),Word([0, 1])), Template(Pos([-1]),Word([0, 1]))]
Templates where one feature is a subset of another, such as Template(Word([0,1]), Word([1])), will not appear in the output. By default, this non-subset constraint is tightened to disjointness: Templates of type Template(Word([0,1]), Word([1,2])) will also be filtered out. With skipintersecting=False, such Templates are allowed.
WARNING: this method makes it very easy to fill all your memory when training generated templates on any real-world corpus.
Parameters: - featurelists (list of (list of Features)) – lists of Features, whose Cartesian product will return a set of Templates
- combinations (None, int, or (int, int)) – given n featurelists: if combinations=k, all generated Templates will have k features; if combinations=(k1,k2) they will have k1..k2 features; if None, defaults to 1..n
- skipintersecting (bool) – if True, do not output intersecting Templates (non-disjoint positions for some feature)
Returns: generator of Templates
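The combination logic can be sketched in plain Python with features represented as (name, positions) tuples. This is a simplified standalone illustration, not NLTK's implementation: `expand_templates` and its `k_range` parameter are hypothetical, only the None and (k1, k2) forms of combinations are modelled, and sorting the features inside each template is assumed to be enough to detect equivalent forms.

```python
from itertools import combinations, product


def intersects(a, b):
    """Same property name and at least one shared position."""
    return a[0] == b[0] and bool(set(a[1]) & set(b[1]))


def issuperset(a, b):
    """Same property name and `a` covers every position of `b`."""
    return a[0] == b[0] and set(a[1]) >= set(b[1])


def expand_templates(featurelists, k_range=None, skipintersecting=True):
    """Return templates (sorted tuples of features) built from featurelists."""
    k_min, k_max = k_range or (1, len(featurelists))
    if skipintersecting:
        clash = intersects
    else:
        # Weaker constraint: only reject features where one contains the other.
        def clash(a, b):
            return issuperset(a, b) or issuperset(b, a)
    seen, templates = set(), []
    for k in range(k_min, k_max + 1):
        for picked in combinations(featurelists, k):   # choose k sublists
            for feats in product(*picked):             # one feature from each
                if any(clash(a, b) for a, b in combinations(feats, 2)):
                    continue
                template = tuple(sorted(feats))        # order-insensitive form
                if template not in seen:
                    seen.add(template)
                    templates.append(template)
    return templates


wd_0, wd_01 = ("Word", (0,)), ("Word", (0, 1))
pos_m2, pos_m1 = ("Pos", (-2,)), ("Pos", (-1,))

print(len(expand_templates([[wd_0], [pos_m2]])))                 # 3
print(len(expand_templates([[wd_0, wd_01], [pos_m2]])))          # 5
print(len(expand_templates([[wd_0, wd_01], [pos_m2, pos_m1]])))  # 8
```

Because the number of templates grows with the product of the list sizes, even short feature lists produce large template sets once several lists are combined.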
-
get_neighborhood
(tokens, index)[source]¶ Returns the set of indices i such that applicable_rules(tokens, i, ...) depends on the value of the index-th token of tokens.
This method is used by the “fast” Brill tagger trainer.
Parameters: - tokens (list(tuple)) – The tokens being tagged.
- index (int) – The index whose neighborhood should be returned.
Return type: set
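One way to picture the neighborhood is as the range of indices whose feature windows can reach the changed token. The function below is a standalone sketch under that assumption, not the library's code: `neighborhood` is a hypothetical name, and it conservatively treats all feature positions as one contiguous span from the smallest to the largest offset (always including offset 0).

```python
def neighborhood(num_tokens, index, feature_positions):
    """Indices i whose candidate rules may depend on the token at `index`."""
    # Offset 0 is always included: rules at `index` depend on `index` itself.
    allpositions = [0] + list(feature_positions)
    start, end = min(allpositions), max(allpositions)
    # A rule at i looks at i+start .. i+end, so it can see `index`
    # whenever index-end <= i <= index-start (clipped to the sentence).
    lo = max(0, index - end)
    hi = min(index - start + 1, num_tokens)
    return set(range(lo, hi)) | {index}


# A template looking one and two steps to the left:
print(sorted(neighborhood(10, 5, [-2, -1])))  # [5, 6, 7]
# Near the end of the sentence the range is clipped:
print(sorted(neighborhood(10, 9, [-2, -1])))  # [9]
```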
Module contents¶
Transformation Based Learning
A general purpose package for Transformation Based Learning, currently used by nltk.tag.BrillTagger.