2 String Representation for Tagged Tokens
Tagged tokens are often written using the form 'fly/NN'. The
nltk.tag module provides utility functions to convert between this
string representation and the tuple representation:
|
>>> import nltk.tag
>>> tagged_tok = ('fly', 'NN')
>>> print nltk.tag.tuple2str(tagged_tok)
fly/NN
>>> print nltk.tag.str2tuple('the/DT')
('the', 'DT')
|
|
To convert an entire sentence from the string format to the tuple
format, we simply tokenize the sentence and then apply str2tuple
to each word:
|
>>> sent = 'The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.'
>>> [nltk.tag.str2tuple(w) for w in sent.split()]
[('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
('the', 'DT'), ('mat', 'NN'), ('.', '.')]
|
|
Similarly, we can convert from a list of tagged tuples to a single
string by combining tuple2str with the string join method:
|
>>> sent = [('The', 'DT'), ('cat', 'NN'), ('yawned', 'VBD')]
>>> ' '.join([nltk.tag.tuple2str(w) for w in sent])
'The/DT cat/NN yawned/VBD'
|
|
3 Taggers
The nltk.tag module defines several taggers, which take a token
list (typically a sentence), assign a tag to each token, and return
the resulting list of tagged tokens. Most of the taggers
defined in the nltk.tag module are built automatically based on a
training corpus. For example, the unigram tagger tags each word w
by checking what the most frequent tag for w was in a training
corpus:
|
>>> from nltk.corpus import brown
>>> tagger = nltk.UnigramTagger(brown.tagged_sents()[:500])
>>> tagger.tag(brown.sents()[501])
[('Mitchell', 'NP'), ('decried', None), ('the', 'AT'), ('high', 'JJ'),
('rate', None), ('of', 'IN'), ('unemployment', None), ...]
|
|
Note that words that the tagger has not seen before, such as
decried, receive a tag of None.
In the examples below, we'll look at developing automatic
part-of-speech taggers based on the Brown Corpus. Here are the
training & test sets we'll use:
|
>>> brown_train = brown.tagged_sents(categories='a')[100:]
>>> brown_test = brown.tagged_sents(categories='b')[:100]
>>> test_sent = nltk.tag.untag(brown_test[0])
|
|
(Note that these are on the small side, to make the tests run faster
-- for real-world use, you would probably want to train on more data.)
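For example, a larger training set could be built by dropping the slice on
the news category; the sketch below (no output shown; bigger_train is just
an illustrative name) is one way to do that:
|
>>> # sketch only: use all of the Brown news category for training
>>> bigger_train = brown.tagged_sents(categories='a')
|
|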
3.1 Default Tagger
The simplest tagger is the DefaultTagger, which just applies the
same tag to all tokens:
|
>>> default_tagger = nltk.DefaultTagger('XYZ')
>>> default_tagger.tag('This is a test'.split())
[('This', 'XYZ'), ('is', 'XYZ'), ('a', 'XYZ'), ('test', 'XYZ')]
|
|
Since 'NN' is the most frequent tag in the Brown corpus, we can
use a tagger that assigns 'NN' to all words as a baseline.
|
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(test_sent)
[('Assembly', 'NN'), ('session', 'NN'), ('brought', 'NN'),
('much', 'NN'), ('good', 'NN')]
|
|
Using this baseline, we achieve only a fairly low accuracy:
|
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(default_tagger, brown_test))
Accuracy: 16.0%
|
|
3.2 Regexp Tagger
The RegexpTagger class assigns tags to tokens by comparing their
word strings to a series of regular expressions. The following tagger
uses word suffixes to make guesses about the correct Brown Corpus part
of speech tag:
|
>>> regexp_tagger = nltk.RegexpTagger(
...     [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
...      (r'(The|the|A|a|An|an)$', 'AT'),  # articles
...      (r'.*able$', 'JJ'),               # adjectives
...      (r'.*ness$', 'NN'),               # nouns formed from adjectives
...      (r'.*ly$', 'RB'),                 # adverbs
...      (r'.*s$', 'NNS'),                 # plural nouns
...      (r'.*ing$', 'VBG'),               # gerunds
...      (r'.*ed$', 'VBD'),                # past tense verbs
...      (r'.*', 'NN')                     # nouns (default)
... ])
>>> regexp_tagger.tag(test_sent)
[('Assembly', 'RB'), ('session', 'NN'), ('brought', 'NN'),
('much', 'NN'), ('good', 'NN')]
|
|
This gives us a higher score than the default tagger, but accuracy is
still fairly low:
|
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(regexp_tagger, brown_test))
Accuracy: 33.8%
|
|
3.3 Unigram Tagger
As mentioned above, the UnigramTagger class finds the most likely
tag for each word in a training corpus, and then uses that information
to assign tags to new tokens.
|
>>> unigram_tagger = nltk.UnigramTagger(brown_train)
>>> unigram_tagger.tag(test_sent)
[('Assembly', 'NN-TL'), ('session', 'NN'), ('brought', 'VBD'), ('much', 'AP'), ('good', 'JJ')]
|
|
This gives us a significantly higher accuracy score than the default
tagger or the regexp tagger:
|
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(unigram_tagger, brown_test))
Accuracy: 76.4%
|
|
As was mentioned above, the unigram tagger will assign a tag of
None to any words that it never saw in the training data. We can
avoid this problem by providing the unigram tagger with a backoff
tagger, which will be used whenever the unigram tagger is unable to
choose a tag:
|
>>> unigram_tagger_2 = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(unigram_tagger_2, brown_test))
Accuracy: 83.5%
|
|
Using a backoff tagger has another advantage, as well -- it allows us
to build a more compact unigram tagger, because the unigram tagger
doesn't need to explicitly store the tags for words that the backoff
tagger would get right anyway. We can see this by using the size()
method, which reports the number of words for which a unigram tagger
has stored a most likely tag.
|
>>> print unigram_tagger.size()
6276
>>> print unigram_tagger_2.size()
3935
|
|
3.4 Bigram Tagger
The bigram tagger is similar to the unigram tagger, except that it
finds the most likely tag for each word, given the preceding tag.
(It is called a "bigram" tagger because it uses two pieces of
information -- the current word, and the previous tag.) When
training, it can look up the preceding tag directly. When run on new
data, it works through the sentence from left to right, and uses the
tag that it just generated for the preceding word.
|
>>> bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger_2)
>>> print bigram_tagger.size()
1000
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(bigram_tagger, brown_test))
Accuracy: 84.7%
|
|
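As an aside, a bigram tagger trained with no backoff at all does much worse,
because any word/previous-tag context that never occurred in the training
data is tagged None, which in turn makes the following contexts unfamiliar
as well. The sketch below (no output shown; lone_bigram_tagger and
lone_score are just illustrative names) shows how you would measure this:
|
>>> # sketch: a bigram tagger with no backoff -- unseen contexts yield None
>>> lone_bigram_tagger = nltk.BigramTagger(brown_train)
>>> lone_score = nltk.tag.accuracy(lone_bigram_tagger, brown_test)
|
|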
3.5 Trigram Tagger & N-Gram Tagger
Similarly, the trigram tagger finds the most likely tag for a word,
given the preceding two tags; and the n-gram tagger finds the most
likely tag for a word, given the preceding n-1 tags. However, these
higher-order taggers are only likely to improve performance if there
is a large amount of training data available; otherwise, the sequences
that they consider do not occur often enough to gather reliable
statistics.
|
>>> trigram_tagger = nltk.TrigramTagger(brown_train, backoff=bigram_tagger)
>>> print trigram_tagger.size()
543
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(trigram_tagger, brown_test))
Accuracy: 84.6%
|
|
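The same idea generalizes through the NgramTagger class, which takes the
order n as its first argument. As a sketch (assuming NgramTagger is exposed
at the package level like the other taggers; no output shown), a 4-gram
tagger backed off to the trigram tagger could be built like this:
|
>>> # sketch: a 4-gram tagger, backing off to the trigram tagger
>>> quadgram_tagger = nltk.NgramTagger(4, brown_train, backoff=trigram_tagger)
|
|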
3.6 Affix Tagger
The affix tagger is similar to the unigram tagger, except that it
takes some fixed-size substring of the word, and finds the most likely
tag for that substring. It can be used to look at either suffixes or
prefixes -- use a positive affix size for prefixes and a negative
affix size for suffixes. The min_stem_length argument specifies
the minimum size for "word stems" -- for any word that is shorter than
min_stem_length+abs(affix_length), the backoff tagger will be
used.
|
>>> affix_tagger = nltk.AffixTagger(
... brown_train, affix_length=-3, min_stem_length=2,
... backoff=default_tagger)
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(affix_tagger, brown_test))
Accuracy: 32.1%
|
|
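A prefix-based affix tagger can be built the same way by passing a positive
affix_length; the sketch below (no output shown; prefix_tagger is just an
illustrative name) tags words by their first two letters:
|
>>> # sketch: a positive affix_length makes the tagger look at word prefixes
>>> prefix_tagger = nltk.AffixTagger(
...     brown_train, affix_length=2, min_stem_length=2,
...     backoff=default_tagger)
|
|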
3.7 Brill Tagger
The Brill Tagger starts by running an initial tagger, and then
improves the tagging by applying a list of transformation rules.
These transformation rules are automatically learned from the training
corpus, based on one or more "rule templates."
|
>>> from nltk.tag.brill import *
>>> # rule templates that look at tags or words within 1-3 positions of the target token
>>> templates = [
... SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)),
... SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)),
... SymmetricProximateTokensTemplate(ProximateTagsRule, (1,2)),
... SymmetricProximateTokensTemplate(ProximateTagsRule, (1,3)),
... SymmetricProximateTokensTemplate(ProximateWordsRule, (1,1)),
... SymmetricProximateTokensTemplate(ProximateWordsRule, (2,2)),
... SymmetricProximateTokensTemplate(ProximateWordsRule, (1,2)),
... SymmetricProximateTokensTemplate(ProximateWordsRule, (1,3)),
... ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1,1)),
... ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1,1)),
... ]
>>> trainer = FastBrillTaggerTrainer(initial_tagger=unigram_tagger_2,
... templates=templates, trace=3,
... deterministic=True)
>>> brill_tagger = trainer.train(brown_train, max_rules=10)
Training Brill tagger on 4523 sentences...
Finding initial useful rules...
Found 117410 useful rules.
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 354 354   0   3  | TO -> IN if the tag of the following word is 'AT'
 236 462 226  45  | NN -> NP if the tag of the preceding word is 'NP'
 226 288  62  31  | NN -> VB if the tag of the preceding word is 'TO'
 111 129  18  14  | NN -> VB if the tag of words i-2...i-1 is 'MD'
 103 107   4   0  | VBD -> VBN if the tag of words i-2...i-1 is 'BEDZ'
 101 101   0   3  | TO -> IN if the tag of the following word is 'NP'
  84  85   1   2  | VBD -> VBN if the tag of words i-3...i-1 is 'BE'
  79 108  29   4  | VB -> NN if the tag of words i-2...i-1 is 'AT'
  73 127  54  13  | NP -> NP-TL if the tag of the following word is
                  |   'NN-TL'
  72 121  49   4  | TO -> IN if the tag of words i+1...i+2 is 'NNS'
>>> print 'Accuracy: %4.1f%%' % (
... 100.0 * nltk.tag.accuracy(brill_tagger, brown_test))
Accuracy: 85.2%
|
|
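The rules that the trainer learned can also be inspected after training; the
sketch below (no output shown) assumes the resulting Brill tagger exposes its
rule list through a rules() method:
|
>>> # sketch: retrieve the learned transformation rules, in the order learned
>>> learned_rules = brill_tagger.rules()
|
|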
3.8 HMM Tagger
The HMM tagger uses a hidden Markov model to find the most likely tag
sequence for each sentence. (Note: this requires numpy.)
|
>>> from nltk.tag.hmm import *
>>> from nltk.probability import DictionaryProbDist, DictionaryConditionalProbDist
>>> from numpy import array, float64
|
|
The demo code below is lifted more or less directly from the HMM class: it
builds a small toy model whose states are market conditions ('bull', 'bear',
'static') and whose observed symbols are price movements ('up', 'down',
'unchanged').
|
>>> symbols = ['up', 'down', 'unchanged']
>>> states = ['bull', 'bear', 'static']
|
|
|
>>> # helper: wrap a list of probabilities as a DictionaryProbDist over samples
>>> def probdist(values, samples):
...     d = {}
...     for value, item in zip(values, samples):
...         d[item] = value
...     return DictionaryProbDist(d)
|
|
|
>>> # helper: wrap a matrix of probabilities as a DictionaryConditionalProbDist
>>> def conditionalprobdist(array, conditions, samples):
...     d = {}
...     for values, condition in zip(array, conditions):
...         d[condition] = probdist(values, samples)
...     return DictionaryConditionalProbDist(d)
|
|
|
>>> # transition probabilities: P(next state | current state)
>>> A = array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]], float64)
>>> A = conditionalprobdist(A, states, states)
|
|
|
>>> # output (emission) probabilities: P(symbol | state)
>>> B = array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]], float64)
>>> B = conditionalprobdist(B, states, symbols)
|
|
|
>>> # prior (initial state) probabilities
>>> pi = array([0.5, 0.2, 0.3], float64)
>>> pi = probdist(pi, states)
|
|
|
>>> model = HiddenMarkovModelTagger(symbols=symbols, states=states,
... transitions=A, outputs=B, priors=pi)
|
|
|
>>> test = ['up', 'down', 'up']
>>> sequence = [(t, None) for t in test]   # observations with unspecified states
|
|
|
>>> print '%.3f' % (model.probability(sequence))
0.051
|
|
Check the test sequence by hand -- calculate the joint probability for
each possible state sequence, and verify that each one matches what
the model gives; then verify that their total equals the probability
the model assigns to the sequence with no states specified.
|
>>> seqs = [zip(test, [a,b,c]) for a in states
...         for b in states for c in states]
>>> total = 0
>>> for seq in seqs:
...     # start with the prior probability of the first state
...     expect = pi.prob(seq[0][1])
...     # multiply in the output probability of each observation
...     for (o,s) in seq:
...         expect *= B[s].prob(o)
...     # multiply in the transition probability for each state pair
...     for i in range(1, len(seq)):
...         expect *= A[seq[i-1][1]].prob(seq[i][1])
...     assert abs(expect-model.probability(seq)) < 1e-10
...     total += model.probability(seq)
>>> assert abs(total-model.probability(sequence)) < 1e-10
|
|
Find the most likely set of tags for the test sequence.
|
>>> model.tag(test)
[('up', 'bull'), ('down', 'bear'), ('up', 'bull')]
|
|
Find some entropy values. These are all in base 2 (i.e., bits).
|
>>> print '%.3f' % (model.entropy(sequence))
3.401
|
|
|
>>> print '%.3f' % (model._exhaustive_entropy(sequence))
3.401
|
|
|
>>> model.point_entropy(sequence)
array([ 0.99392864, 1.54508687, 0.97119001])
|
|
|
>>> model._exhaustive_point_entropy(sequence)
array([ 0.99392864, 1.54508687, 0.97119001])
|
|