nltk.tag package¶
Submodules¶
nltk.tag.api module¶
Interface for tagging each token in a sentence with supplementary information, such as its part of speech.
class nltk.tag.api.FeaturesetTaggerI[source]¶
Bases: nltk.tag.api.TaggerI
A tagger that requires tokens to be featuresets. A featureset is a dictionary that maps from feature names to feature values. See nltk.classify for more information about features and featuresets.
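For illustration only (the feature names below are invented for this example; each tagger defines its own features), a featureset for the word 'fly' might look like:
>>> featureset = {'word': 'fly', 'suffix(2)': 'ly', 'prev-tag': 'DT'}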
class nltk.tag.api.TaggerI[source]¶
Bases: object
A processing interface for assigning a tag to each token in a list. Tags are case-sensitive strings that identify some property of each token, such as its part of speech or its sense.
Some taggers require specific types for their tokens. This is generally indicated by the use of a sub-interface to TaggerI. For example, featureset taggers, which are subclassed from FeaturesetTaggerI, require that each token be a featureset.
Subclasses must define:
- either tag() or tag_sents() (or both)
evaluate(gold)[source]¶
Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.
Parameters: gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
Return type: float
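As a minimal sketch of the interface (not part of NLTK itself), a subclass only needs to supply tag(); tag_sents() and evaluate() then work through the default implementations:
>>> from nltk.tag.api import TaggerI
>>> class LengthTagger(TaggerI):
...     """Toy tagger (hypothetical): tags each token with its character length."""
...     def tag(self, tokens):
...         return [(tok, str(len(tok))) for tok in tokens]
>>> LengthTagger().tag(['a', 'toy', 'example'])
[('a', '1'), ('toy', '3'), ('example', '7')]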
nltk.tag.brill module¶
class nltk.tag.brill.BrillTagger(initial_tagger, rules, training_stats=None)[source]¶
Bases: nltk.tag.api.TaggerI
Brill's transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text, and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the TagRule interface.
Brill taggers can be created directly from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using one of the TaggerTrainers available.
batch_tag_incremental(sequences, gold)[source]¶
Tags by applying each rule to the entire corpus (rather than all rules to a single sequence). The point is to collect statistics on the test set for individual rules.
NOTE: This is inefficient (it does not build any index, so it will traverse the entire corpus N times for N rules) – usually you would not care about statistics for individual rules and would thus use batch_tag() instead.
Parameters: - sequences (list of list of strings) – lists of token sequences (sentences, in some applications) to be tagged
- gold (list of list of strings) – the gold standard
Returns: tuple of (tagged_sequences, ordered list of rule scores (one for each rule))
json_tag = 'nltk.tag.BrillTagger'¶
print_template_statistics(test_stats=None, printunused=True)[source]¶
Print a list of all templates, ranked according to efficiency.
If test_stats is available, the templates are ranked according to their relative contribution (summed for all rules created from a given template, weighted by score) to the performance on the test set. If no test_stats is given, then statistics collected during training are used instead. There is also an unweighted measure (just counting the rules). This is less informative, though, as many low-score rules will appear towards the end of training.
Parameters: - test_stats (dict of str -> any (but usually numbers)) – dictionary of statistics collected during testing
- printunused (bool) – if True, print a list of all unused templates
Returns: None
Return type: None
rules()[source]¶
Return the ordered list of transformation rules that this tagger has learnt.
Returns: the ordered list of transformation rules that correct the initial tagging
Return type: list of Rules
class nltk.tag.brill.Pos(positions, end=None)[source]¶
Bases: nltk.tbl.feature.Feature
Feature which examines the tags of nearby tokens.
json_tag = 'nltk.tag.brill.Pos'¶
class nltk.tag.brill.Word(positions, end=None)[source]¶
Bases: nltk.tbl.feature.Feature
Feature which examines the text (word) of nearby tokens.
json_tag = 'nltk.tag.brill.Word'¶
nltk.tag.brill.describe_template_sets()[source]¶
Print the available template sets in this demo, with a short description.
nltk.tag.brill.fntbl37()[source]¶
Return 37 templates taken from the postagging task of the fntbl distribution http://www.cs.jhu.edu/~rflorian/fntbl/ (37 is after excluding a handful which do not condition on Pos[0]; fntbl can do that but the current nltk implementation cannot.)
nltk.tag.brill_trainer module¶
class nltk.tag.brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=0, deterministic=None, ruleformat='str')[source]¶
Bases: object
A trainer for tbl taggers.
train(train_sents, max_rules=200, min_score=2, min_acc=None)[source]¶
Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than min_acc.
#imports
>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Pos, Word
>>> from nltk.tag import untag, RegexpTagger, BrillTaggerTrainer

#some data
>>> from nltk.corpus import treebank
>>> training_data = treebank.tagged_sents()[:100]
>>> baseline_data = treebank.tagged_sents()[100:200]
>>> gold_data = treebank.tagged_sents()[200:300]
>>> testing_data = [untag(s) for s in gold_data]

>>> backoff = RegexpTagger([
...     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'(The|the|A|a|An|an)$', 'AT'),  # articles
...     (r'.*able$', 'JJ'),               # adjectives
...     (r'.*ness$', 'NN'),               # nouns formed from adjectives
...     (r'.*ly$', 'RB'),                 # adverbs
...     (r'.*s$', 'NNS'),                 # plural nouns
...     (r'.*ing$', 'VBG'),               # gerunds
...     (r'.*ed$', 'VBD'),                # past tense verbs
...     (r'.*', 'NN')                     # nouns (default)
... ])

>>> baseline = backoff #see NOTE1
>>> baseline.evaluate(gold_data)
0.2450142...

#templates
>>> Template._cleartemplates() #clear any templates created in earlier tests
>>> templates = [Template(Pos([-1])), Template(Pos([-1]), Word([0]))]

#construct a BrillTaggerTrainer
>>> tt = BrillTaggerTrainer(baseline, templates, trace=3)

>>> tagger1 = tt.train(training_data, max_rules=10)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
Finding initial useful rules...
    Found 845 useful rules.
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  47  63  16 161  | NN->IN if Pos:NNS@[-1]
  33  33   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | IN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | IN->, if Pos:NNS@[-1] & Word:,@[0]
  22  27   5  24  | NN->-NONE- if Pos:VBD@[-1]
  17  17   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger1.rules()[1:3]
(Rule('001', 'NN', ',', [(Pos([-1]),'NN'), (Word([0]),',')]), Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]))

>>> train_stats = tagger1.train_stats()
>>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]

>>> tagger1.print_template_statistics(printunused=False)
TEMPLATE STATISTICS (TRAIN)  2 templates, 10 rules)
TRAIN (   2417 tokens) initial  1775 0.2656 final:  1269 0.4750
#ID | Score (train) |  #Rules     | Template
--------------------------------------------
001 |   305   0.603 |   7   0.700 | Template(Pos([-1]),Word([0]))
000 |   201   0.397 |   3   0.300 | Template(Pos([-1]))

>>> tagger1.evaluate(gold_data)
0.43996...

>>> tagged, test_stats = tagger1.batch_tag_incremental(testing_data, gold_data)
>>> tagged[33][12:] == [('foreign', 'IN'), ('debt', 'NN'), ('of', 'IN'), ('$', 'NN'), ('64', 'CD'),
... ('billion', 'NN'), ('*U*', 'NN'), ('--', 'NN'), ('the', 'DT'), ('third-highest', 'NN'), ('in', 'NN'),
... ('the', 'DT'), ('developing', 'VBG'), ('world', 'NN'), ('.', '.')]
True

>>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]
# a high-accuracy tagger
>>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
Finding initial useful rules...
    Found 845 useful rules.
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  36  36   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | NN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | NN->, if Pos:NNS@[-1] & Word:,@[0]
  19  19   0   6  | NN->VB if Pos:TO@[-1]
  18  18   0   0  | CD->-NONE- if Pos:NN@[-1] & Word:0@[0]
  18  18   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]

>>> tagger2.evaluate(gold_data)
0.44159544...
>>> tagger2.rules()[2:4]
(Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))

# NOTE1: (!!FIXME) A far better baseline uses nltk.tag.UnigramTagger,
# with a RegexpTagger only as backoff. For instance,
# >>> baseline = UnigramTagger(baseline_data, backoff=backoff)
# However, as of Nov 2013, nltk.tag.UnigramTagger does not yield consistent results
# between python versions. The simplistic backoff above is a workaround to make doctests
# get consistent input.
Parameters: - train_sents (list(list(tuple))) – training data
- max_rules (int) – output at most max_rules rules
- min_score (int) – stop training when no rules better than min_score can be found
- min_acc (float or None) – discard any rule with lower accuracy than min_acc
Returns: the learned tagger
Return type: BrillTagger
nltk.tag.crf module¶
A module for POS tagging using CRFSuite
class nltk.tag.crf.CRFTagger(feature_func=None, verbose=False, training_opt={})[source]¶
Bases: nltk.tag.api.TaggerI
A class for POS tagging using CRFSuite: https://pypi.python.org/pypi/python-crfsuite
>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()

>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'), ('eat','Verb'), ('meat','Noun')]]

>>> ct.train(train_data, 'model.crf.tagger')
>>> ct.tag_sents([['dog','is','good'], ['Cat','eat','meat']])
[[('dog', 'Noun'), ('is', 'Verb'), ('good', 'Adj')], [('Cat', 'Noun'), ('eat', 'Verb'), ('meat', 'Noun')]]

>>> gold_sentences = [[('dog','Noun'), ('is','Verb'), ('good','Adj')], [('Cat','Noun'), ('eat','Verb'), ('meat','Noun')]]
>>> ct.evaluate(gold_sentences)
1.0

Setting the learned model file:
>>> ct = CRFTagger()
>>> ct.set_model_file('model.crf.tagger')
>>> ct.evaluate(gold_sentences)
1.0
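The feature_func constructor argument replaces the built-in feature extractor. A sketch, assuming python-crfsuite is installed and reusing the train_data defined above; the function name and the particular features are invented for this example, but the function must accept the token list and an index and return a list of feature strings:
>>> def simple_features(tokens, idx):
...     word = tokens[idx]
...     feats = ['WORD_' + word.lower(), 'SUF3_' + word[-3:]]
...     if word[0].isupper():
...         feats.append('CAPITALIZED')
...     return feats
>>> ct2 = CRFTagger(feature_func=simple_features)
>>> ct2.train(train_data, 'model.custom.tagger')
>>> tags = ct2.tag(['dog', 'is', 'good'])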
tag(tokens)[source]¶
Tag a sentence using the Python CRFSuite Tagger. NB: before using this function, the user should specify the model_file, either by training a new model with train() or by loading a pre-trained one with set_model_file().
Parameters: tokens (list(str)) – list of tokens to tag.
Returns: list of tagged tokens.
Return type: list(tuple(str, str))
nltk.tag.hmm module¶
Hidden Markov Models (HMMs) are largely used to assign the correct label sequence to sequential data or to assess the probability of a given label and data sequence. These models are finite state machines characterised by a number of states, transitions between these states, and output symbols emitted while in each state. The HMM is an extension to the Markov chain, where each state corresponds deterministically to a given event. In the HMM the observation is a probabilistic function of the state. HMMs share the Markov chain’s assumption, being that the probability of transition from one state to another only depends on the current state - i.e. the series of states that led to the current state are not used. They are also time invariant.
The HMM is a directed graph, with probability weighted edges (representing the probability of a transition between the source and sink states) where each vertex emits an output symbol when entered. The symbol (or observation) is non-deterministically generated. For this reason, knowing that a sequence of output observations was generated by a given HMM does not mean that the corresponding sequence of states (and what the current state is) is known. This is the ‘hidden’ in the hidden markov model.
Formally, a HMM can be characterised by:
- the output observation alphabet. This is the set of symbols which may be observed as output of the system.
- the set of states.
- the transition probabilities a_{ij} = P(s_t = j | s_{t-1} = i). These represent the probability of transition to each state from a given state.
- the output probability matrix b_i(k) = P(X_t = o_k | s_t = i). These represent the probability of observing each symbol in a given state.
- the initial state distribution. This gives the probability of starting in each state.
To ground this discussion, take a common NLP application, part-of-speech (POS) tagging. An HMM is desirable for this task as the highest probability tag sequence can be calculated for a given sequence of word forms. This differs from other tagging techniques which often tag each word individually, seeking to optimise each individual tagging greedily without regard to the optimal combination of tags for a larger unit, such as a sentence. The HMM does this with the Viterbi algorithm, which efficiently computes the optimal path through the graph given the sequence of word forms.
In POS tagging the states usually have a 1:1 correspondence with the tag alphabet - i.e. each state represents a single tag. The output observation alphabet is the set of word forms (the lexicon), and the remaining three parameters are derived by a training regime. With this information the probability of a given sentence can be easily derived, by simply summing the probability of each distinct path through the model. Similarly, the highest probability tagging sequence can be derived with the Viterbi algorithm, yielding a state sequence which can be mapped into a tag sequence.
This discussion assumes that the HMM has been trained. This is probably the most difficult task with the model, and requires either MLE estimates of the parameters or unsupervised learning using the Baum-Welch algorithm, a variant of EM.
For more information, please consult the source code for this module, which includes extensive demonstration code.
class nltk.tag.hmm.HiddenMarkovModelTagger(symbols, states, transitions, outputs, priors, transform=<function _identity>)[source]¶
Bases: nltk.tag.api.TaggerI
Hidden Markov model class, a generative model for labelling sequence data. These models define the joint probability of a sequence of symbols and their labels (state transitions) as the product of the starting state probability, the probability of each state transition, and the probability of each observation being generated from each state. This is described in more detail in the module documentation.
This implementation is based on the HMM description in Chapter 8, Huang, Acero and Hon, Spoken Language Processing, and includes an extension for training shallow HMM parsers or specialized HMMs as in Molina et al., 2002. A specialized HMM modifies training data by applying a specialization function to create a new training set that is more appropriate for sequential tagging with an HMM. A typical use case is chunking.
Parameters: - symbols (seq of any) – the set of output symbols (alphabet)
- states (seq of any) – a set of states representing state space
- transitions (ConditionalProbDistI) – transition probabilities; Pr(s_i | s_j) is the probability of transition from state i given the model is in state_j
- outputs (ConditionalProbDistI) – output probabilities; Pr(o_k | s_i) is the probability of emitting symbol k when entering state i
- priors (ProbDistI) – initial state distribution; Pr(s_i) is the probability of starting in state i
- transform (callable) – an optional function for transforming training instances, defaults to the identity function.
best_path(unlabeled_sequence)[source]¶
Returns the state sequence of the optimal (most probable) path through the HMM. Uses the Viterbi algorithm to calculate this part by dynamic programming.
Returns: the state sequence
Return type: sequence of any
Parameters: unlabeled_sequence (list) – the sequence of unlabeled symbols

best_path_simple(unlabeled_sequence)[source]¶
Returns the state sequence of the optimal (most probable) path through the HMM. Uses the Viterbi algorithm to calculate this part by dynamic programming. This uses a simple, direct method, and is included for teaching purposes.
Returns: the state sequence
Return type: sequence of any
Parameters: unlabeled_sequence (list) – the sequence of unlabeled symbols
entropy(unlabeled_sequence)[source]¶
Returns the entropy over labellings of the given sequence. This is given by:
H(O) = - sum_S Pr(S | O) log Pr(S | O)
where the summation ranges over all state sequences, S. Let Z = Pr(O) = sum_S Pr(S, O), where the summation ranges over all state sequences and O is the observation sequence. As such the entropy can be re-expressed as:
H = - sum_S Pr(S | O) log [ Pr(S, O) / Z ]
  = log Z - sum_S Pr(S | O) log Pr(S, O)
  = log Z - sum_S Pr(S | O) [ log Pr(S_0) + sum_t log Pr(S_t | S_{t-1}) + sum_t log Pr(O_t | S_t) ]
The order of summation for the log terms can be flipped, allowing dynamic programming to be used to calculate the entropy. Specifically, we use the forward and backward probabilities (alpha, beta) giving:
H = log Z - sum_s0 alpha_0(s0) beta_0(s0) / Z * log Pr(s0)
          + sum_t,si,sj alpha_t(si) Pr(sj | si) Pr(O_t+1 | sj) beta_t(sj) / Z * log Pr(sj | si)
          + sum_t,st alpha_t(st) beta_t(st) / Z * log Pr(O_t | st)
This simply uses alpha and beta to find the probabilities of partial sequences, constrained to include the given state(s) at some point in time.
log_probability(sequence)[source]¶
Returns the log-probability of the given symbol sequence. If the sequence is labelled, then returns the joint log-probability of the symbol, state sequence. Otherwise, uses the forward algorithm to find the log-probability over all label sequences.
Returns: the log-probability of the sequence
Return type: float
Parameters: sequence (Token) – the sequence of symbols which must contain the TEXT property, and optionally the TAG property

point_entropy(unlabeled_sequence)[source]¶
Returns the pointwise entropy over the possible states at each position in the chain, given the observation sequence.

probability(sequence)[source]¶
Returns the probability of the given symbol sequence. If the sequence is labelled, then returns the joint probability of the symbol, state sequence. Otherwise, uses the forward algorithm to find the probability over all label sequences.
Returns: the probability of the sequence
Return type: float
Parameters: sequence (Token) – the sequence of symbols which must contain the TEXT property, and optionally the TAG property
random_sample(rng, length)[source]¶
Randomly sample the HMM to generate a sentence of a given length. This samples the prior distribution then the observation distribution and transition distribution for each subsequent observation and state. This will mostly generate unintelligible garbage, but can provide some amusement.
Returns: the randomly created state/observation sequence, generated according to the HMM’s probability distributions. The SUBTOKENS have TEXT and TAG properties containing the observation and state respectively.
Return type: list
Parameters: - rng (Random (or any object with a random() method)) – random number generator
- length (int) – desired output length
tag(unlabeled_sequence)[source]¶
Tags the sequence with the highest probability state sequence. This uses the best_path method to find the Viterbi path.
Returns: a labelled sequence of symbols
Return type: list
Parameters: unlabeled_sequence (list) – the sequence of unlabeled symbols

test(test_sequence, verbose=False, **kwargs)[source]¶
Tests the HiddenMarkovModelTagger instance.
Parameters: - test_sequence (list(list)) – a sequence of labeled test instances
- verbose (bool) – boolean flag indicating whether training should be verbose or include printed output
classmethod train(labeled_sequence, test_sequence=None, unlabeled_sequence=None, **kwargs)[source]¶
Train a new HiddenMarkovModelTagger using the given labeled and unlabeled training instances. Testing will be performed if test instances are provided.
Returns: a hidden markov model tagger
Return type: HiddenMarkovModelTagger
Parameters: - labeled_sequence (list(list)) – a sequence of labeled training instances, i.e. a list of sentences represented as tuples
- test_sequence (list(list)) – a sequence of labeled test instances
- unlabeled_sequence (list(list)) – a sequence of unlabeled training instances, i.e. a list of sentences represented as words
- transform (function) – an optional function for transforming training instances, defaults to the identity function, see transform()
- estimator (class or function) – an optional function or class that maps a condition’s frequency distribution to its probability distribution, defaults to a Lidstone distribution with gamma = 0.1
- verbose (bool) – boolean flag indicating whether training should be verbose or include printed output
- max_iterations (int) – number of Baum-Welch iterations to perform
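A minimal usage sketch (not from the original documentation): training on a small slice of the treebank sample shipped with NLTK. The slice size is arbitrary and the resulting tags are illustrative only.
>>> from nltk.corpus import treebank
>>> from nltk.tag.hmm import HiddenMarkovModelTagger
>>> train_sents = treebank.tagged_sents()[:300]  # small, illustrative slice
>>> hmm_tagger = HiddenMarkovModelTagger.train(train_sents)
>>> tagged = hmm_tagger.tag('Today is a good day .'.split())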
unicode_repr()¶
Return repr(self).

class nltk.tag.hmm.HiddenMarkovModelTrainer(states=None, symbols=None)[source]¶
Bases: object
Algorithms for learning HMM parameters from training data. These include both supervised learning (MLE) and unsupervised learning (Baum-Welch).
Creates an HMM trainer to induce an HMM with the given states and output symbol alphabet. A supervised and unsupervised training method may be used. If either of the states or symbols are not given, these may be derived from supervised training.
Parameters: - states (sequence of any) – the set of state labels
- symbols (sequence of any) – the set of observation symbols
train(labeled_sequences=None, unlabeled_sequences=None, **kwargs)[source]¶
Trains the HMM using both (or either of) supervised and unsupervised techniques.
Returns: the trained model
Return type: HiddenMarkovModelTagger
Parameters: - labelled_sequences (list) – the supervised training data, a set of labelled sequences of observations ex: [ (word_1, tag_1),...,(word_n, tag_n) ]
- unlabeled_sequences (list) – the unsupervised training data, a set of sequences of observations ex: [ word_1, …, word_n ]
- kwargs – additional arguments to pass to the training methods
train_supervised(labelled_sequences, estimator=None)[source]¶
Supervised training maximising the joint probability of the symbol and state sequences. This is done via collecting frequencies of transitions between states, symbol observations while within each state and which states start a sentence. These frequency distributions are then normalised into probability estimates, which can be smoothed if desired.
Returns: the trained model
Return type: HiddenMarkovModelTagger
Parameters: - labelled_sequences (list) – the training data, a set of labelled sequences of observations
- estimator – a function taking a FreqDist and a number of bins and returning a CProbDistI; otherwise a MLE estimate is used
train_unsupervised(unlabeled_sequences, update_outputs=True, **kwargs)[source]¶
Trains the HMM using the Baum-Welch algorithm to maximise the probability of the data sequence. This is a variant of the EM algorithm, and is unsupervised in that it doesn’t need the state sequences for the symbols. The code is based on ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’, Lawrence Rabiner, IEEE, 1989.
Returns: the trained model
Return type: HiddenMarkovModelTagger
Parameters: unlabeled_sequences (list) – the training data, a set of sequences of observations
kwargs may include the following parameters:
Parameters: - model – a HiddenMarkovModelTagger instance used to begin the Baum-Welch algorithm
- max_iterations – the maximum number of EM iterations
- convergence_logprob – the maximum change in log probability to allow convergence
nltk.tag.hunpos module¶
A module for interfacing with the HunPos open-source POS-tagger.
class nltk.tag.hunpos.HunposTagger(path_to_model, path_to_bin=None, encoding='ISO-8859-1', verbose=False)[source]¶
Bases: nltk.tag.api.TaggerI
A class for pos tagging with HunPos. The input is the paths to:
- a model trained on training data
- (optionally) the path to the hunpos-tag binary
- (optionally) the encoding of the training data (default: ISO-8859-1)
Example:
>>> from nltk.tag import HunposTagger
>>> ht = HunposTagger('en_wsj.model')
>>> ht.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
>>> ht.close()
This class communicates with the hunpos-tag binary via pipes. When the tagger object is no longer needed, the close() method should be called to free system resources. The class supports the context manager interface; if used in a with statement, the close() method is invoked automatically:
>>> with HunposTagger('en_wsj.model') as ht:
...     ht.tag('What is the airspeed of an unladen swallow ?'.split())
...
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
nltk.tag.mapping module¶
Interface for converting POS tags from various treebanks to the universal tagset of Petrov, Das, & McDonald.
The tagset consists of the following 12 coarse tags:
VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
@see: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/
nltk.tag.mapping.map_tag(source, target, source_tag)[source]¶
Maps the tag from the source tagset to the target tagset.
>>> map_tag('en-ptb', 'universal', 'VBZ')
'VERB'
>>> map_tag('en-ptb', 'universal', 'VBP')
'VERB'
>>> map_tag('en-ptb', 'universal', '``')
'.'
nltk.tag.mapping.tagset_mapping(source, target)[source]¶
Retrieve the mapping dictionary between tagsets.
>>> tagset_mapping('ru-rnc', 'universal') == {'!': '.', 'A': 'ADJ', 'C': 'CONJ', 'AD': 'ADV', 'NN': 'NOUN', 'VG': 'VERB', 'COMP': 'CONJ', 'NC': 'NUM', 'VP': 'VERB', 'P': 'ADP', 'IJ': 'X', 'V': 'VERB', 'Z': 'X', 'VI': 'VERB', 'YES_NO_SENT': 'X', 'PTCL': 'PRT'}
True
nltk.tag.perceptron module¶
class nltk.tag.perceptron.AveragedPerceptron[source]¶
Bases: object
An averaged perceptron, as implemented by Matthew Honnibal.
See more implementation details here: https://explosion.ai/blog/part-of-speech-pos-tagger-in-python
class nltk.tag.perceptron.PerceptronTagger(load=True)[source]¶
Bases: nltk.tag.api.TaggerI
Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal. See more implementation details here: https://explosion.ai/blog/part-of-speech-pos-tagger-in-python
>>> from nltk.tag.perceptron import PerceptronTagger
Train the model
>>> tagger = PerceptronTagger(load=False)
>>> tagger.train([[('today','NN'), ('is','VBZ'), ('good','JJ'), ('day','NN')],
... [('yes','NNS'), ('it','PRP'), ('beautiful','JJ')]])
>>> tagger.tag(['today','is','a','beautiful','day'])
[('today', 'NN'), ('is', 'PRP'), ('a', 'PRP'), ('beautiful', 'JJ'), ('day', 'NN')]
Use the pretrained model (the default constructor)
>>> pretrain = PerceptronTagger()
>>> pretrain.tag('The quick brown fox jumps over the lazy dog'.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
>>> pretrain.tag("The red cat".split())
[('The', 'DT'), ('red', 'JJ'), ('cat', 'NN')]
END = ['-END-', '-END2-']¶
START = ['-START-', '-START2-']¶
normalize(word)[source]¶
Normalization used in pre-processing.
- All words are lower cased
- Groups of digits of length 4 are represented as !YEAR;
- Other digits are represented as !DIGITS
Return type: str
train(sentences, save_loc=None, nr_iter=5)[source]¶
Train a model from sentences, and save it at save_loc. nr_iter controls the number of Perceptron training iterations.
Parameters: - sentences – A list or iterator of sentences, where each sentence is a list of (word, tag) tuples.
- save_loc – If not None, saves a pickled model in this location.
- nr_iter – Number of training iterations.
unicode_repr¶
Return repr(self).
nltk.tag.senna module¶
Senna POS tagger, NER Tagger, Chunk Tagger
The input is:
- path to the directory that contains SENNA executables. If the path is incorrect, SennaTagger will automatically search for the executable file specified in the SENNA environment variable
- (optionally) the encoding of the input data (default: utf-8)
Note: Unit tests for this module can be found in test/unit/test_senna.py
>>> from nltk.tag import SennaTagger
>>> tagger = SennaTagger('/usr/share/senna-v3.0')
>>> tagger.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'),
('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'NN'), ('?', '.')]
>>> from nltk.tag import SennaChunkTagger
>>> chktagger = SennaChunkTagger('/usr/share/senna-v3.0')
>>> chktagger.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'),
('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'),
('?', 'O')]
>>> from nltk.tag import SennaNERTagger
>>> nertagger = SennaNERTagger('/usr/share/senna-v3.0')
>>> nertagger.tag('Shakespeare theatre was in London .'.split())
[('Shakespeare', 'B-PER'), ('theatre', 'O'), ('was', 'O'), ('in', 'O'),
('London', 'B-LOC'), ('.', 'O')]
>>> nertagger.tag('UN headquarters are in NY , USA .'.split())
[('UN', 'B-ORG'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'),
('NY', 'B-LOC'), (',', 'O'), ('USA', 'B-LOC'), ('.', 'O')]
class nltk.tag.senna.SennaChunkTagger(path, encoding='utf-8')[source]¶
Bases: nltk.classify.senna.Senna
bio_to_chunks(tagged_sent, chunk_type)[source]¶
Extracts the chunks in a BIO chunk-tagged sentence.
>>> from nltk.tag import SennaChunkTagger
>>> chktagger = SennaChunkTagger('/usr/share/senna-v3.0')
>>> sent = 'What is the airspeed of an unladen swallow ?'.split()
>>> tagged_sent = chktagger.tag(sent)
>>> tagged_sent
[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
>>> list(chktagger.bio_to_chunks(tagged_sent, chunk_type='NP'))
[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
Parameters: - tagged_sent (list(tuple(str, str))) – A list of tuples of word and BIO chunk tag.
- chunk_type (str) – The chunk tag that users want to extract, e.g. ‘NP’ or ‘VP’
Returns: An iterable of tuples of chunks that users want to extract and their corresponding indices.
Return type: iter(tuple(str))
tag_sents(sentences)[source]¶
Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).
unicode_repr¶
Return repr(self).
class nltk.tag.senna.SennaNERTagger(path, encoding='utf-8')[source]¶
Bases: nltk.classify.senna.Senna
tag_sents(sentences)[source]¶
Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).
unicode_repr¶
Return repr(self).
nltk.tag.sequential module¶
Classes for tagging sentences sequentially, left to right. The
abstract base class SequentialBackoffTagger serves as the base
class for all the taggers in this module. Tagging of individual words
is performed by the method choose_tag()
, which is defined by
subclasses of SequentialBackoffTagger. If a tagger is unable to
determine a tag for the specified token, then its backoff tagger is
consulted instead. Any SequentialBackoffTagger may serve as a
backoff tagger for any other SequentialBackoffTagger.
class nltk.tag.sequential.AffixTagger(train=None, model=None, affix_length=-3, min_stem_length=2, backoff=None, cutoff=0, verbose=False)[source]¶
Bases: nltk.tag.sequential.ContextTagger
A tagger that chooses a token’s tag based on a leading or trailing substring of its word string. (It is important to note that these substrings are not necessarily “true” morphological affixes). In particular, a fixed-length substring of the word is looked up in a table, and the corresponding tag is returned. Affix taggers are typically constructed by training them on a tagged corpus.
Construct a new affix tagger.
Parameters: - affix_length – The length of the affixes that should be considered during training and tagging. Use negative numbers for suffixes.
- min_stem_length – Any words whose length is less than min_stem_length+abs(affix_length) will be assigned a tag of None by this tagger.
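A usage sketch (illustrative, not from the original documentation): training a three-character suffix tagger on a slice of the Brown corpus.
>>> from nltk.corpus import brown
>>> from nltk.tag import AffixTagger
>>> train_sents = brown.tagged_sents(categories='news')[:500]  # illustrative slice
>>> suffix_tagger = AffixTagger(train_sents, affix_length=-3)
>>> tagged = suffix_tagger.tag('the happiest investigation'.split())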
context(tokens, index, history)[source]¶
Returns: the context that should be used to look up the tag for the specified token; or None if the specified token should not be handled by this tagger.
Return type: (hashable)
json_tag = 'nltk.tag.sequential.AffixTagger'¶
class nltk.tag.sequential.BigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]¶
Bases: nltk.tag.sequential.NgramTagger
A tagger that chooses a token’s tag based on its word string and on the preceding word’s tag. In particular, a tuple consisting of the previous tag and the word is looked up in a table, and the corresponding tag is returned.
Parameters: - train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
- model (dict) – The tagger model
- backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
- cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
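A typical usage sketch (illustrative, not from the original documentation) trains a BigramTagger with a UnigramTagger as backoff, so contexts unseen during training fall back to the unigram model:
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> train_sents = brown.tagged_sents(categories='news')[:500]  # illustrative slice
>>> uni = UnigramTagger(train_sents)
>>> bi = BigramTagger(train_sents, backoff=uni)
>>> tagged = bi.tag(brown.sents(categories='news')[501])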
json_tag = 'nltk.tag.sequential.BigramTagger'¶
class nltk.tag.sequential.ClassifierBasedPOSTagger(feature_detector=None, train=None, classifier_builder=<bound method NaiveBayesClassifier.train of <class 'nltk.classify.naivebayes.NaiveBayesClassifier'>>, classifier=None, backoff=None, cutoff_prob=None, verbose=False)[source]¶
Bases: nltk.tag.sequential.ClassifierBasedTagger
A classifier-based part of speech tagger.
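A usage sketch (illustrative only; training the default Naive Bayes classifier on even a few hundred sentences can take a little while):
>>> from nltk.corpus import brown
>>> from nltk.tag.sequential import ClassifierBasedPOSTagger
>>> train_sents = brown.tagged_sents(categories='news')[:200]  # illustrative slice
>>> cb_tagger = ClassifierBasedPOSTagger(train=train_sents)
>>> tagged = cb_tagger.tag('The jury praised the administration .'.split())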
class nltk.tag.sequential.ClassifierBasedTagger(feature_detector=None, train=None, classifier_builder=<bound method NaiveBayesClassifier.train of <class 'nltk.classify.naivebayes.NaiveBayesClassifier'>>, classifier=None, backoff=None, cutoff_prob=None, verbose=False)[source]¶
Bases: nltk.tag.sequential.SequentialBackoffTagger, nltk.tag.api.FeaturesetTaggerI
A sequential tagger that uses a classifier to choose the tag for each token in a sentence. The featureset input for the classifier is generated by a feature detector function:
feature_detector(tokens, index, history) -> featureset
Where tokens is the list of unlabeled tokens in the sentence; index is the index of the token for which feature detection should be performed; and history is list of the tags for all tokens before index.
Construct a new classifier-based sequential tagger.
Parameters: - feature_detector – A function used to generate the featureset input for the classifier:: feature_detector(tokens, index, history) -> featureset
- train – A tagged corpus consisting of a list of tagged sentences, where each sentence is a list of (word, tag) tuples.
- backoff – A backoff tagger, to be used by the new tagger if it encounters an unknown context.
- classifier_builder – A function used to train a new classifier based on the data in train. It should take one argument, a list of labeled featuresets (i.e., (featureset, label) tuples).
- classifier – The classifier that should be used by the tagger. This is only useful if you want to manually construct the classifier; normally, you would use train instead.
- backoff – A backoff tagger, used if this tagger is unable to determine a tag for a given token.
- cutoff_prob – If specified, then this tagger will fall back on its backoff tagger if the probability of the most likely tag is less than cutoff_prob.
choose_tag(tokens, index, history)[source]¶
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
Return type: str
Parameters: - tokens (list) – The list of words that are being tagged.
- index (int) – The index of the word whose tag should be returned.
- history (list(str)) – A list of the tags for all words before index.
classifier()[source]¶
Return the classifier that this tagger uses to choose a tag for each word in a sentence. The input for this classifier is generated using this tagger’s feature detector. See feature_detector().
feature_detector(tokens, index, history)[source]¶
Return the feature detector that this tagger uses to generate featuresets for its classifier. The feature detector is a function with the signature:
feature_detector(tokens, index, history) -> featureset
See classifier().
unicode_repr()¶
Return repr(self).
class nltk.tag.sequential.ContextTagger(context_to_tag, backoff=None)[source]¶
Bases: nltk.tag.sequential.SequentialBackoffTagger
An abstract base class for sequential backoff taggers that choose a tag for a token based on the value of its “context”. Different subclasses are used to define different contexts.
A ContextTagger chooses the tag for a token by calculating the token’s context, and looking up the corresponding tag in a table. This table can be constructed manually; or it can be automatically constructed based on a training corpus, using the _train() factory method.
Variables: _context_to_tag – Dictionary mapping contexts to tags.
choose_tag(tokens, index, history)[source]¶
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
Return type: str
Parameters: - tokens (list) – The list of words that are being tagged.
- index (int) – The index of the word whose tag should be returned.
- history (list(str)) – A list of the tags for all words before index.
context(tokens, index, history)[source]¶
Returns: the context that should be used to look up the tag for the specified token; or None if the specified token should not be handled by this tagger.
Return type: (hashable)
size()[source]¶
Returns: The number of entries in the table used by this tagger to map from contexts to tags.
unicode_repr()¶
Return repr(self).
class nltk.tag.sequential.DefaultTagger(tag)[source]¶
Bases: nltk.tag.sequential.SequentialBackoffTagger
A tagger that assigns the same tag to every token.
>>> from nltk.tag import DefaultTagger
>>> default_tagger = DefaultTagger('NN')
>>> list(default_tagger.tag('This is a test'.split()))
[('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('test', 'NN')]
This tagger is recommended as a backoff tagger, in cases where a more powerful tagger is unable to assign a tag to the word (e.g. because the word was not seen during training).
Parameters: tag (str) – The tag to assign to each token
choose_tag(tokens, index, history)[source]¶
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
Return type: str
Parameters: - tokens (list) – The list of words that are being tagged.
- index (int) – The index of the word whose tag should be returned.
- history (list(str)) – A list of the tags for all words before index.
json_tag = 'nltk.tag.sequential.DefaultTagger'¶
unicode_repr()¶
Return repr(self).
class nltk.tag.sequential.NgramTagger(n, train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]¶
Bases: nltk.tag.sequential.ContextTagger
A tagger that chooses a token’s tag based on its word string and on the preceding n words’ tags. In particular, a tuple (tags[i-n:i-1], words[i]) is looked up in a table, and the corresponding tag is returned. N-gram taggers are typically trained on a tagged corpus.
Train a new NgramTagger using the given training data or the supplied model. In particular, construct a new tagger whose table maps from each context (tag[i-n:i-1], word[i]) to the most frequent tag for that context. But exclude any contexts that are already tagged perfectly by the backoff tagger.
Parameters: - train – A tagged corpus consisting of a list of tagged sentences, where each sentence is a list of (word, tag) tuples.
- backoff – A backoff tagger, to be used by the new tagger if it encounters an unknown context.
- cutoff – If the most likely tag for a context occurs fewer than cutoff times, then exclude it from the context-to-tag table for the new tagger.
context(tokens, index, history)[source]¶
Returns: the context that should be used to look up the tag for the specified token; or None if the specified token should not be handled by this tagger.
Return type: (hashable)
json_tag = 'nltk.tag.sequential.NgramTagger'¶
class nltk.tag.sequential.RegexpTagger(regexps, backoff=None)[source]¶
Bases: nltk.tag.sequential.SequentialBackoffTagger
Regular Expression Tagger
The RegexpTagger assigns tags to tokens by comparing their word strings to a series of regular expressions. The following tagger uses word suffixes to make guesses about the correct Brown Corpus part of speech tag:
>>> from nltk.corpus import brown
>>> from nltk.tag import RegexpTagger
>>> test_sent = brown.sents(categories='news')[0]
>>> regexp_tagger = RegexpTagger(
...     [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
...      (r'(The|the|A|a|An|an)$', 'AT'),  # articles
...      (r'.*able$', 'JJ'),               # adjectives
...      (r'.*ness$', 'NN'),               # nouns formed from adjectives
...      (r'.*ly$', 'RB'),                 # adverbs
...      (r'.*s$', 'NNS'),                 # plural nouns
...      (r'.*ing$', 'VBG'),               # gerunds
...      (r'.*ed$', 'VBD'),                # past tense verbs
...      (r'.*', 'NN')                     # nouns (default)
... ])
>>> regexp_tagger
<Regexp Tagger: size=9>
>>> regexp_tagger.tag(test_sent)
[('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'), ('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'), ("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', 'NN'), ('no', 'NN'), ('evidence', 'NN'), ("''", 'NN'), ('that', 'NN'), ('any', 'NN'), ('irregularities', 'NNS'), ('took', 'NN'), ('place', 'NN'), ('.', 'NN')]
Parameters: regexps (list(tuple(str, str))) – A list of (regexp, tag) pairs, each of which indicates that a word matching regexp should be tagged with tag. The pairs will be evaluated in order. If none of the regexps match a word, then the optional backoff tagger is invoked, else the word is assigned the tag None.
choose_tag(tokens, index, history)[source]¶
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
Return type: str
Parameters: - tokens (list) – The list of words that are being tagged.
- index (int) – The index of the word whose tag should be returned.
- history (list(str)) – A list of the tags for all words before index.
json_tag = 'nltk.tag.sequential.RegexpTagger'¶
unicode_repr()¶
Return repr(self).
class nltk.tag.sequential.SequentialBackoffTagger(backoff=None)[source]¶
Bases: nltk.tag.api.TaggerI
An abstract base class for taggers that tag words sequentially, left to right. Tagging of individual words is performed by the choose_tag() method, which should be defined by subclasses. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.
Variables: _taggers – A list of all the taggers that should be tried to tag a token (i.e., self and its backoff taggers).
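As a minimal sketch of the backoff protocol (the class below is invented for illustration), choose_tag() returns None to hand the decision to the backoff tagger:
>>> from nltk.tag.sequential import SequentialBackoffTagger
>>> from nltk.tag import DefaultTagger
>>> class VowelTagger(SequentialBackoffTagger):
...     """Toy tagger: tags words that start with a vowel, defers otherwise."""
...     def choose_tag(self, tokens, index, history):
...         word = tokens[index]
...         return 'VOWEL' if word[0].lower() in 'aeiou' else None  # None -> consult backoff
>>> tagger = VowelTagger(backoff=DefaultTagger('OTHER'))
>>> tagger.tag('an odd little example'.split())
[('an', 'VOWEL'), ('odd', 'VOWEL'), ('little', 'OTHER'), ('example', 'VOWEL')]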
backoff¶
The backoff tagger for this tagger.
choose_tag(tokens, index, history)[source]¶
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
Return type: str
Parameters: - tokens (list) – The list of words that are being tagged.
- index (int) – The index of the word whose tag should be returned.
- history (list(str)) – A list of the tags for all words before index.
tag(tokens)[source]¶
Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag).
Return type: list(tuple(str, str))
tag_one(tokens, index, history)[source]¶
Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.
Return type: str
Parameters: - tokens (list) – The list of words that are being tagged.
- index (int) – The index of the word whose tag should be returned.
- history (list(str)) – A list of the tags for all words before index.
class nltk.tag.sequential.TrigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]¶
Bases: nltk.tag.sequential.NgramTagger
A tagger that chooses a token’s tag based on its word string and on the preceding two words’ tags. In particular, a tuple consisting of the previous two tags and the word is looked up in a table, and the corresponding tag is returned.
Parameters: - train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
- model (dict) – The tagger model
- backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
- cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
json_tag = 'nltk.tag.sequential.TrigramTagger'¶
class nltk.tag.sequential.UnigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]¶
Bases: nltk.tag.sequential.NgramTagger
Unigram Tagger
The UnigramTagger finds the most likely tag for each word in a training corpus, and then uses that information to assign tags to new tokens.
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> test_sent = brown.sents(categories='news')[0]
>>> unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> for tok, tag in unigram_tagger.tag(test_sent):
...     print("(%s, %s), " % (tok, tag))
(The, AT), (Fulton, NP-TL), (County, NN-TL), (Grand, JJ-TL), (Jury, NN-TL),
(said, VBD), (Friday, NR), (an, AT), (investigation, NN), (of, IN),
(Atlanta's, NP$), (recent, JJ), (primary, NN), (election, NN), (produced, VBD),
(``, ``), (no, AT), (evidence, NN), ('', ''), (that, CS), (any, DTI),
(irregularities, NNS), (took, VBD), (place, NN), (., .),
Parameters: - train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
- model (dict) – The tagger model
- backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
- cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
context(tokens, index, history)[source]¶
Returns: the context that should be used to look up the tag for the specified token; or None if the specified token should not be handled by this tagger.
Return type: (hashable)
json_tag = 'nltk.tag.sequential.UnigramTagger'¶
nltk.tag.stanford module¶
A module for interfacing with the Stanford taggers.
Tagger models need to be downloaded from https://nlp.stanford.edu/software and the STANFORD_MODELS environment variable set (a colon-separated list of paths).
For more details see the documentation for StanfordPOSTagger and StanfordNERTagger.
class nltk.tag.stanford.StanfordNERTagger(*args, **kwargs)[source]¶
Bases: nltk.tag.stanford.StanfordTagger
A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:
- a model trained on training data
- (optionally) the path to the stanford tagger jar file. If not specified here, then this jar file must be specified in the CLASSPATH environment variable.
- (optionally) the encoding of the training data (default: UTF-8)
Example:
>>> from nltk.tag import StanfordNERTagger
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
>>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
class nltk.tag.stanford.StanfordPOSTagger(*args, **kwargs)[source]¶
Bases: nltk.tag.stanford.StanfordTagger
A class for pos tagging with Stanford Tagger. The input is the paths to:
- a model trained on training data
- (optionally) the path to the stanford tagger jar file. If not specified here, then this jar file must be specified in the CLASSPATH environment variable.
- (optionally) the encoding of the training data (default: UTF-8)
Example:
>>> from nltk.tag import StanfordPOSTagger
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
>>> st.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
class nltk.tag.stanford.StanfordTagger(model_filename, path_to_jar=None, encoding='utf8', verbose=False, java_options='-mx1000m')[source]¶
Bases: nltk.tag.api.TaggerI
An interface to Stanford taggers. Subclasses must define:
- _cmd property: A property that returns the command that will be executed.
- _SEPARATOR: Class constant that represents the character that is used to separate the tokens from their tags.
- _JAR file: Class constant that represents the jar file name.
nltk.tag.tnt module¶
Implementation of ‘TnT - A Statisical Part of Speech Tagger’ by Thorsten Brants
http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf
class nltk.tag.tnt.TnT(unk=None, Trained=False, N=1000, C=False)[source]¶
Bases: nltk.tag.api.TaggerI
TnT - Statistical POS tagger
IMPORTANT NOTES:
- DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
- It is possible to provide an untrained POS tagger to create tags for unknown words, see __init__ function
- SHOULD BE USED WITH SENTENCE-DELIMITED INPUT
- Due to the nature of this tagger, it works best when trained over sentence delimited input.
- However it still produces good results if the training data and testing data are separated on all punctuation eg: [,.?!]
- Input for training is expected to be a list of sentences where each sentence is a list of (word, tag) tuples
- Input for tag function is a single sentence Input for tagdata function is a list of sentences Output is of a similar form
- Function provided to process text that is unsegmented
- Please see basic_sent_chop()
TnT uses a second order Markov model to produce tags for a sequence of input, specifically:
argmax [Proj(P(t_i|t_i-1,t_i-2)P(w_i|t_i))] P(t_T+1 | t_T)
IE: the maximum projection of a set of probabilities
The set of possible tags for a given word is derived from the training data. It is the set of all tags that exact word has been assigned.
To speed up and get more precision, we can use log addition instead of multiplication, specifically:
argmax [Sigma(log(P(t_i|t_i-1,t_i-2)) + log(P(w_i|t_i)))] + log(P(t_T+1|t_T))
The probability of a tag for a given word is the linear interpolation of 3 markov models; a zero-order, first-order, and a second order model.
P(t_i| t_i-1, t_i-2) = l1*P(t_i) + l2*P(t_i| t_i-1) + l3*P(t_i| t_i-1, t_i-2)
A beam search is used to limit the memory usage of the algorithm. The degree of the beam can be changed using N in the initialization. N represents the maximum number of possible solutions to maintain while tagging.
It is possible to differentiate the tags which are assigned to capitalized words. However this does not result in a significant gain in the accuracy of the results.
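A short usage sketch (not from the original documentation; the corpus slice is arbitrary, and unseen words are only handled specially if an unknown-word tagger is passed via unk):
>>> from nltk.corpus import treebank
>>> from nltk.tag.tnt import TnT
>>> train_sents = treebank.tagged_sents()[:300]  # illustrative slice
>>> tnt_tagger = TnT()
>>> tnt_tagger.train(train_sents)
>>> tagged = tnt_tagger.tagdata([['This', 'is', 'a', 'test', '.']])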
tag(data)[source]¶
Tags a single sentence
Parameters: data ([string,]) – list of words
Returns: [(word, tag),]
Calls the recursive function ‘_tagword’ to produce a list of tags
Associates the sequence of returned tags with the correct words in the input sequence
returns a list of (word, tag) tuples
tagdata(data)[source]¶
Tags each sentence in a list of sentences
Parameters: data ([[string,],]) – list of lists of words
Returns: list of lists of (word, tag) tuples
Invokes the tag(sent) function for each sentence, and compiles the results into a list of tagged sentences; each tagged sentence is a list of (word, tag) tuples
- DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
nltk.tag.tnt.basic_sent_chop(data, raw=True)[source]¶
Basic method for tokenizing input into sentences for this tagger:
Parameters: - data (str or tuple(str, str)) – list of tokens (words or (word, tag) tuples)
- raw (bool) – boolean flag marking the input data as a list of words or a list of tagged words
Returns: list of sentences; sentences are a list of tokens; tokens are the same as the input
The function takes a list of tokens and separates the tokens into lists where each list represents a sentence fragment. This function can separate both tagged and raw sequences into basic sentences.
Sentence markers are the set of [,.!?]
This is a simple method which enhances the performance of the TnT tagger. Better sentence tokenization will further enhance the results.
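For example (a sketch; note that the sentence-marker token stays attached to the end of its fragment):
>>> from nltk.tag.tnt import basic_sent_chop
>>> basic_sent_chop('Hello world . How are you ?'.split(), raw=True)
[['Hello', 'world', '.'], ['How', 'are', 'you', '?']]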
nltk.tag.util module¶
nltk.tag.util.str2tuple(s, sep='/')[source]¶
Given the string representation of a tagged token, return the corresponding tuple representation. The rightmost occurrence of sep in s will be used to divide s into a word string and a tag string. If sep does not occur in s, return (s, None).
>>> from nltk.tag.util import str2tuple
>>> str2tuple('fly/NN')
('fly', 'NN')
Parameters: - s (str) – The string representation of a tagged token.
- sep (str) – The separator string used to separate word strings from tags.
nltk.tag.util.tuple2str(tagged_token, sep='/')[source]¶
Given the tuple representation of a tagged token, return the corresponding string representation. This representation is formed by concatenating the token’s word string, followed by the separator, followed by the token’s tag. (If the tag is None, then just return the bare word string.)
>>> from nltk.tag.util import tuple2str
>>> tagged_token = ('fly', 'NN')
>>> tuple2str(tagged_token)
'fly/NN'
Parameters: - tagged_token (tuple(str, str)) – The tuple representation of a tagged token.
- sep (str) – The separator string used to separate word strings from tags.
nltk.tag.util.untag(tagged_sentence)[source]¶
Given a tagged sentence, return an untagged version of that sentence. I.e., return a list containing the first element of each tuple in tagged_sentence.
>>> from nltk.tag.util import untag
>>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
['John', 'saw', 'Mary']
Module contents¶
NLTK Taggers
This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
A “tag” is a case-sensitive string that specifies some property of a token,
such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):
>>> tagged_tok = ('fly', 'NN')
An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
A Russian tagger is also available if you specify lang=”rus”. It uses the Russian National Corpus tagset:
>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]
This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
... print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None
Note that words that the tagger has not seen during training receive a tag
of None
.
We evaluate a tagger on data that was not seen during training:
>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.73...
For more information, please consult chapter 5 of the NLTK Book.
nltk.tag.pos_tag(tokens, tagset=None, lang='eng')[source]¶
Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens.
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'), ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
NB. Use pos_tag_sents() for efficient tagging of more than one sentence.
Parameters: - tokens (list(str)) – Sequence of tokens to be tagged
- tagset (str) – the tagset to be used, e.g. universal, wsj, brown
- lang (str) – the ISO 639 code of the language, e.g. ‘eng’ for English, ‘rus’ for Russian
Returns: The tagged tokens
Return type: list(tuple(str, str))
nltk.tag.pos_tag_sents(sentences, tagset=None, lang='eng')[source]¶
Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.
Parameters: - sentences (list(list(str))) – List of sentences to be tagged
- tagset (str) – the tagset to be used, e.g. universal, wsj, brown
- lang (str) – the ISO 639 code of the language, e.g. ‘eng’ for English, ‘rus’ for Russian
Returns: The list of tagged sentences
Return type: list(list(tuple(str, str)))
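For example (illustrative; requires the standard NLTK tagger and tokenizer data to be installed):
>>> from nltk import word_tokenize
>>> from nltk.tag import pos_tag_sents
>>> sents = [word_tokenize("The cat sat on the mat."), word_tokenize("Dogs bark.")]
>>> tagged_sents = pos_tag_sents(sents)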