Module nltk.tokenize.punkt

Class PunktTrainer


     object --+    
              |    
_PunktBaseClass --+
                  |
                 PunktTrainer

Learns parameters used in Punkt sentence boundary detection.
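A minimal usage sketch (hedged: raw_text stands in for any training corpus supplied as a single string): train on raw text, then hand the learned parameters to a PunktSentenceTokenizer.

    >>> from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer
    >>> trainer = PunktTrainer()
    >>> trainer.train(raw_text)  # raw_text: your training corpus as one string
    >>> tokenizer = PunktSentenceTokenizer(trainer.get_params())
    >>> sentences = tokenizer.tokenize("Mr. Smith arrived. He sat down.")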

Nested Classes

Inherited from _PunktBaseClass (private): _Token

Instance Methods
 
__init__(self, train_text=None, verbose=False)
 
get_params(self)
Calculates and returns parameters for sentence boundary detection as derived from training.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

    Training
 
train(self, text, verbose=False, finalize=True)
Collects training data from a given text.
 
train_tokens(self, tokens, verbose=False, finalize=True)
Collects training data from a given list of tokens.
 
_train_tokens(self, tokens, verbose)
 
_unique_types(self, tokens)
 
finalize_training(self, verbose=False)
Uses data that has been gathered in training to determine likely collocations and sentence starters.
    Overhead reduction
 
freq_threshold(self, ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)
Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training.
 
_freq_threshold(self, fdist, threshold)
Returns a FreqDist containing only the entries whose counts meet a given threshold, together with a mapping (None -> count of entries removed).
    Orthographic data
 
_get_orthography_data(self, tokens)
Collect information about whether each token type occurs with different case patterns (i) overall, (ii) at sentence-initial positions, and (iii) at sentence-internal positions.
    Abbreviations
 
_reclassify_abbrev_types(self, types)
(Re)classifies the given token types as abbreviations or non-abbreviations.
 
find_abbrev_types(self)
Recalculates abbreviations from type frequencies alone, without relying on any prior determination of abbreviations.
 
_is_rare_abbrev_type(self, cur_tok, next_tok)
A word type is counted as a rare abbreviation if...
    Collocation Finder
 
_is_potential_collocation(self, aug_tok1, aug_tok2)
Returns True if the pair of tokens may form a collocation given log-likelihood statistics.
 
_find_collocations(self)
Generates likely collocations and their log-likelihood.
    Sentence-Starter Finder
 
_is_potential_sent_starter(self, cur_tok, prev_tok)
Returns True if, given a token and the token that precedes it, it seems clear that the token begins a sentence.
 
_find_sent_starters(self)
Uses collocation heuristics for each candidate token to determine if it frequently starts sentences.
 
_get_sentbreak_count(self, tokens)
Returns the number of sentence breaks marked in a given set of augmented tokens.
    Word tokenization

Inherited from _PunktBaseClass (private): _tokenize_words

Static Methods
    Log Likelihoods
 
_dunning_log_likelihood(count_a, count_b, count_ab, N)
A function that calculates the modified Dunning log-likelihood ratio scores for abbreviation candidates.
 
_col_log_likelihood(count_a, count_b, count_ab, N)
A function that computes the plain log-likelihood estimate; in the original paper it is described in algorithms 6 and 7.
    Helper Functions

Inherited from _PunktBaseClass: pair_iter

Class Variables
    Customization Variables
  ABBREV = 0.3
cut-off value for deciding whether a token is an abbreviation
  IGNORE_ABBREV_PENALTY = False
allows the abbreviation penalty heuristic to be disabled; that heuristic exponentially disadvantages words that are sometimes found without a final period.
  ABBREV_BACKOFF = 5
upper cut-off for Mikheev's (2002) abbreviation detection algorithm
  COLLOCATION = 7.88
minimal log-likelihood value that two tokens need to be considered as a collocation
  SENT_STARTER = 30
minimal log-likelihood value that a token requires to be considered as a frequent sentence starter
  INTERNAL_PUNCTUATION = ',:;'
sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.
  INCLUDE_ALL_COLLOCS = False
this includes as potential collocations all word pairs where the first word ends in a period.
  INCLUDE_ABBREV_COLLOCS = False
this includes as potential collocations all word pairs where the first word is an abbreviation.
  MIN_COLLOC_FREQ = 1
this sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log likelihood statistics.
Instance Variables
  _type_fdist
A frequency distribution giving the frequency of each case-normalized token type in the training data.
  _num_period_toks
The number of words ending in a period in the training data.
  _collocation_fdist
A frequency distribution giving the frequency of all bigrams in the training data where the first word ends in a period.
  _sent_starter_fdist
A frequency distribution giving the frequency of all words that occur in the training data at the beginning of a sentence (after the first pass of annotation).
  _sentbreak_count
The total number of sentence breaks identified in training, used for calculating the frequent sentence starter heuristic.
  _finalized
A flag as to whether the training has been finalized by finding collocations and sentence starters, or whether finalize_training() still needs to be called.

Inherited from _PunktBaseClass (private): _params

Properties

Inherited from object: __class__

Method Details

__init__(self, train_text=None, verbose=False)
(Constructor)

Overrides: _PunktBaseClass.__init__

train(self, text, verbose=False, finalize=True)


Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.
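For instance, training can be spread over several calls with finalization deferred until the end (a sketch; corpus_chunks is an assumed iterable of text strings):

    >>> trainer = PunktTrainer()
    >>> for chunk in corpus_chunks:
    ...     trainer.train(chunk, finalize=False)
    >>> params = trainer.get_params()  # runs finalize_training() if still pending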

freq_threshold(self, ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)


Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring above the given thresholds will be retained.
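Between batches of such incremental training, pruning might look like the following sketch (first_batch and second_batch are assumed text strings):

    >>> trainer.train(first_batch, finalize=False)
    >>> trainer.freq_threshold()                     # drop statistics on rare tokens
    >>> trainer.train(second_batch, finalize=False)  # continue with the slimmer model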

_reclassify_abbrev_types(self, types)


(Re)classifies each given token if

  • it is period-final and not a known abbreviation; or
  • it is not period-final and is otherwise a known abbreviation

by checking whether its previous classification still holds according to the heuristics of section 3. Yields triples (abbr, score, is_add) where abbr is the type in question, score is its log-likelihood with penalties applied, and is_add specifies whether the present type is a candidate for inclusion or exclusion as an abbreviation, such that:

  • (is_add and score >= 0.3) suggests a new abbreviation; and
  • (not is_add and score < 0.3) suggests excluding an abbreviation.
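To illustrate, a consumer of these triples (such as the training loop) would apply the two rules roughly as in this sketch, where trainer and params stand in for a PunktTrainer and its PunktParameters:

    for abbr, score, is_add in trainer._reclassify_abbrev_types(types):
        if score >= trainer.ABBREV:  # the 0.3 cut-off
            if is_add:
                params.abbrev_types.add(abbr)      # accept a new abbreviation
        elif not is_add:
            params.abbrev_types.discard(abbr)      # drop a former abbreviation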

find_abbrev_types(self)


Recalculates abbreviations from type frequencies alone, without relying on any prior determination of abbreviations. This fails to include abbreviations that would otherwise be found as "rare".

_is_rare_abbrev_type(self, cur_tok, next_tok)


A word type is counted as a rare abbreviation if...

  • it's not already marked as an abbreviation
  • it occurs fewer than ABBREV_BACKOFF times
  • either it is followed by a sentence-internal punctuation mark, *or* it is followed by a lower-case word that sometimes appears with upper case, but never occurs with lower case at the beginning of sentences.

_dunning_log_likelihood(count_a, count_b, count_ab, N)
Static Method


A function that calculates the modified Dunning log-likelihood ratio scores for abbreviation candidates. The details of how this works are available in the paper.
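A sketch of that calculation, restated from the Kiss & Strunk (2006) formulation rather than quoted from this module: the null hypothesis uses the corpus-wide probability of a period, while the alternative assumes an abbreviation is followed by a period almost always (0.99).

    import math

    def dunning_log_likelihood(count_a, count_b, count_ab, N):
        # count_a: occurrences of the candidate type; count_b: periods;
        # count_ab: candidate followed by a period; N: total tokens.
        p1 = count_b / N   # H0: period probability independent of the word
        p2 = 0.99          # H1: the word is an abbreviation
        null_hypo = count_ab * math.log(p1) + (count_a - count_ab) * math.log(1.0 - p1)
        alt_hypo = count_ab * math.log(p2) + (count_a - count_ab) * math.log(1.0 - p2)
        return -2.0 * (null_hypo - alt_hypo)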

_col_log_likelihood(count_a, count_b, count_ab, N)
Static Method


A function that computes the plain log-likelihood estimate; in the original paper it is described in algorithms 6 and 7.

This *should* yield the original Dunning log-likelihood values, unlike _dunning_log_likelihood above, which uses a modified formulation.
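For comparison, a sketch of the standard Dunning ratio for a bigram (w1, w2), written from the textbook formulation and not copied from this module:

    import math

    def col_log_likelihood(count_a, count_b, count_ab, N):
        # count_a: occurrences of w1; count_b: occurrences of w2;
        # count_ab: occurrences of the bigram; N: total tokens.
        p = count_b / N                            # H0: P(w2|w1) == P(w2|~w1)
        p1 = count_ab / count_a                    # H1: P(w2|w1)
        p2 = (count_b - count_ab) / (N - count_a)  # H1: P(w2|~w1)

        summand1 = count_ab * math.log(p) + (count_a - count_ab) * math.log(1.0 - p)
        summand2 = ((count_b - count_ab) * math.log(p)
                    + (N - count_a - count_b + count_ab) * math.log(1.0 - p))

        # Skip terms whose observed proportion is exactly 0 or 1; under the
        # convention 0*log(0) == 0 they contribute nothing, but evaluating
        # them directly would raise a log(0) error.
        summand3 = 0.0
        if count_a != count_ab and count_ab != 0:
            summand3 = (count_ab * math.log(p1)
                        + (count_a - count_ab) * math.log(1.0 - p1))
        summand4 = 0.0
        if count_b != count_ab and count_b - count_ab != N - count_a:
            summand4 = ((count_b - count_ab) * math.log(p2)
                        + (N - count_a - count_b + count_ab) * math.log(1.0 - p2))

        return -2.0 * (summand1 + summand2 - summand3 - summand4)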


Class Variable Details

INCLUDE_ALL_COLLOCS

this includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is a lot of variation that makes abbreviations like Mr difficult to identify.

Value:
False

INCLUDE_ABBREV_COLLOCS

this includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence starter heuristic. This is overridden by INCLUDE_ALL_COLLOCS, and if both are false, only collocations with initials and ordinals are considered.

Value:
False

MIN_COLLOC_FREQ

this sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log likelihood statistics. This is useful when INCLUDE_ALL_COLLOCS is True.

Value:
1
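Because these customization variables are plain class attributes, they can be overridden per instance before training begins; a sketch (raw_text is an assumed training string):

    >>> trainer = PunktTrainer()
    >>> trainer.INCLUDE_ALL_COLLOCS = True  # consider every period-final bigram...
    >>> trainer.MIN_COLLOC_FREQ = 10        # ...but require ten occurrences of each
    >>> trainer.train(raw_text)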

Instance Variable Details

_collocation_fdist

A frequency distribution giving the frequency of all bigrams in the training data where the first word ends in a period. Bigrams are encoded as tuples of word types. Especially common collocations are extracted from this frequency distribution, and stored in _params.collocations.

_sent_starter_fdist

A frequency distribution giving the frequency of all words that occur in the training data at the beginning of a sentence (after the first pass of annotation). Especially common sentence starters are extracted from this frequency distribution, and stored in _params.sent_starters.