Learns parameters used in Punkt sentence boundary detection.
get_params(self)
    Calculates and returns parameters for sentence boundary detection as derived from training.
Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__
train(self, text, verbose=False, finalize=True)
    Collects training data from a given text.
train_tokens(self, tokens, verbose=False, finalize=True)
    Collects training data from a given list of tokens.
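Taken together, train(), finalize_training(), and get_params() support incremental training followed by tokenization. A minimal usage sketch (assuming NLTK is installed; the sample text is illustrative only):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Feed raw text incrementally; finalize=False defers the collocation and
# sentence-starter passes until all text has been seen.
trainer = PunktTrainer()
trainer.train("Dr. Smith went to Washington. He arrived on Jan. 4.", finalize=False)
trainer.train("Mr. Jones stayed home. He was ill.", finalize=False)
trainer.finalize_training()

# get_params() yields the learned parameters for the sentence tokenizer.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = tokenizer.tokenize("He left early. She stayed late.")
```

In practice the trainer needs substantially more text than this to learn reliable abbreviation and collocation statistics.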
finalize_training(self, verbose=False)
    Uses data that has been gathered in training to determine likely collocations and sentence starters.
freq_threshold(self, ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)
    Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training.
_freq_threshold(self, fdist, threshold)
    Returns a FreqDist containing only the entries whose counts meet the given threshold, together with a mapping (None -> count_removed) recording how many entries were dropped.
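The pruning behaviour can be illustrated with a plain Counter (an illustrative re-creation of the idea, not the NLTK source; a FreqDist behaves like a Counter for this purpose):

```python
from collections import Counter

def freq_threshold(fdist, threshold):
    # Keep only entries whose count meets the threshold; record how many
    # entries were dropped under the None key. Illustrative sketch only.
    res = Counter()
    num_removed = 0
    for tok, count in fdist.items():
        if count < threshold:
            num_removed += 1
        else:
            res[tok] = count
    res[None] = num_removed
    return res

pruned = freq_threshold(Counter({"the": 10, "etc": 2, "xyzzy": 1}), 2)
# "xyzzy" is dropped; pruned[None] records one removed entry.
```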
_get_orthography_data(self, tokens)
    Collects information about whether each token type occurs with different case patterns (i) overall, (ii) at sentence-initial positions, and (iii) at sentence-internal positions.
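A simplified sketch of the kind of bookkeeping involved (the real trainer packs these observations into orthographic-context bit flags; the function and names below are illustrative):

```python
from collections import defaultdict

def collect_orthography(sentences):
    # Map each lowercased token type to the (position, case) pairs it has
    # been observed with: position is sentence-initial vs. sentence-internal,
    # case is whether the token appeared capitalized.
    contexts = defaultdict(set)
    for sent in sentences:
        for i, tok in enumerate(sent):
            position = "initial" if i == 0 else "internal"
            case = "upper" if tok[:1].isupper() else "lower"
            contexts[tok.lower()].add((position, case))
    return contexts

ortho = collect_orthography([["The", "cat", "sat"],
                             ["A", "dog", "saw", "the", "cat"]])
```

A type seen capitalized only at sentence-initial positions (like "the" above) is weak evidence for an ordinary word, whereas one capitalized sentence-internally suggests a proper name.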
_is_potential_collocation(self, aug_tok1, aug_tok2)
    Returns True if the pair of tokens may form a collocation given log-likelihood statistics.
_find_collocations(self)
    Generates likely collocations and their log-likelihood.
_is_potential_sent_starter(self, cur_tok, prev_tok)
    Returns True, given a token and the token that precedes it, if it seems clear that the token begins a sentence.
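The check can be sketched as follows (the attribute names mirror Punkt's augmented-token flags, but this is an illustrative re-creation, not the NLTK source):

```python
def is_potential_sent_starter(cur_tok, prev_tok):
    # A token is a sentence-starter candidate when the preceding token was
    # marked as a sentence break, the preceding token is not a number or an
    # initial (such as "J."), and the current token is purely alphabetic.
    return (prev_tok["sentbreak"]
            and not (prev_tok["is_number"] or prev_tok["is_initial"])
            and cur_tok["is_alpha"])

candidate = is_potential_sent_starter(
    {"is_alpha": True},
    {"sentbreak": True, "is_number": False, "is_initial": False})
```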
_find_sent_starters(self)
    Uses collocation heuristics for each candidate token to determine if it frequently starts sentences.
_get_sentbreak_count(self, tokens)
    Returns the number of sentence breaks marked in a given set of augmented tokens.
ABBREV = 0.3
    Cut-off value for deciding whether a token is an abbreviation.
IGNORE_ABBREV_PENALTY = False
    Allows the abbreviation penalty heuristic to be disabled. That heuristic exponentially disadvantages words that are sometimes found without a final period.
ABBREV_BACKOFF = 5
    Upper cut-off for Mikheev's (2002) abbreviation detection algorithm.
COLLOCATION = 7.88
    Minimal log-likelihood value that two tokens need to be considered a collocation.
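Punkt scores candidate pairs with a Dunning-style log-likelihood ratio and compares the score against thresholds such as COLLOCATION. A self-contained sketch of the statistic (not the exact NLTK implementation, which additionally guards against degenerate counts):

```python
import math

def dunning_log_likelihood(count_a, count_b, count_ab, n):
    # -2 log(lambda): compares the hypothesis that b is independent of a
    # (P(b|a) = P(b|not a) = p) against the hypothesis that it is not
    # (P(b|a) = p1, P(b|not a) = p2). Larger scores mean stronger association.
    p = count_b / n
    p1 = count_ab / count_a
    p2 = (count_b - count_ab) / (n - count_a)

    def binom_ll(k, trials, prob):
        # Log-likelihood of k successes in `trials` Bernoulli(prob) draws.
        return k * math.log(prob) + (trials - k) * math.log(1.0 - prob)

    log_lambda = (binom_ll(count_ab, count_a, p)
                  + binom_ll(count_b - count_ab, n - count_a, p)
                  - binom_ll(count_ab, count_a, p1)
                  - binom_ll(count_b - count_ab, n - count_a, p2))
    return -2.0 * log_lambda

score = dunning_log_likelihood(count_a=20, count_b=10, count_ab=8, n=1000)
```

With the defaults, a bigram whose score exceeds COLLOCATION = 7.88 (and that also meets MIN_COLLOC_FREQ) is treated as a collocation.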
SENT_STARTER = 30
    Minimal log-likelihood value that a token requires to be considered a frequent sentence starter.
INTERNAL_PUNCTUATION = ',:;'
    Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.
INCLUDE_ALL_COLLOCS = False
    When True, includes as potential collocations all word pairs where the first word ends in a period.
INCLUDE_ABBREV_COLLOCS = False
    When True, includes as potential collocations all word pairs where the first word is an abbreviation.
MIN_COLLOC_FREQ = 1
    Sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log-likelihood statistics.
_type_fdist
    A frequency distribution giving the frequency of each case-normalized token type in the training data.
_num_period_toks
    The number of words ending in a period in the training data.
_collocation_fdist
    A frequency distribution giving the frequency of all bigrams in the training data where the first word ends in a period.
_sent_starter_fdist
    A frequency distribution giving the frequency of all words that occur in the training data at the beginning of a sentence (after the first pass of annotation).
_sentbreak_count
    The total number of sentence breaks identified in training, used for calculating the frequent sentence starter heuristic.
_finalized
    A flag as to whether the training has been finalized by finding collocations and sentence starters, or whether finalize_training() still needs to be called.