A sentence tokenizer which uses an unsupervised algorithm to build a
model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries. This
approach has been shown to work well for many European languages.
__init__(self, train_text=None, verbose=False)
    train_text can either be the sole training text for this sentence
    boundary detector, or can be a PunktParameters object.
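As an illustration (not taken from this page), a minimal construction sketch
assuming the documented class is NLTK's PunktSentenceTokenizer from
nltk.tokenize.punkt; the corpus path is a placeholder:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    # Option 1: supply raw training text directly to the constructor.
    raw_text = open('corpus.txt').read()      # placeholder path
    tokenizer = PunktSentenceTokenizer(train_text=raw_text)

    # Option 2: supply a precomputed PunktParameters object instead of text.
    params = PunktParameters()
    params.abbrev_types.add('dr')             # mark "Dr." as a known abbreviation
    tokenizer = PunktSentenceTokenizer(params)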
train(self, train_text, verbose=False)
    Derives parameters from a given training text, or uses the parameters
    given.
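Under the same PunktSentenceTokenizer assumption, a sketch of training after
construction; the short string stands in for a larger corpus:

    tokenizer = PunktSentenceTokenizer()
    tokenizer.train("A long stretch of raw, unannotated text drawn from the "
                    "target domain. The algorithm is unsupervised, so no "
                    "labelled sentence boundaries are needed.", verbose=False)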
Inherited from api.TokenizerI: batch_tokenize
Inherited from object: __delattr__, __getattribute__, __hash__, __new__,
    __reduce__, __reduce_ex__, __repr__, __setattr__, __str__
tokenize(self, text, realign_boundaries=False)
    Given a text, returns a list of the sentences in that text.
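A short usage sketch, continuing the assumption above; the sample text is
illustrative only:

    text = "Mr. Smith arrived at 9 a.m. on Tuesday. He left an hour later."
    for sentence in tokenizer.tokenize(text, realign_boundaries=False):
        print(sentence)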
sentences_from_text(self, text, realign_boundaries=False)
    Given a text, generates the sentences in that text by only testing
    candidate sentence breaks.
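A sketch of the same text processed through sentences_from_text; iterating
over the result works whether it is returned as a list or generated lazily:

    for sentence in tokenizer.sentences_from_text(text, realign_boundaries=False):
        print(sentence)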
text_contains_sentbreak(self, text)
    Returns True if the given text includes a sentence break.
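A quick check sketch under the same assumptions; the sample string is
illustrative only:

    if tokenizer.text_contains_sentbreak("Dr. Watson came in. He sat down."):
        print("the text spans more than one sentence")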
sentences_from_tokens(self, tokens)
    Given a sequence of tokens, generates lists of tokens, each list
    corresponding to a sentence.
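A sketch of the token-level interface, assuming the input is a flat sequence
of word-level token strings:

    tokens = ['Mr.', 'Smith', 'arrived', 'yesterday', '.', 'He', 'left', 'today', '.']
    for sentence_tokens in tokenizer.sentences_from_tokens(tokens):
        print(sentence_tokens)       # one list of tokens per sentence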
_annotate_tokens(self, tokens)
    Given a set of tokens augmented with markers for line-start and
    paragraph-start, returns an iterator through those tokens with full
    annotation including predicted sentence breaks.
_build_sentence_list(self, text, tokens)
    Given the original text and the list of augmented word tokens,
    construct and return a tokenized list of sentence strings.
_annotate_second_pass(self, tokens)
    Performs a token-based classification (section 4) over the given
    tokens, making use of the orthographic heuristic (4.1.1), collocation
    heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
_second_pass_annotation(self, aug_tok1, aug_tok2)
    Performs token-based classification over a pair of contiguous tokens,
    returning an updated augmented token for the first of them.
_ortho_heuristic(self, aug_tok)
    Decide whether the given token is the first token in a sentence.