Package nltk :: Package tokenize :: Module punkt :: Class PunktSentenceTokenizer
Class PunktSentenceTokenizer

     object --+    
_PunktBaseClass --+
     object --+   |
              |   |
 api.TokenizerI --+

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

Inherited from _PunktBaseClass (private): _Token

__init__(self, train_text=None, verbose=False)
train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.
train(self, train_text, verbose=False)
Derives parameters from a given training text, or uses the parameters given.

Inherited from api.TokenizerI: batch_tokenize

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

tokenize(self, text, realign_boundaries=False)
Given a text, returns a list of the sentences in that text.
sentences_from_text(self, text, realign_boundaries=False)
Given a text, generates the sentences in that text by only testing candidate sentence breaks.
_sentences_from_text(self, text)
text_contains_sentbreak(self, text)
Returns True if the given text includes a sentence break.
sentences_from_text_legacy(self, text)
Given a text, generates the sentences in that text.
sentences_from_tokens(self, tokens)
Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
_annotate_tokens(self, tokens)
Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks.
_build_sentence_list(self, text, tokens)
Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings.
dump(self, tokens)
_annotate_second_pass(self, tokens)
Performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
_second_pass_annotation(self, aug_tok1, aug_tok2)
Performs token-based classification over a pair of contiguous tokens returning an updated augmented token for the first of them.
_ortho_heuristic(self, aug_tok)
Decide whether the given token is the first token in a sentence.
Inherited from _PunktBaseClass (private): _tokenize_words

_realign_boundaries(self, sents)
Attempts to realign punctuation that falls after the period but should otherwise be included in the same sentence.
Inherited from _PunktBaseClass: pair_iter

  PUNCTUATION = (';', ':', ',', '.', '!', '?')
Inherited from _PunktBaseClass (private): _params

Inherited from object: __class__

__init__(self, train_text=None, verbose=False)

train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.

Overrides: _PunktBaseClass.__init__

train(self, train_text, verbose=False)

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.

tokenize(self, text, realign_boundaries=False)

Given a text, returns a list of the sentences in that text.

Returns: list of str
Overrides: api.TokenizerI.tokenize

sentences_from_text(self, text, realign_boundaries=False)

Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence closing punctuation that follows the period.
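
The idea of testing only candidate sentence breaks can be illustrated with a minimal standard-library sketch. It is not the NLTK implementation: the function name split_on_candidates, the CANDIDATE pattern, and the hard-coded ABBREVIATIONS set are all assumptions made for illustration, whereas Punkt derives its abbreviation model from training text.

```python
import re

# Hypothetical abbreviation set; Punkt learns such a set from training text.
ABBREVIATIONS = {"dr", "mr", "e.g", "etc"}

# A candidate break: a word followed by '.', '!' or '?' and then whitespace.
CANDIDATE = re.compile(r"(\S+)([.!?])\s+")

def split_on_candidates(text):
    """Yield sentences, examining only candidate positions for a break."""
    start = 0
    for match in CANDIDATE.finditer(text):
        word, mark = match.group(1), match.group(2)
        # A period after a known abbreviation is not treated as a break.
        if mark == "." and word.lower() in ABBREVIATIONS:
            continue
        yield text[start:match.end(2)]
        start = match.end()
    # Whatever remains after the last accepted break is the final sentence.
    if start < len(text):
        yield text[start:]

print(list(split_on_candidates("Dr. Smith arrived. He was late! Was he?")))
# ['Dr. Smith arrived.', 'He was late!', 'Was he?']
```

Text between candidates is never inspected token by token, which is the difference from the legacy method that annotates every token.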

_realign_boundaries(self, sents)

Attempts to realign punctuation that falls after the period but should otherwise be included in the same sentence.

For example: "(Sent1.) Sent2." will otherwise be split as:

   ["(Sent1.", ") Sent2."].

This method will produce:

   ["(Sent1.)", "Sent2."].
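
The realignment described above can be sketched with the standard library alone. This is a simplified illustration, not the NLTK code: the helper name realign and the choice of closing characters (only brackets, to avoid ambiguity with opening quotation marks) are assumptions.

```python
import re

# Assumed set of closing characters; restricted to brackets for simplicity.
CLOSERS = re.compile(r"^[)\]}]+")

def realign(sentences):
    """Move closing brackets that begin a sentence onto the previous sentence."""
    result = []
    for sent in sentences:
        match = CLOSERS.match(sent)
        if match and result:
            # Attach the stray closers to the sentence they belong to.
            result[-1] += match.group(0)
            sent = sent[match.end():].lstrip()
        if sent:
            result.append(sent)
    return result

print(realign(["(Sent1.", ") Sent2."]))
# ['(Sent1.)', 'Sent2.']
```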

sentences_from_text_legacy(self, text)

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.
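
For contrast, an annotate-every-token approach might look like the following sketch. It is a rough illustration under stated assumptions rather than the library's method: the function names, the whitespace tokenization, and the fixed ABBREVIATIONS set are hypothetical, and joining tokens with single spaces normalizes the original whitespace.

```python
# Hypothetical abbreviation set, standing in for learned Punkt parameters.
ABBREVIATIONS = {"dr", "mr", "etc"}

def annotate_tokens(tokens):
    """Pair every token with a flag saying whether a sentence ends after it."""
    for tok in tokens:
        ends = tok[-1] in ".!?" and tok[:-1].lower() not in ABBREVIATIONS
        yield tok, ends

def sentences_from_annotated(text):
    """Build sentences by annotating all whitespace-separated tokens."""
    sentence = []
    for tok, ends in annotate_tokens(text.split()):
        sentence.append(tok)
        if ends:
            yield " ".join(sentence)
            sentence = []
    # Emit any trailing tokens that were not closed by a sentence break.
    if sentence:
        yield " ".join(sentence)

print(list(sentences_from_annotated("Dr. Smith arrived. He was late! Was he?")))
# ['Dr. Smith arrived.', 'He was late!', 'Was he?']
```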