A sentence tokenizer which uses an unsupervised algorithm to build a
model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries. This
approach has been shown to work well for many European languages.
__init__(self, train_text=None, verbose=False)
    train_text can either be the sole training text for this sentence
    boundary detector, or can be a PunktParameters object.
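As an illustration (not taken from this page), a minimal construction sketch
assuming the documented class is NLTK's PunktSentenceTokenizer from
nltk.tokenize.punkt; the corpus path is a placeholder:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    # Option 1: supply raw training text directly to the constructor.
    raw_text = open('corpus.txt').read()      # placeholder path
    tokenizer = PunktSentenceTokenizer(train_text=raw_text)

    # Option 2: supply a precomputed PunktParameters object instead of text.
    params = PunktParameters()
    params.abbrev_types.add('dr')             # mark "Dr." as a known abbreviation
    tokenizer = PunktSentenceTokenizer(params)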
train(self, train_text, verbose=False)
    Derives parameters from a given training text, or uses the parameters
    given.
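Under the same PunktSentenceTokenizer assumption, a sketch of training after
construction; the short string stands in for a larger corpus:

    tokenizer = PunktSentenceTokenizer()
    tokenizer.train("A long stretch of raw, unannotated text drawn from the "
                    "target domain. The algorithm is unsupervised, so no "
                    "labelled sentence boundaries are needed.", verbose=False)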
Inherited from api.TokenizerI: batch_tokenize
Inherited from object: __delattr__, __getattribute__, __hash__, __new__,
    __reduce__, __reduce_ex__, __repr__, __setattr__, __str__
tokenize(self, text, realign_boundaries=False)
    Given a text, returns a list of the sentences in that text.
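A short usage sketch, continuing the assumption above; the sample text is
illustrative only:

    text = "Mr. Smith arrived at 9 a.m. on Tuesday. He left an hour later."
    for sentence in tokenizer.tokenize(text, realign_boundaries=False):
        print(sentence)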
sentences_from_text(self, text, realign_boundaries=False)
    Given a text, generates the sentences in that text by only testing
    candidate sentence breaks.
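A sketch of the same text processed through sentences_from_text; iterating
over the result works whether it is returned as a list or generated lazily:

    for sentence in tokenizer.sentences_from_text(text, realign_boundaries=False):
        print(sentence)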
text_contains_sentbreak(self, text)
    Returns True if the given text includes a sentence break.
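A quick check sketch under the same assumptions; the sample string is
illustrative only:

    if tokenizer.text_contains_sentbreak("Dr. Watson came in. He sat down."):
        print("the text spans more than one sentence")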
sentences_from_tokens(self, tokens)
    Given a sequence of tokens, generates lists of tokens, each list
    corresponding to a sentence.
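A sketch of the token-level interface, assuming the input is a flat sequence
of word-level token strings:

    tokens = ['Mr.', 'Smith', 'arrived', 'yesterday', '.', 'He', 'left', 'today', '.']
    for sentence_tokens in tokenizer.sentences_from_tokens(tokens):
        print(sentence_tokens)       # one list of tokens per sentence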
_annotate_tokens(self, tokens)
    Given a set of tokens augmented with markers for line-start and
    paragraph-start, returns an iterator through those tokens with full
    annotation including predicted sentence breaks.
_build_sentence_list(self, text, tokens)
    Given the original text and the list of augmented word tokens,
    construct and return a tokenized list of sentence strings.
_annotate_second_pass(self, tokens)
    Performs a token-based classification (section 4) over the given
    tokens, making use of the orthographic heuristic (4.1.1), collocation
    heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
_second_pass_annotation(self, aug_tok1, aug_tok2)
    Performs token-based classification over a pair of contiguous tokens,
    returning an updated augmented token for the first of them.
_ortho_heuristic(self, aug_tok)
    Decide whether the given token is the first token in a sentence.