nltk.tokenize.punkt.PunktSentenceTokenizer

Class PunktSentenceTokenizer


     object --+    
              |    
_PunktBaseClass --+
                  |
     object --+   |
              |   |
 api.TokenizerI --+
                  |
                 PunktSentenceTokenizer

A sentence tokenizer that uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
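
A minimal usage sketch, assuming raw_training_text is a reasonably large body of plain text in the target language (the exact splits depend on what the model learns from that text):

    >>> from nltk.tokenize.punkt import PunktSentenceTokenizer
    >>> tokenizer = PunktSentenceTokenizer(raw_training_text)
    >>> tokenizer.tokenize('Hello there. This is a test.')
    ['Hello there.', 'This is a test.']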

Nested Classes

Inherited from _PunktBaseClass (private): _Token

Instance Methods
 
__init__(self, train_text=None, verbose=False)
train_text can be either the sole training text for this sentence boundary detector or a PunktParameters object.
 
train(self, train_text, verbose=False)
Derives parameters from a given training text, or uses the parameters given.

Inherited from api.TokenizerI: batch_tokenize

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

    Tokenization
 
tokenize(self, text, realign_boundaries=False)
Given a text, returns a list of the sentences in that text.
 
sentences_from_text(self, text, realign_boundaries=False)
Given a text, generates the sentences in that text by only testing candidate sentence breaks.
 
_sentences_from_text(self, text)
 
text_contains_sentbreak(self, text)
Returns True if the given text includes a sentence break.
 
sentences_from_text_legacy(self, text)
Given a text, generates the sentences in that text.
 
sentences_from_tokens(self, tokens)
Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
 
_annotate_tokens(self, tokens)
Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks.
 
_build_sentence_list(self, text, tokens)
Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings.
 
dump(self, tokens)
    Annotation Procedures
 
_annotate_second_pass(self, tokens)
Performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
 
_second_pass_annotation(self, aug_tok1, aug_tok2)
Performs token-based classification over a pair of contiguous tokens, returning an updated augmented token for the first of them.
 
_ortho_heuristic(self, aug_tok)
Decides whether the given token is the first token in a sentence.
    Word tokenization

Inherited from _PunktBaseClass (private): _tokenize_words

Static Methods
    Tokenization
 
_realign_boundaries(sents)
Attempts to realign punctuation that falls after the period but should otherwise be included in the same sentence.
    Helper Functions

Inherited from _PunktBaseClass: pair_iter

Class Variables
    Customization Variables
  PUNCTUATION = (';', ':', ',', '.', '!', '?')
Instance Variables

Inherited from _PunktBaseClass (private): _params

Properties

Inherited from object: __class__

Method Details

__init__(self, train_text=None, verbose=False)
(Constructor)


train_text can be either the sole training text for this sentence boundary detector or a PunktParameters object.

Overrides: _PunktBaseClass.__init__
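
For example, a sketch of the second form, seeding the tokenizer with a pre-built PunktParameters object instead of raw training text (the abbreviation list below is purely illustrative):

    >>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
    >>> params = PunktParameters()
    >>> params.abbrev_types = set(['dr', 'mr', 'mrs', 'vs'])
    >>> tokenizer = PunktSentenceTokenizer(params)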

train(self, train_text, verbose=False)


Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.
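
A sketch of the incremental route mentioned above, assuming text_part1 and text_part2 are placeholder chunks of training text and that the PunktTrainer interface provides train, finalize_training and get_params as in recent NLTK releases:

    >>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
    >>> trainer = PunktTrainer()
    >>> trainer.train(text_part1, finalize=False)
    >>> trainer.train(text_part2, finalize=False)
    >>> trainer.finalize_training()
    >>> tokenizer = PunktSentenceTokenizer(trainer.get_params())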

tokenize(self, text, realign_boundaries=False)


Given a text, returns a list of the sentences in that text.

Returns:
list of str
Overrides: api.TokenizerI.tokenize

sentences_from_text(self, text, realign_boundaries=False)


Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence closing punctuation that follows the period.
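
For instance, with a trained tokenizer as in the class example above (list() is used so the sketch reads the same whether the method returns a list or a generator):

    >>> sents = list(tokenizer.sentences_from_text(text, realign_boundaries=True))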

_realign_boundaries(sents)
Static Method


Attempts to realign punctuation that falls after the period but should otherwise be included in the same sentence.

For example: "(Sent1.) Sent2." will otherwise be split as:

   ["(Sent1.", ") Sent1."].

This method will produce:

   ["(Sent1.)", "Sent2."].

sentences_from_text_legacy(self, text)


Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.