Package nltk :: Package tokenize :: Module punkt :: Class _PunktBaseClass

Class _PunktBaseClass

object --+
         |
        _PunktBaseClass

Known Subclasses:

Includes common components of PunktTrainer and PunktSentenceTokenizer.

Nested Classes

_Token
The token definition that should be used by this class.

Instance Methods

__init__(self)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Word tokenization

_tokenize_words(self, plaintext)
Divide the given text into tokens using the punkt word segmentation regular expression, and generate the resulting tokens, each augmented into a three-tuple with two boolean values that indicate whether the token occurs at the start of a paragraph or at the start of a new line, respectively.
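
The augmentation described above can be illustrated with a minimal sketch. It is not the library's source: the name tokenize_words_sketch is hypothetical, whitespace splitting stands in for the punkt word segmentation regular expression, and it assumes that only a blank line marks a paragraph break (so the very first token of the text is not flagged as a paragraph start).

```python
def tokenize_words_sketch(plaintext):
    """Yield (token, parastart, linestart) three-tuples.

    Hypothetical illustration only: str.split() replaces the punkt
    word segmentation regular expression, and a paragraph start is
    assumed to be the first token after one or more blank lines.
    """
    parastart = False
    for line in plaintext.split("\n"):
        words = line.split()
        if not words:
            # A blank line: the next token seen starts a new paragraph.
            parastart = True
            continue
        # The first token on a line always starts that line.
        yield (words[0], parastart, True)
        parastart = False
        for word in words[1:]:
            yield (word, False, False)


tokens = list(tokenize_words_sketch("Hello world.\n\nNew para\nnext line"))
```

Here tokens contains ("New", True, True) for the word that follows the blank line and ("next", False, True) for the word that only starts a new line.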

Annotation Procedures

_annotate_first_pass(self, tokens)
Perform the first pass of annotation, which makes decisions based purely on the word type of each word.

_first_pass_annotation(self, aug_tok)
Performs type-based annotation on a single token.

Static Methods

Helper Functions

pair_iter(it)
Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple.

Instance Variables

_params
The collection of parameters that determines the behavior of the punkt tokenizer.

Properties

Inherited from object: __class__

Method Details

__init__(self)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Overrides: object.__init__: (inherited documentation)

pair_iter(it)
Static Method

Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
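
The pairing behavior can be sketched as a small generator. This is an illustrative version written from the description above, not the module's source; the name pair_iter_sketch is hypothetical, and yielding nothing for an empty input is an assumption of the sketch.

```python
def pair_iter_sketch(it):
    """Yield (current, next) pairs; the last pair is (last, None).

    Hypothetical illustration of the documented behavior. An empty
    input yields no pairs (an assumption, not stated in the docs).
    """
    it = iter(it)
    try:
        prev = next(it)
    except StopIteration:
        return
    for el in it:
        yield (prev, el)
        prev = el
    # Every input element appears once as the first element of a pair.
    yield (prev, None)


pairs = list(pair_iter_sketch(["a", "b", "c"]))
# pairs == [("a", "b"), ("b", "c"), ("c", None)]
```

Pairing each token with its successor lets a caller inspect the following token while processing the current one, without indexing into a list.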

_annotate_first_pass(self, tokens)

Perform the first pass of annotation, which makes decisions based purely on the word type of each word:

'?', '!', and '.' are marked as sentence breaks.
sequences of two or more periods are marked as ellipsis.
any word ending in '.' that's a known abbreviation is marked as an abbreviation.
any other word ending in '.' is marked as a sentence break.

Return these annotations as a tuple of three sets:

sentbreak_toks: The indices of all sentence breaks.
abbrev_toks: The indices of all abbreviations.
ellipsis_toks: The indices of all ellipsis marks.
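
The four rules and the three returned sets can be sketched as follows. This is a simplified illustration, not the method's implementation: the function name, the plain list of string tokens, and the abbreviations argument (a set of lowercase abbreviations without the trailing period) are all assumptions made for the example.

```python
import re

# Two or more periods and nothing else.
_ELLIPSIS_RE = re.compile(r"\.{2,}")


def annotate_first_pass_sketch(tokens, abbreviations):
    """Return (sentbreak_toks, abbrev_toks, ellipsis_toks) as index sets.

    Hypothetical illustration: `tokens` is a list of strings and
    `abbreviations` holds lowercase abbreviations without the final '.'.
    """
    sentbreak_toks, abbrev_toks, ellipsis_toks = set(), set(), set()
    for i, tok in enumerate(tokens):
        if tok in ("?", "!", "."):
            sentbreak_toks.add(i)
        elif _ELLIPSIS_RE.fullmatch(tok):
            ellipsis_toks.add(i)
        elif tok.endswith("."):
            if tok[:-1].lower() in abbreviations:
                abbrev_toks.add(i)
            else:
                # Any other word ending in '.' is a sentence break.
                sentbreak_toks.add(i)
    return sentbreak_toks, abbrev_toks, ellipsis_toks


toks = ["Dr.", "Smith", "arrived", "...", "late", ".", "Why", "?", "etc."]
breaks, abbrevs, ellipses = annotate_first_pass_sketch(toks, {"dr"})
# breaks == {5, 7, 8}, abbrevs == {0}, ellipses == {3}
```

In this example "etc." is marked as a sentence break only because it is absent from the assumed abbreviation set, which shows how the outcome depends on the known abbreviations.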
