Package nltk :: Package tokenize :: Module punkt :: Class _PunktBaseClass

Class _PunktBaseClass

object --+
         |
        _PunktBaseClass
Known Subclasses:

Includes common components of PunktTrainer and PunktSentenceTokenizer.

Nested Classes
  _Token
The token definition that should be used by this class.
Instance Methods
 
__init__(self)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

    Word tokenization
 
_tokenize_words(self, plaintext)
Divide the given text into tokens using the punkt word segmentation regular expression, and generate the resulting tokens as three-tuples: the token itself plus two boolean values indicating whether it occurs at the start of a paragraph or at the start of a new line, respectively.
    Annotation Procedures
 
_annotate_first_pass(self, tokens)
Perform the first pass of annotation, which makes decisions based purely on the word type of each word.
 
_first_pass_annotation(self, aug_tok)
Performs type-based annotation on a single token.
Static Methods
    Helper Functions
 
pair_iter(it)
Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple.
Instance Variables
  _params
The collection of parameters that determines the behavior of the punkt tokenizer.
Properties

Inherited from object: __class__

Method Details

__init__(self)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Overrides: object.__init__
(inherited documentation)

pair_iter(it)
Static Method

Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
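
The pairing behavior described above can be illustrated with a minimal standalone sketch. This is not the library's source; it is a plain generator written to match the description, and its handling of an empty input (yielding nothing) is an assumption.

```python
def pair_iter(it):
    """Yield (current, next) pairs; the last pair has None as its second element."""
    it = iter(it)
    try:
        prev = next(it)
    except StopIteration:
        # Assumption: an empty input produces no pairs at all.
        return
    for el in it:
        yield (prev, el)
        prev = el
    # The final token is paired with None so it still appears first in a tuple.
    yield (prev, None)


pairs = list(pair_iter(["Mr.", "Smith", "left", "."]))
# pairs == [("Mr.", "Smith"), ("Smith", "left"), ("left", "."), (".", None)]
```

Pairing each token with its successor lets later annotation steps look one token ahead without any index arithmetic.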

_annotate_first_pass(self, tokens)

Perform the first pass of annotation, which makes decisions based purely on the word type of each word:

  • '?', '!', and '.' are marked as sentence breaks.
  • sequences of two or more periods are marked as ellipses.
  • any word ending in '.' that's a known abbreviation is marked as an abbreviation.
  • any other word ending in '.' is marked as a sentence break.

Return these annotations as a tuple of three sets:

  • sentbreak_toks: The indices of all sentence breaks.
  • abbrev_toks: The indices of all abbreviations.
  • ellipsis_toks: The indices of all ellipsis marks.
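
The rules and return value above can be sketched as a standalone function. This is a simplified illustration, not the method's actual implementation: the name `annotate_first_pass_sketch`, the `known_abbrevs` parameter, and the convention of storing abbreviations in lowercase without the trailing period are all assumptions made for the example.

```python
import re


def annotate_first_pass_sketch(tokens, known_abbrevs):
    """Apply the four type-based rules and return three sets of token indices."""
    sentbreak_toks, abbrev_toks, ellipsis_toks = set(), set(), set()
    for i, tok in enumerate(tokens):
        if tok in ("?", "!", "."):
            # Bare sentence-final punctuation is a sentence break.
            sentbreak_toks.add(i)
        elif re.fullmatch(r"\.{2,}", tok):
            # Two or more periods form an ellipsis.
            ellipsis_toks.add(i)
        elif tok.endswith("."):
            # Assumption: abbreviations are stored lowercase, without the period.
            if tok[:-1].lower() in known_abbrevs:
                abbrev_toks.add(i)
            else:
                sentbreak_toks.add(i)
    return sentbreak_toks, abbrev_toks, ellipsis_toks


tokens = ["Mr.", "Smith", "left", "...", "He", "returned.", "Why", "?"]
sentbreaks, abbrevs, ellipses = annotate_first_pass_sketch(tokens, {"mr"})
# sentbreaks == {5, 7}, abbrevs == {0}, ellipses == {3}
```

Because this pass looks only at each token in isolation, a word such as "returned." is always treated as a sentence break here, whatever follows it.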