Module nltk.tokenize.punkt

The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):

 Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
   Boundary Detection.  Computational Linguistics 32: 485-525.
Classes
    Punkt Word Tokenizer
    Punkt Parameters
Stores data used to perform sentence boundary detection with punkt.
Stores a token of text with annotations produced during sentence boundary detection.
    Punkt base class
Includes common components of PunktTrainer and PunktSentenceTokenizer.
    Punkt Trainer
Learns parameters used in Punkt sentence boundary detection.
    Punkt Sentence Tokenizer
A sentence tokenizer that uses an unsupervised algorithm to build a model of abbreviations, collocations, and sentence-starting words, and then uses that model to find sentence boundaries.
Functions
    Punkt Word Tokenizer
Tokenize a string using the rules from the Punkt word tokenizer.
Variables
    Orthographic Context Constants
  _ORTHO_BEG_UC = 2
Orthographic context: beginning of a sentence with upper case.
  _ORTHO_MID_UC = 4
Orthographic context: middle of a sentence with upper case.
  _ORTHO_UNK_UC = 8
Orthographic context: unknown position in a sentence with upper case.
  _ORTHO_BEG_LC = 16
Orthographic context: beginning of a sentence with lower case.
  _ORTHO_MID_LC = 32
Orthographic context: middle of a sentence with lower case.
  _ORTHO_UNK_LC = 64
Orthographic context: unknown position in a sentence with lower case.
  _ORTHO_UC = 14
Orthographic context: occurs with upper case.
  _ORTHO_LC = 112
Orthographic context: occurs with lower case.
  _ORTHO_MAP = {('initial', 'lower'): 16, ('initial', 'upper'): ...
A map from context position and first-letter case to the appropriate orthographic context flag.
    Regular expressions for annotation
  _RE_NON_PUNCT = re.compile(r'(?u)[^\W\d]')
Matches token types that are not merely punctuation.
  _RE_BOUNDARY_REALIGNMENT = re.compile(r'(?m)["\'\)\]\}]+?(?: |...
Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
    Punkt Word Tokenizer
  _punkt_word_tokenize_regexps = [(re.compile(r'(?=[\("`\{\[:;&#...
A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize.
  _punkt_period_context_regexp = re.compile(r'(?ux)\S*([\.\?!])(...
Regular expression to find only contexts that include a possible sentence boundary within a given text.
Variables Details


_ORTHO_MAP

A map from context position and first-letter case to the appropriate orthographic context flag.

{('initial', 'lower'): 16,
 ('initial', 'upper'): 2,
 ('internal', 'lower'): 32,
 ('internal', 'upper'): 4,
 ('unknown', 'lower'): 64,
 ('unknown', 'upper'): 8}


_RE_NON_PUNCT

Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alpha.)
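
To illustrate what this pattern accepts, the sketch below reuses the expression shown in the variable summary, `(?u)[^\W\d]`, which matches any single word character that is not a digit. The helper name `is_non_punct` is hypothetical and only wraps a `search` call.

```python
import re

# Pattern copied from the variable summary: a word character that is not a digit.
_RE_NON_PUNCT = re.compile(r'(?u)[^\W\d]')


def is_non_punct(token_type):
    """Hypothetical helper: True if the type contains at least one such character."""
    return bool(_RE_NON_PUNCT.search(token_type))
```

A bare number such as `42` contains no matching character, which is why numeric types need the `##number##` placeholder to count as non-punctuation.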



_RE_BOUNDARY_REALIGNMENT

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

re.compile(r'(?m)["\'\)\]\}]+?(?: |(?=--)|$)')
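
One way such a pattern can be applied is sketched below: after a naive split, closing quotes or brackets left at the start of a sentence are moved back to the end of the previous one. The function `realign_boundaries` is a hypothetical illustration written for this page and may differ from how the module itself performs the realignment.

```python
import re

# Pattern copied from the value shown above.
_RE_BOUNDARY_REALIGNMENT = re.compile(r'(?m)["\'\)\]\}]+?(?: |(?=--)|$)')


def realign_boundaries(sentences):
    """Hypothetical sketch: move leading closing punctuation to the previous sentence."""
    realigned = []
    for sent in sentences:
        match = _RE_BOUNDARY_REALIGNMENT.match(sent)
        if match and realigned:
            # Attach the stray closing punctuation to the previous sentence
            # and drop it (plus the following space) from this one.
            realigned[-1] += match.group(0).strip()
            sent = sent[match.end():]
        if sent:
            realigned.append(sent)
    return realigned
```

For example, the pieces `He said "Hi.` and `" Then he left.` would be regrouped so that the closing quote ends the first sentence.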


_punkt_word_tokenize_regexps

A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize. The resulting string is split on whitespace.

[(re.compile(r'(?=[\("`\{\[:;&#\*@])(.)'), '\\1 '),
 (re.compile(r'(.)(?=[\?!\)";\}\]\*:@\'])'), '\\1 '),
 (re.compile(r'(?=[\)\}\]])(.)'), '\\1 '),
 (re.compile(r'(.)(?=[\(\{\[])'), '\\1 '),
 (re.compile(r'((^|\s)-)(?=[^-])'), '\\1 '),
 (re.compile(r'([^-])(--+)([^-])'), '\\1 \\2 \\3'),
 (re.compile(r'(\s|^)(,)(?=(\S))'), '\\1\\2 '),
 (re.compile(r'(.)(,)(\s|$)'), '\\1 \\2\\3'),
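
The "apply each pair in sequence, then split on whitespace" procedure can be sketched as follows. The listing above is cut off, so this sketch uses only the eight pairs that are visible; its output may therefore differ from the real function, and the name `sketch_word_tokenize` is hypothetical.

```python
import re

# Only the (regexp, repl) pairs visible in the listing above; the real list is longer.
_VISIBLE_REGEXPS = [
    (re.compile(r'(?=[\("`\{\[:;&#\*@])(.)'), r'\1 '),
    (re.compile(r'(.)(?=[\?!\)";\}\]\*:@\'])'), r'\1 '),
    (re.compile(r'(?=[\)\}\]])(.)'), r'\1 '),
    (re.compile(r'(.)(?=[\(\{\[])'), r'\1 '),
    (re.compile(r'((^|\s)-)(?=[^-])'), r'\1 '),
    (re.compile(r'([^-])(--+)([^-])'), r'\1 \2 \3'),
    (re.compile(r'(\s|^)(,)(?=(\S))'), r'\1\2 '),
    (re.compile(r'(.)(,)(\s|$)'), r'\1 \2\3'),
]


def sketch_word_tokenize(text):
    """Hypothetical sketch: run every substitution in order, then split on whitespace."""
    for regexp, repl in _VISIBLE_REGEXPS:
        text = regexp.sub(repl, text)
    return text.split()


tokens = sketch_word_tokenize("Hello, world (really)!")
```

Each substitution only inserts spaces around punctuation, so the final `split()` is what actually produces the tokens; the order of the pairs matters because later patterns see the spaces added by earlier ones.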


_punkt_period_context_regexp

Regular expression to find only contexts that include a possible sentence boundary within a given text.
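
The module's actual pattern is truncated in the summary, so it is not reproduced here. The sketch below uses a deliberately simplified, hypothetical stand-in (`SIMPLE_PERIOD_CONTEXT`) to show the general idea: scan the text once and keep only the spots where a token ends in `.`, `?` or `!` and is followed by another token, since those are the only places a sentence boundary could occur.

```python
import re

# Hypothetical, simplified stand-in; this is NOT the module's own pattern.
SIMPLE_PERIOD_CONTEXT = re.compile(r"""
    \S*                     # the token leading up to the punctuation
    [.?!]                   # a possible sentence-ending character
    (?=\s+(?P<next>\S+))    # whitespace and the following token, not consumed
""", re.VERBOSE)


def candidate_contexts(text):
    """Return (token, next_token) pairs for every possible boundary in `text`."""
    return [(m.group(0), m.group("next"))
            for m in SIMPLE_PERIOD_CONTEXT.finditer(text)]


contexts = candidate_contexts("Mr. Smith left. He was tired.")
```

Restricting attention to these contexts means the boundary decision (abbreviation or real sentence end) only has to be made for a small number of candidate positions rather than for every token.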
