Module nltk.tokenize.punkt

The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):

 Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
   Boundary Detection.  Computational Linguistics 32: 485-525.
Classes
    Punkt Word Tokenizer
  PunktWordTokenizer
    Punkt Parameters
  PunktParameters
Stores data used to perform sentence boundary detection with punkt.
    PunktToken
  PunktToken
Stores a token of text with annotations produced during sentence boundary detection.
    Punkt base class
  _PunktBaseClass
Includes common components of PunktTrainer and PunktSentenceTokenizer.
    Punkt Trainer
  PunktTrainer
Learns parameters used in Punkt sentence boundary detection.
    Punkt Sentence Tokenizer
  PunktSentenceTokenizer
A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
Functions
    Punkt Word Tokenizer
 
punkt_word_tokenize(s)
Tokenize a string using the rules from the Punkt word tokenizer.
Variables
    Orthographic Context Constants
  _ORTHO_BEG_UC = 2
Orthographic context: beginning of a sentence with upper case.
  _ORTHO_MID_UC = 4
Orthographic context: middle of a sentence with upper case.
  _ORTHO_UNK_UC = 8
Orthographic context: unknown position in a sentence with upper case.
  _ORTHO_BEG_LC = 16
Orthographic context: beginning of a sentence with lower case.
  _ORTHO_MID_LC = 32
Orthographic context: middle of a sentence with lower case.
  _ORTHO_UNK_LC = 64
Orthographic context: unknown position in a sentence with lower case.
  _ORTHO_UC = 14
Orthographic context: occurs with upper case.
  _ORTHO_LC = 112
Orthographic context: occurs with lower case.
  _ORTHO_MAP = {('initial', 'lower'): 16, ('initial', 'upper'): ...
A map from context position and first-letter case to the appropriate orthographic context flag.
    Regular expressions for annotation
  _RE_NON_PUNCT = re.compile(r'(?u)[^\W\d]')
Matches token types that are not merely punctuation.
  _RE_BOUNDARY_REALIGNMENT = re.compile(r'(?m)["\'\)\]\}]+?(?: |...
Used to realign closing punctuation that belongs to a sentence even though it follows the sentence-final period (or ? or !).
    Punkt Word Tokenizer
  _punkt_word_tokenize_regexps = [(re.compile(r'(?=[\("`\{\[:;&#...
A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize.
  _punkt_period_context_regexp = re.compile(r'(?ux)\S*([\.\?!])(...
Regular expression to find only contexts that include a possible sentence boundary within a given text.
Variables Details

_ORTHO_MAP

A map from context position and first-letter case to the appropriate orthographic context flag.

Value:
{('initial', 'lower'): 16,
 ('initial', 'upper'): 2,
 ('internal', 'lower'): 32,
 ('internal', 'upper'): 4,
 ('unknown', 'lower'): 64,
 ('unknown', 'upper'): 8}

_RE_NON_PUNCT

Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alphabetic characters.)

Value:
re.compile(r'(?u)[^\W\d]')
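
The character class [^\W\d] matches any word character that is not a digit, which in practice means a letter or an underscore. A minimal sketch of how the pattern separates token types follows; the helper name is_non_punct is an assumption made for this example.

```python
import re

# Pattern copied from the value shown above.
_RE_NON_PUNCT = re.compile(r'(?u)[^\W\d]')


def is_non_punct(token_type):
    """Hypothetical helper: True if the type contains at least one letter or underscore."""
    return _RE_NON_PUNCT.search(token_type) is not None


for token_type in ['hello', '##number##', '...', '--', '42']:
    print(repr(token_type), is_non_punct(token_type))
```

A bare digit string such as '42' does not match, which is why numeric types need to be rewritten to ##number## before this test is applied.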

_RE_BOUNDARY_REALIGNMENT

Used to realign closing punctuation that belongs to a sentence even though it follows the sentence-final period (or ? or !).

Value:
re.compile(r'(?m)["\'\)\]\}]+?(?: |(?=--)|$)')
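
As an illustration, the pattern can be matched against the start of the text that follows a detected boundary; any closing quotes or brackets it finds are moved back onto the preceding sentence. The function realign below is a simplified sketch written for this page, not the tokenizer's own realignment routine.

```python
import re

# Pattern copied from the value shown above.
_RE_BOUNDARY_REALIGNMENT = re.compile(r'(?m)["\'\)\]\}]+?(?: |(?=--)|$)')


def realign(first, second):
    """Hypothetical helper: move leading closing punctuation of `second` onto `first`."""
    m = _RE_BOUNDARY_REALIGNMENT.match(second)
    if m is None:
        return first, second
    # strip() drops the trailing space that the pattern may have consumed.
    return first + m.group(0).strip(), second[m.end():]


print(realign('(See the appendix.', ') Then continue.'))
print(realign('He left.', 'She stayed.'))
```

The first call returns the closing parenthesis attached to the first sentence; the second call returns its arguments unchanged because the following text does not start with closing punctuation.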

_punkt_word_tokenize_regexps

A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize. The resulting string is split on whitespace.

Value:
[(re.compile(r'(?=[\("`\{\[:;&#\*@])(.)'), '\\1 '),
 (re.compile(r'(.)(?=[\?!\)";\}\]\*:@\'])'), '\\1 '),
 (re.compile(r'(?=[\)\}\]])(.)'), '\\1 '),
 (re.compile(r'(.)(?=[\(\{\[])'), '\\1 '),
 (re.compile(r'((^|\s)-)(?=[^-])'), '\\1 '),
 (re.compile(r'([^-])(--+)([^-])'), '\\1 \\2 \\3'),
 (re.compile(r'(\s|^)(,)(?=(\S))'), '\\1\\2 '),
 (re.compile(r'(.)(,)(\s|$)'), '\\1 \\2\\3'),
...
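
The way the pairs are applied can be sketched as follows: each regexp is substituted in order, and the result is split on whitespace. The value above is truncated, so this sketch uses only the eight pairs that are shown; the remaining rules are not reproduced, and the full tokenizer may therefore produce different output on other inputs. The names _VISIBLE_RULES and apply_visible_rules are assumptions made for this example.

```python
import re

# Only the eight (regexp, repl) pairs visible in the truncated value above.
_VISIBLE_RULES = [
    (re.compile(r'(?=[\("`\{\[:;&#\*@])(.)'), r'\1 '),
    (re.compile(r'(.)(?=[\?!\)";\}\]\*:@\'])'), r'\1 '),
    (re.compile(r'(?=[\)\}\]])(.)'), r'\1 '),
    (re.compile(r'(.)(?=[\(\{\[])'), r'\1 '),
    (re.compile(r'((^|\s)-)(?=[^-])'), r'\1 '),
    (re.compile(r'([^-])(--+)([^-])'), r'\1 \2 \3'),
    (re.compile(r'(\s|^)(,)(?=(\S))'), r'\1\2 '),
    (re.compile(r'(.)(,)(\s|$)'), r'\1 \2\3'),
]


def apply_visible_rules(s):
    """Apply each substitution in sequence, then split on whitespace."""
    for regexp, repl in _VISIBLE_RULES:
        s = regexp.sub(repl, s)
    return s.split()


print(apply_visible_rules('Hello (world), ok!'))
print(apply_visible_rules('yes--no'))
```

Each rule only inserts spaces around punctuation, so the order matters: later rules see the spaces that earlier rules added.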

_punkt_period_context_regexp

Regular expression to find only contexts that include a possible sentence boundary within a given text.

Value:
re.compile(r'(?ux)\S*([\.\?!])(?:([\?!\)";\}\]\*:@\'\(\{\[])|\s+(\S+))')
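
A short sketch of what this pattern finds: group 1 holds the candidate sentence-ending character, group 2 holds punctuation that follows it directly, and group 3 holds the next token when whitespace follows. The pattern is copied from the value above; the helper period_contexts and the sample text are assumptions made for this example.

```python
import re

# Pattern copied from the value shown above, joined onto one line.
_punkt_period_context_regexp = re.compile(
    r'(?ux)\S*([\.\?!])(?:([\?!\)";\}\]\*:@\'\(\{\[])|\s+(\S+))'
)


def period_contexts(text):
    """Hypothetical helper: list (matched context, ending char, next token) triples."""
    return [
        (m.group(0), m.group(1), m.group(3))
        for m in _punkt_period_context_regexp.finditer(text)
    ]


for context in period_contexts('Mr. Smith left. He returned.'):
    print(context)
```

Both 'Mr. Smith' and 'left. He' are reported as candidates; deciding that only the second is a real sentence boundary is the job of the trained parameters, not of this regular expression. The final period is not reported because nothing follows it.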