Module nltk.tokenize.punkt

The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):

 Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
   Boundary Detection.  Computational Linguistics 32: 485-525.

Classes

Punkt Word Tokenizer

PunktWordTokenizer

Punkt Parameters

PunktParameters
Stores data used to perform sentence boundary detection with punkt.

PunktToken

PunktToken
Stores a token of text with annotations produced during sentence boundary detection.

Punkt base class

_PunktBaseClass
Includes common components of PunktTrainer and PunktSentenceTokenizer.

Punkt Trainer

PunktTrainer
Learns parameters used in Punkt sentence boundary detection.

Punkt Sentence Tokenizer

PunktSentenceTokenizer
A sentence tokenizer that uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and then uses that model to find sentence boundaries.

Functions

Punkt Word Tokenizer

punkt_word_tokenize(s)
Tokenize a string using the rules from the Punkt word tokenizer.

Variables

Orthographic Context Constants

_ORTHO_BEG_UC = 2
Orthographic context: beginning of a sentence with upper case.

_ORTHO_MID_UC = 4
Orthographic context: middle of a sentence with upper case.

_ORTHO_UNK_UC = 8
Orthographic context: unknown position in a sentence with upper case.

_ORTHO_BEG_LC = 16
Orthographic context: beginning of a sentence with lower case.

_ORTHO_MID_LC = 32
Orthographic context: middle of a sentence with lower case.

_ORTHO_UNK_LC = 64
Orthographic context: unknown position in a sentence with lower case.

_ORTHO_UC = 14
Orthographic context: occurs with upper case.

_ORTHO_LC = 112
Orthographic context: occurs with lower case.

_ORTHO_MAP = {('initial', 'lower'): 16, ('initial', 'upper'): ...
A map from context position and first-letter case to the appropriate orthographic context flag.
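
The two combined flags are bitwise ORs of the three position-specific flags for each case, which is why _ORTHO_UC is 14 (2 | 4 | 8) and _ORTHO_LC is 112 (16 | 32 | 64). The sketch below shows how such flags can be combined and tested. The constant names mirror the ones listed above without the leading underscore, and the `seen` value is a hypothetical example rather than anything taken from the module.

```python
# Position-specific flags, copied from the values listed above.
ORTHO_BEG_UC = 2
ORTHO_MID_UC = 4
ORTHO_UNK_UC = 8
ORTHO_BEG_LC = 16
ORTHO_MID_LC = 32
ORTHO_UNK_LC = 64

# The combined flags are the bitwise OR of the three positions for each case.
ORTHO_UC = ORTHO_BEG_UC | ORTHO_MID_UC | ORTHO_UNK_UC
ORTHO_LC = ORTHO_BEG_LC | ORTHO_MID_LC | ORTHO_UNK_LC

# Hypothetical record for one word type: seen capitalized at the start of a
# sentence and in lower case in the middle of a sentence.
seen = ORTHO_BEG_UC | ORTHO_MID_LC

# Masking with a combined flag answers "was this ever seen in upper case?"
has_upper = bool(seen & ORTHO_UC)
# Masking with a single flag answers a position-specific question.
has_mid_upper = bool(seen & ORTHO_MID_UC)
```

Because each flag occupies its own bit, any set of observed contexts fits in a single integer per word type.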

Regular expressions for annotation

_RE_NON_PUNCT = re.compile(r'(?u)[^\W\d]')
Matches token types that are not merely punctuation.

_RE_BOUNDARY_REALIGNMENT = re.compile(r'(?m)["\'\)\]\}]+?(?: |...
Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

Punkt Word Tokenizer

_punkt_word_tokenize_regexps = [(re.compile(r'(?=[\("`\{\[:;&#...
A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize.

_punkt_period_context_regexp = re.compile(r'(?ux)\S*([\.\?!])(...
Regular expression to find only contexts that include a possible sentence boundary within a given text.

Variables Details

_ORTHO_MAP

A map from context position and first-letter case to the appropriate orthographic context flag.

Value:

{('initial', 'lower'): 16,
 ('initial', 'upper'): 2,
 ('internal', 'lower'): 32,
 ('internal', 'upper'): 4,
 ('unknown', 'lower'): 64,
 ('unknown', 'upper'): 8}
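
As an illustration of how a map like this can be used, the sketch below accumulates one integer of orthographic flags per word type by OR-ing in the flag for each observation. The `observations` list and the `ortho_context` dictionary are hypothetical names introduced here; this page does not show how the module itself stores this data.

```python
from collections import defaultdict

# Same keys and values as the _ORTHO_MAP value shown above.
ORTHO_MAP = {
    ("initial", "lower"): 16,
    ("initial", "upper"): 2,
    ("internal", "lower"): 32,
    ("internal", "upper"): 4,
    ("unknown", "lower"): 64,
    ("unknown", "upper"): 8,
}

# Hypothetical observations: (word type, position in sentence, first-letter case).
observations = [
    ("the", "initial", "upper"),
    ("the", "internal", "lower"),
    ("london", "internal", "upper"),
]

# Accumulate the flags for each word type with a bitwise OR.
ortho_context = defaultdict(int)
for word_type, position, case in observations:
    ortho_context[word_type] |= ORTHO_MAP[(position, case)]
```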

_RE_NON_PUNCT

Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alphabetic characters.)

Value:

re.compile(r'(?u)[^\W\d]')
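
The character class [^\W\d] matches any word character that is not a digit, that is, a letter or an underscore. A short sketch of the resulting behavior follows; `is_non_punct` is a hypothetical helper name used only for this illustration.

```python
import re

# Same pattern as the _RE_NON_PUNCT value shown above.
RE_NON_PUNCT = re.compile(r'(?u)[^\W\d]')

def is_non_punct(token_type):
    """Return True if the type contains at least one letter or underscore."""
    return RE_NON_PUNCT.search(token_type) is not None

results = {
    "hello": is_non_punct("hello"),            # contains letters
    "...": is_non_punct("..."),                # punctuation only
    "##number##": is_non_punct("##number##"),  # placeholder type for numbers
    "42": is_non_punct("42"),                  # raw digits contain no letters
}
```

A raw digit string on its own does not match, which is why the placeholder type ##number## matters: it contains letters and is therefore treated as non-punctuation.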

_RE_BOUNDARY_REALIGNMENT

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

Value:

re.compile(r'(?m)["\'\)\]\}]+?(?: |(?=--)|$)')
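
The pattern matches a run of closing quotes or brackets followed by a space, a double hyphen, or the end of a line. One way to use it, sketched below under the assumption that the text has already been split into candidate sentences, is to move such punctuation from the start of a sentence to the end of the previous one. The `realign_boundaries` helper is a simplified illustration written for this page, not the module's own implementation.

```python
import re

# Same pattern as the _RE_BOUNDARY_REALIGNMENT value shown above.
RE_BOUNDARY_REALIGNMENT = re.compile(r'(?m)["\'\)\]\}]+?(?: |(?=--)|$)')

def realign_boundaries(sentences):
    """Attach leading closing punctuation to the preceding sentence."""
    result = []
    for sent in sentences:
        m = RE_BOUNDARY_REALIGNMENT.match(sent)
        if m and result:
            # Move the matched punctuation (without the trailing space)
            # onto the previous sentence and drop it from this one.
            result[-1] += m.group(0).strip()
            sent = sent[m.end():]
        if sent:
            result.append(sent)
    return result

brackets = realign_boundaries(['(He left.', ') Then she came.'])
quotes = realign_boundaries(['"Stop!', '" he said.'])
```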

_punkt_word_tokenize_regexps

A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize. The resulting string is split on whitespace.

Value:

[(re.compile(r'(?=[\("`\{\[:;&#\*@])(.)'), '\\1 '),
 (re.compile(r'(.)(?=[\?!\)";\}\]\*:@\'])'), '\\1 '),
 (re.compile(r'(?=[\)\}\]])(.)'), '\\1 '),
 (re.compile(r'(.)(?=[\(\{\[])'), '\\1 '),
 (re.compile(r'((^|\s)-)(?=[^-])'), '\\1 '),
 (re.compile(r'([^-])(--+)([^-])'), '\\1 \\2 \\3'),
 (re.compile(r'(\s|^)(,)(?=(\S))'), '\\1\\2 '),
 (re.compile(r'(.)(,)(\s|$)'), '\\1 \\2\\3'),
...
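
The mechanism described above (apply each substitution in order, then split on whitespace) can be sketched as follows. Note that the list on this page is truncated, so this sketch applies only the eight visible pairs and its output may differ from what punkt_word_tokenize produces with the full list. The `apply_pairs` helper is a hypothetical name.

```python
import re

# Only the eight (regexp, repl) pairs visible above; the full list is longer.
VISIBLE_PAIRS = [
    (re.compile(r'(?=[\("`\{\[:;&#\*@])(.)'), '\\1 '),
    (re.compile(r'(.)(?=[\?!\)";\}\]\*:@\'])'), '\\1 '),
    (re.compile(r'(?=[\)\}\]])(.)'), '\\1 '),
    (re.compile(r'(.)(?=[\(\{\[])'), '\\1 '),
    (re.compile(r'((^|\s)-)(?=[^-])'), '\\1 '),
    (re.compile(r'([^-])(--+)([^-])'), '\\1 \\2 \\3'),
    (re.compile(r'(\s|^)(,)(?=(\S))'), '\\1\\2 '),
    (re.compile(r'(.)(,)(\s|$)'), '\\1 \\2\\3'),
]

def apply_pairs(s, pairs=VISIBLE_PAIRS):
    """Apply each substitution in sequence, then split on whitespace."""
    for regexp, repl in pairs:
        s = regexp.sub(repl, s)
    return s.split()

tokens = apply_pairs("Hello (world), ok!")
```

Each pair only inserts spaces around punctuation, so repeated spaces are harmless: the final whitespace split collapses them.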

_punkt_period_context_regexp

Regular expression to find only contexts that include a possible sentence boundary within a given text.

Value:

re.compile(r'(?ux)\S*([\.\?!])(?:([\?!\)";\}\]\*:@\'\(\{\[])|\s+(\S+))')
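
To illustrate what counts as a context, the sketch below runs the pattern over a short sample text. Group 1 is the candidate sentence-ending character, group 2 is a punctuation character that immediately follows it, and group 3 is the next token when whitespace follows instead. The sample text and variable names are assumptions made for this example.

```python
import re

# Same pattern as the _punkt_period_context_regexp value shown above.
PERIOD_CONTEXT_RE = re.compile(
    r'(?ux)\S*([\.\?!])(?:([\?!\)";\}\]\*:@\'\(\{\[])|\s+(\S+))'
)

text = "Mr. Smith left. He came back!"

# Each match covers the word ending in ., ? or ! plus what follows it.
matches = list(PERIOD_CONTEXT_RE.finditer(text))
contexts = [m.group(0) for m in matches]
next_tokens = [m.group(3) for m in matches]
```

The final "back!" is not reported because nothing follows it: the pattern requires either a punctuation character or whitespace plus another token after the candidate boundary.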