The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
|
|||
Punkt Word Tokenizer | |||
---|---|---|---|
PunktWordTokenizer | |||
Punkt Parameters | |||
PunktParameters Stores data used to perform sentence boundary detection with punkt. |
|||
PunktToken | |||
PunktToken Stores a token of text with annotations produced during sentence boundary detection. |
|||
Punkt base class | |||
_PunktBaseClass Includes common components of PunktTrainer and PunktSentenceTokenizer. |
|||
Punkt Trainer | |||
PunktTrainer Learns parameters used in Punkt sentence boundary detection. |
|||
Punkt Sentence Tokenizer | |||
PunktSentenceTokenizer A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. |
|
|||
Orthographic Context Constants | |||
---|---|---|---|
_ORTHO_BEG_UC = 2 Orthographic context: beginning of a sentence with upper case. |
|||
_ORTHO_MID_UC = 4 Orthographic context: middle of a sentence with upper case. |
|||
_ORTHO_UNK_UC = 8 Orthographic context: unknown position in a sentence with upper case. |
|||
_ORTHO_BEG_LC = 16 Orthographic context: beginning of a sentence with lower case. |
|||
_ORTHO_MID_LC = 32 Orthographic context: middle of a sentence with lower case. |
|||
_ORTHO_UNK_LC = 64 Orthographic context: unknown position in a sentence with lower case. |
|||
_ORTHO_UC = 14 Orthographic context: occurs with upper case. |
|||
_ORTHO_LC = 112 Orthographic context: occurs with lower case. |
|||
_ORTHO_MAP =
A map from context position and first-letter case to the appropriate orthographic context flag. |
|||
Regular expressions for annotation | |||
_RE_NON_PUNCT = re.compile(r' Matches token types that are not merely punctuation. |
|||
_RE_BOUNDARY_REALIGNMENT = re.compile(r' Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !). |
|||
Punkt Word Tokenizer | |||
_punkt_word_tokenize_regexps =
A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize. |
|||
_punkt_period_context_regexp = re.compile(r' Regular expression to find only contexts that include a possible sentence boundary within a given text. |
|
_ORTHO_MAP A map from context position and first-letter case to the appropriate orthographic context flag.
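The flag values listed in the constants table are powers of two, so several contexts can be recorded in one integer, and _ORTHO_UC (14) and _ORTHO_LC (112) are the bitwise OR of their three members. The sketch below illustrates this. The flag values come from this page; the map keys ("initial", "internal", "unknown", "upper", "lower") and the helper `record_context` are assumptions for illustration, since the map's value is not shown here.

```python
# Flag values as listed in the constants table on this page.
ORTHO_BEG_UC = 1 << 1   # 2
ORTHO_MID_UC = 1 << 2   # 4
ORTHO_UNK_UC = 1 << 3   # 8
ORTHO_BEG_LC = 1 << 4   # 16
ORTHO_MID_LC = 1 << 5   # 32
ORTHO_UNK_LC = 1 << 6   # 64

# The combined masks are the bitwise OR of the three flags for each case.
ORTHO_UC = ORTHO_BEG_UC | ORTHO_MID_UC | ORTHO_UNK_UC   # 14
ORTHO_LC = ORTHO_BEG_LC | ORTHO_MID_LC | ORTHO_UNK_LC   # 112

# ASSUMPTION: the key names below are illustrative; the page does not
# show the actual contents of _ORTHO_MAP.
ORTHO_MAP = {
    ("initial", "upper"): ORTHO_BEG_UC,
    ("internal", "upper"): ORTHO_MID_UC,
    ("unknown", "upper"): ORTHO_UNK_UC,
    ("initial", "lower"): ORTHO_BEG_LC,
    ("internal", "lower"): ORTHO_MID_LC,
    ("unknown", "lower"): ORTHO_UNK_LC,
}


def record_context(seen, position, case):
    """Return `seen` with the flag for (position, case) OR-ed in.

    Hypothetical helper, not part of the documented module.
    """
    return seen | ORTHO_MAP[(position, case)]


# A word type seen capitalized at a sentence start and in lower case
# in the middle of a sentence.
seen = 0
seen = record_context(seen, "initial", "upper")
seen = record_context(seen, "internal", "lower")

has_upper = bool(seen & ORTHO_UC)          # seen with upper case anywhere?
has_mid_upper = bool(seen & ORTHO_MID_UC)  # seen capitalized mid-sentence?
```

Testing against a mask such as `ORTHO_UC` answers "was this type ever seen with upper case?" without checking each position flag separately.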
|
_RE_NON_PUNCT Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alphabetic characters.)
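The pattern itself is truncated on this page, so the following is only a sketch of the idea: a type counts as non-punctuation if it contains at least one word character that is not a digit. The regex and the helper name `is_non_punct` are assumptions, not the module's definition.

```python
import re

# ASSUMPTION: illustrative pattern; the real _RE_NON_PUNCT is truncated
# on this page. It matches any word character that is not a digit.
NON_PUNCT = re.compile(r"[^\W\d]")


def is_non_punct(token_type):
    """Return True if the token type contains a non-digit word character."""
    return bool(NON_PUNCT.search(token_type))
```

Under this assumed pattern a bare digit string such as "42" would not match, which is why replacing numeric types with ##number## matters: the placeholder contains letters and therefore counts as non-punctuation.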
|
_RE_BOUNDARY_REALIGNMENT Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
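A typical case is a closing quote or bracket that follows the sentence-final period and would otherwise begin the next sentence. The sketch below shows one way such realignment could work. The pattern and the function `realign` are hypothetical, since the actual regular expression is truncated on this page.

```python
import re

# ASSUMPTION: illustrative pattern, not the module's own. It matches closing
# quotes or brackets at the start of a string, followed by whitespace
# or the end of the string.
REALIGN = re.compile(r'["\')\]}]+(?:\s+|$)')


def realign(sentences):
    """Move leading closing punctuation back onto the previous sentence."""
    result = []
    for sent in sentences:
        # Only realign when there is a previous sentence to attach to.
        match = REALIGN.match(sent) if result else None
        if match:
            result[-1] += match.group().rstrip()
            sent = sent[match.end():]
        if sent:
            result.append(sent)
    return result


# A naive split after the period leaves the closing quote on the wrong side.
pieces = ['He said "Stop.', '" Then he left.']
fixed = realign(pieces)
```

An opening quote at the start of a sentence is left alone because it is followed by a word character rather than whitespace.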
|
_punkt_word_tokenize_regexps A list of (regexp, repl) pairs applied in sequence by punkt_word_tokenize. The resulting string is split on whitespace.
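The substitute-then-split approach described here can be sketched as follows. The two (regexp, repl) pairs are hypothetical stand-ins, as the actual list is not shown on this page, and `word_tokenize` is an illustrative name rather than the module's function.

```python
import re

# ASSUMPTION: illustrative pairs only. Each pair pads certain punctuation
# with spaces so that a later whitespace split isolates it as a token.
WORD_TOKENIZE_REGEXPS = [
    (re.compile(r'([,;:"()])'), r" \1 "),  # commas, colons, quotes, parens
    (re.compile(r"([?!])"), r" \1 "),      # question and exclamation marks
]


def word_tokenize(text):
    """Apply each (regexp, repl) pair in order, then split on whitespace."""
    for regexp, repl in WORD_TOKENIZE_REGEXPS:
        text = regexp.sub(repl, text)
    return text.split()


tokens = word_tokenize("Wait, Mr. Smith!")
```

This sketch deliberately leaves periods attached to the preceding word, so a token such as "Mr." stays intact for later sentence boundary decisions.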
|
_punkt_period_context_regexp Regular expression to find only contexts that include a possible sentence boundary within a given text.
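Scanning only for such contexts avoids examining every token in a text. As a rough sketch of the idea, the pattern below finds a token ending in ., ? or ! that is followed by whitespace and another token. It is a simplified assumption, not the expression used by the module (which is truncated on this page), and `candidate_contexts` is a hypothetical helper.

```python
import re

# ASSUMPTION: simplified illustrative pattern. The lookahead captures the
# following token without consuming it, so consecutive candidates such as
# "Smith. He" after "Mr. Smith." are still found.
PERIOD_CONTEXT = re.compile(r"(\S*[.?!])(?=\s+(\S+))")


def candidate_contexts(text):
    """Return (token, next_token) pairs around possible sentence boundaries."""
    return [(m.group(1), m.group(2)) for m in PERIOD_CONTEXT.finditer(text)]


contexts = candidate_contexts("Mr. Smith arrived. He left.")
```

Both "Mr." and "arrived." are reported as candidates; deciding which one is a real boundary is the job of the sentence tokenizer and its learned parameters.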
|
Generated by Epydoc 3.0beta1 on Wed Aug 27 15:08:51 2008 | http://epydoc.sourceforge.net |