Functions for tokenizing, i.e., dividing text strings into substrings.
Classes:

  WhitespaceTokenizer: A tokenizer that divides a string into substrings by treating any sequence of whitespace characters as a separator.

  SpaceTokenizer: A tokenizer that divides a string into substrings by treating any single space character as a separator.

  LineTokenizer: A tokenizer that divides a string into substrings by treating any single newline character as a separator.

  TabTokenizer: A tokenizer that divides a string into substrings by treating any single tab character as a separator.

  BlanklineTokenizer: A tokenizer that divides a string into substrings by treating any sequence of blank lines as a separator.

  WordTokenizer: A tokenizer that divides a text into sequences of alphabetic characters.

  WordPunctTokenizer: A tokenizer that divides a text into sequences of alphabetic and non-alphabetic characters.

  RegexpTokenizer: A tokenizer that splits a string into substrings using a regular expression.

  PunktSentenceTokenizer: A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences, and then uses that model to find sentence boundaries.

  PunktWordTokenizer

  SExprTokenizer: A tokenizer that divides strings into s-expressions. (A short usage sketch for several of these classes follows this list.)
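
As a quick illustration, here is a minimal usage sketch for a few of these classes, assuming a standard NLTK installation where they are importable from nltk.tokenize; the sample text is purely illustrative:

    >>> from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, RegexpTokenizer
    >>> text = "Good muffins cost $3.88\nin New York."   # illustrative sample text
    >>> WhitespaceTokenizer().tokenize(text)             # split on runs of whitespace
    ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
    >>> WordPunctTokenizer().tokenize(text)              # alphabetic vs. non-alphabetic runs
    ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']
    >>> RegexpTokenizer(r'\w+').tokenize(text)           # keep only word-character runs
    ['Good', 'muffins', 'cost', '3', '88', 'in', 'New', 'York']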
Functions:

  regexp_tokenize(text, pattern, ...): Split the given text string based on the given regular expression pattern. See the documentation for RegexpTokenizer.tokenize() for descriptions of the arguments.
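
A minimal sketch of calling this function directly, assuming the entry corresponds to nltk.tokenize.regexp_tokenize (the name is inferred from the reference to RegexpTokenizer.tokenize()):

    >>> from nltk.tokenize import regexp_tokenize
    >>> regexp_tokenize("A dog. Two cats!", pattern=r'\w+')   # match word-character runs
    ['A', 'dog', 'Two', 'cats']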
  sexpr_tokenize(text): Tokenize the text into s-expressions. For example:

    >>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
    ['(a b (c d))', 'e', 'f', '(g)']

  All parentheses are assumed to mark s-expressions. In particular, no special processing is done to exclude parentheses that occur inside strings, or following backslash characters. If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the strict constructor parameter: with the default strict=True, unmatched parentheses raise a ValueError; with strict=False, unmatched close parentheses are returned as separate tokens and a trailing partial s-expression is returned as a single token:

    >>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
    ['c', ')', 'd', ')', 'e', '(f (g']