Package nltk :: Package tokenize

Package tokenize


Functions for tokenizing, i.e., dividing text strings into substrings.

Submodules

Classes
  WhitespaceTokenizer
A tokenizer that divides a string into substrings by treating any sequence of whitespace characters as a separator.
  SpaceTokenizer
A tokenizer that divides a string into substrings by treating any single space character as a separator.
  LineTokenizer
A tokenizer that divides a string into substrings by treating any single newline character as a separator.
  TabTokenizer
A tokenizer that divides a string into substrings by treating any single tab character as a separator.
  BlanklineTokenizer
A tokenizer that divides a string into substrings by treating any sequence of blank lines as a separator.
  WordTokenizer
A tokenizer that divides a text into sequences of alphabetic characters.
  WordPunctTokenizer
A tokenizer that divides a text into sequences of alphabetic and non-alphabetic characters.
  RegexpTokenizer
A tokenizer that splits a string into substrings using a regular expression.
  PunktSentenceTokenizer
A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
  PunktWordTokenizer
  SExprTokenizer
A tokenizer that divides strings into s-expressions.
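Most of the simple tokenizers above are thin wrappers around regular expressions. As a rough illustration only (plain `re`, not the NLTK classes themselves), the whitespace and word/punctuation splits behave like:

```python
import re

text = "Good muffins cost $3.88\nin New York."

# Like WhitespaceTokenizer: any run of whitespace is a separator
# (illustrative sketch, not the NLTK implementation).
whitespace_tokens = re.split(r"\s+", text.strip())

# Like WordPunctTokenizer: alternating runs of alphanumeric and
# non-alphanumeric, non-space characters (illustrative sketch).
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
print(wordpunct_tokens)
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']
```

Note how the word/punctuation split breaks "$3.88" apart while the whitespace split keeps it intact; which behavior is appropriate depends on the application.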
Functions

line_tokenize(text, blanklines='discard')

regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)
    Split the given text string based on the given regular expression pattern.

blankline_tokenize(text)

word_tokenize(text)

wordpunct_tokenize(text)

punkt_word_tokenize(s)
    Tokenize a string using the rules from the Punkt word tokenizer.

sexpr_tokenize(text)
    Tokenize the text into s-expressions.
Function Details

regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)

Split the given text string based on the given regular expression pattern. See the documentation for RegexpTokenizer.tokenize() for descriptions of the arguments.
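The default flags=56 is the numeric value of re.UNICODE | re.MULTILINE | re.DOTALL (32 + 8 + 16). A minimal sketch of the two modes, using plain re rather than the NLTK implementation (with gaps=True the pattern marks separators; with gaps=False the pattern marks the tokens themselves):

```python
import re

def regexp_tokenize_sketch(text, pattern, gaps=False, discard_empty=True,
                           flags=re.UNICODE | re.MULTILINE | re.DOTALL):
    """Illustrative sketch of regexp_tokenize, not the NLTK source."""
    if gaps:
        # Pattern describes the gaps between tokens: split on it.
        tokens = re.split(pattern, text, flags=flags)
        if discard_empty:
            tokens = [t for t in tokens if t]
        return tokens
    # Pattern describes the tokens: return every match.
    return re.findall(pattern, text, flags=flags)

print(regexp_tokenize_sketch("Hello, world!", r"\w+"))
# ['Hello', 'world']
print(regexp_tokenize_sketch("Hello, world!", r"\s+", gaps=True))
# ['Hello,', 'world!']
```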

sexpr_tokenize(text)

Tokenize the text into s-expressions. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. In particular, no special processing is done to exclude parentheses that occur inside strings or that follow backslash characters.

If the given expression contains non-matching parentheses, the tokenizer's behavior depends on the strict parameter to the constructor. If strict is True, a ValueError is raised. If strict is False, any unmatched close parenthesis is listed as its own s-expression, and the final partial s-expression with unmatched open parentheses is listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
Parameters:
  • text (string or iter(string)) - the string to be tokenized
Returns:
An iterator over tokens (each of which is an s-expression)
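The behavior described above can be sketched with a simple parenthesis-depth counter. This is an illustrative reimplementation, not the NLTK source, and it returns a list rather than an iterator:

```python
import re

def sexpr_tokenize_sketch(text, strict=True):
    """Depth-counting sketch of s-expression tokenization.

    Tokens outside parentheses are split on whitespace; a balanced
    parenthesized span is one token. Illustrative only.
    """
    result = []
    pos = 0
    depth = 0
    for m in re.finditer(r"[()]", text):
        if depth == 0:
            # Flush whitespace-separated tokens preceding this paren.
            result.extend(text[pos:m.start()].split())
            pos = m.start()
        if m.group() == "(":
            depth += 1
        else:
            if depth == 0:
                if strict:
                    raise ValueError("Un-matched close paren")
                # Non-strict: a stray ')' becomes its own token.
                result.append(")")
                pos = m.end()
                continue
            depth -= 1
            if depth == 0:
                # A balanced parenthesized span is one s-expression.
                result.append(text[pos:m.end()])
                pos = m.end()
    if depth > 0:
        if strict:
            raise ValueError("Un-matched open paren")
        # Non-strict: the trailing partial s-expression is one token.
        result.append(text[pos:])
    else:
        result.extend(text[pos:].split())
    return result

print(sexpr_tokenize_sketch('(a b (c d)) e f (g)'))
# ['(a b (c d))', 'e', 'f', '(g)']
print(sexpr_tokenize_sketch('c) d) e (f (g', strict=False))
# ['c', ')', 'd', ')', 'e', '(f (g']
```

This reproduces both documented examples, including the non-strict handling of unmatched parentheses.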