Package nltk :: Package tokenize

Package tokenize


Functions for tokenizing, i.e., dividing text strings into substrings.

Submodules

Classes
  WhitespaceTokenizer
A tokenizer that divides a string into substrings by treating any sequence of whitespace characters as a separator.
  SpaceTokenizer
A tokenizer that divides a string into substrings by treating any single space character as a separator.
  LineTokenizer
A tokenizer that divides a string into substrings by treating any single newline character as a separator.
  TabTokenizer
A tokenizer that divides a string into substrings by treating any single tab character as a separator.
  BlanklineTokenizer
A tokenizer that divides a string into substrings by treating any sequence of blank lines as a separator.
  WordTokenizer
A tokenizer that divides a text into sequences of alphabetic characters.
  WordPunctTokenizer
A tokenizer that divides a text into sequences of alphabetic and non-alphabetic characters.
  RegexpTokenizer
A tokenizer that splits a string into substrings using a regular expression.
  PunktSentenceTokenizer
A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
  PunktWordTokenizer
  SExprTokenizer
A tokenizer that divides strings into s-expressions.
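Most of the simple tokenizers above are thin wrappers around regular expressions. As a rough illustration only (plain `re`, not the NLTK classes themselves), the whitespace and word/punctuation splits behave like:

```python
import re

text = "Good muffins cost $3.88\nin New York."

# Like WhitespaceTokenizer: any run of whitespace is a separator
# (illustrative sketch, not the NLTK implementation).
whitespace_tokens = re.split(r"\s+", text.strip())

# Like WordPunctTokenizer: alternating runs of alphanumeric and
# non-alphanumeric, non-space characters (illustrative sketch).
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
print(wordpunct_tokens)
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']
```

Note how the word/punctuation split breaks "$3.88" apart while the whitespace split keeps it intact; which behavior is appropriate depends on the application.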
Functions

line_tokenize(text, blanklines='discard')

regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)
    Split the given text string based on the given regular expression pattern.

blankline_tokenize(text)

word_tokenize(text)

wordpunct_tokenize(text)

punkt_word_tokenize(s)
    Tokenize a string using the rules from the Punkt word tokenizer.

sexpr_tokenize(text)
    Tokenize the text into s-expressions.
Function Details

regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)

Split the given text string based on the given regular expression pattern. See the documentation for RegexpTokenizer.tokenize() for descriptions of the arguments.
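The default flags=56 is the numeric value of re.UNICODE | re.MULTILINE | re.DOTALL (32 + 8 + 16). A minimal sketch of the two modes, using plain re rather than the NLTK implementation (with gaps=True the pattern marks separators; with gaps=False the pattern marks the tokens themselves):

```python
import re

def regexp_tokenize_sketch(text, pattern, gaps=False, discard_empty=True,
                           flags=re.UNICODE | re.MULTILINE | re.DOTALL):
    """Illustrative sketch of regexp_tokenize, not the NLTK source."""
    if gaps:
        # Pattern describes the gaps between tokens: split on it.
        tokens = re.split(pattern, text, flags=flags)
        if discard_empty:
            tokens = [t for t in tokens if t]
        return tokens
    # Pattern describes the tokens: return every match.
    return re.findall(pattern, text, flags=flags)

print(regexp_tokenize_sketch("Hello, world!", r"\w+"))
# ['Hello', 'world']
print(regexp_tokenize_sketch("Hello, world!", r"\s+", gaps=True))
# ['Hello,', 'world!']
```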

sexpr_tokenize(text)

Tokenize the text into s-expressions. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. In particular, no special processing is done to exclude parentheses that occur inside strings or that follow backslash characters.

If the given expression contains non-matching parentheses, the tokenizer's behavior depends on the strict parameter to the constructor. If strict is True, a ValueError is raised. If strict is False, any unmatched close parenthesis is listed as its own s-expression, and the final partial s-expression with unmatched open parentheses is listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
Parameters:
  • text (string or iter(string)) - the string to be tokenized
Returns:
An iterator over tokens (each of which is an s-expression)
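The behavior described above can be sketched with a simple parenthesis-depth counter. This is an illustrative reimplementation, not the NLTK source, and it returns a list rather than an iterator:

```python
import re

def sexpr_tokenize_sketch(text, strict=True):
    """Depth-counting sketch of s-expression tokenization.

    Tokens outside parentheses are split on whitespace; a balanced
    parenthesized span is one token. Illustrative only.
    """
    result = []
    pos = 0
    depth = 0
    for m in re.finditer(r"[()]", text):
        if depth == 0:
            # Flush whitespace-separated tokens preceding this paren.
            result.extend(text[pos:m.start()].split())
            pos = m.start()
        if m.group() == "(":
            depth += 1
        else:
            if depth == 0:
                if strict:
                    raise ValueError("Un-matched close paren")
                # Non-strict: a stray ')' becomes its own token.
                result.append(")")
                pos = m.end()
                continue
            depth -= 1
            if depth == 0:
                # A balanced parenthesized span is one s-expression.
                result.append(text[pos:m.end()])
                pos = m.end()
    if depth > 0:
        if strict:
            raise ValueError("Un-matched open paren")
        # Non-strict: the trailing partial s-expression is one token.
        result.append(text[pos:])
    else:
        result.extend(text[pos:].split())
    return result

print(sexpr_tokenize_sketch('(a b (c d)) e f (g)'))
# ['(a b (c d))', 'e', 'f', '(g)']
print(sexpr_tokenize_sketch('c) d) e (f (g', strict=False))
# ['c', ')', 'd', ')', 'e', '(f (g']
```

This reproduces both documented examples, including the non-strict handling of unmatched parentheses.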