Package nltk :: Package tokenize :: Module regexp

Module regexp


Tokenizers that divide strings into substrings using regular expressions that can match either tokens or separators between tokens.

Classes
  RegexpTokenizer
A tokenizer that splits a string into substrings using a regular expression.
  BlanklineTokenizer
A tokenizer that divides a string into substrings by treating any sequence of blank lines as a separator.
  WordPunctTokenizer
A tokenizer that divides a text into sequences of alphabetic and non-alphabetic characters.
  WordTokenizer
A tokenizer that divides a text into sequences of alphabetic characters.
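The behavior of these tokenizer classes can be sketched with Python's built-in re module. The patterns below are assumptions inferred from the class descriptions above, not the classes' actual source; the real classes wrap such patterns in the RegexpTokenizer interface.

```python
import re

text = "Good muffins cost $3.88\nin New York."

# WordPunctTokenizer-style: runs of alphanumeric characters vs. runs of
# non-space punctuation (pattern assumed from the class description).
wordpunct = re.findall(r"\w+|[^\w\s]+", text)
# e.g. '$3.88' comes out as the separate tokens '$', '3', '.', '88'

# WordTokenizer-style: alphabetic/word sequences only, punctuation dropped
# (pattern assumed from the class description).
words = re.findall(r"\w+", text)

# BlanklineTokenizer-style: the pattern matches the *separator* (a run of
# blank lines), so the text is split rather than matched (separator regex
# is an assumption).
paras = re.split(r"\s*\n\s*\n\s*", "Para one.\n\nPara two.")
```

Note the two modes at work: the first two patterns describe the tokens themselves, while the blank-line pattern describes the gaps between tokens; RegexpTokenizer supports both via its gaps argument.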
Functions
    Tokenization Functions
 
regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)
Split the given text string based on the given regular expression pattern.

blankline_tokenize(text)

wordpunct_tokenize(text)

word_tokenize(text)
Function Details

regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)


Split the given text string based on the given regular expression pattern. See the documentation for RegexpTokenizer.tokenize() for descriptions of the arguments. The default flags value of 56 is the integer form of re.UNICODE | re.MULTILINE | re.DOTALL.
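A minimal sketch of the gaps semantics, using the stdlib re module rather than NLTK itself (the helper name regexp_tokenize_sketch and its internals are assumptions for illustration): with gaps=False the pattern matches the tokens, with gaps=True it matches the separators between them.

```python
import re

def regexp_tokenize_sketch(text, pattern, gaps=False, discard_empty=True,
                           flags=re.UNICODE | re.MULTILINE | re.DOTALL):
    # gaps=False: the pattern describes the tokens themselves.
    # gaps=True:  the pattern describes the separators, so we split on it.
    if gaps:
        toks = re.split(pattern, text, flags=flags)
        return [t for t in toks if t] if discard_empty else toks
    return re.findall(pattern, text, flags=flags)

s = "alpha, beta,  gamma"
toks_match = regexp_tokenize_sketch(s, r"\w+")              # match tokens
toks_gaps = regexp_tokenize_sketch(s, r",\s*", gaps=True)   # split on separators
# both yield ['alpha', 'beta', 'gamma']
```

The same token list can often be produced either way; gaps=True is the natural choice when the separator (a comma, a blank line) is easier to describe than the token.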