Package nltk :: Package tokenize :: Module regexp

Module regexp


Tokenizers that divide strings into substrings using regular expressions that can match either tokens or separators between tokens.

Classes
  RegexpTokenizer
A tokenizer that splits a string into substrings using a regular expression.
  BlanklineTokenizer
A tokenizer that divides a string into substrings by treating any sequence of blank lines as a separator.
  WordPunctTokenizer
A tokenizer that divides a text into sequences of alphabetic and non-alphabetic characters.
  WordTokenizer
A tokenizer that divides a text into sequences of alphabetic characters.
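The behavior of these tokenizer classes can be sketched with Python's built-in re module. The patterns below are assumptions inferred from the class descriptions above, not the classes' actual source; the real classes wrap such patterns in the RegexpTokenizer interface.

```python
import re

text = "Good muffins cost $3.88\nin New York."

# WordPunctTokenizer-style: runs of alphanumeric characters vs. runs of
# non-space punctuation (pattern assumed from the class description).
wordpunct = re.findall(r"\w+|[^\w\s]+", text)
# e.g. '$3.88' comes out as the separate tokens '$', '3', '.', '88'

# WordTokenizer-style: alphabetic/word sequences only, punctuation dropped
# (pattern assumed from the class description).
words = re.findall(r"\w+", text)

# BlanklineTokenizer-style: the pattern matches the *separator* (a run of
# blank lines), so the text is split rather than matched (separator regex
# is an assumption).
paras = re.split(r"\s*\n\s*\n\s*", "Para one.\n\nPara two.")
```

Note the two modes at work: the first two patterns describe the tokens themselves, while the blank-line pattern describes the gaps between tokens; RegexpTokenizer supports both via its gaps argument.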
Functions
    Tokenization Functions
 
regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)
Split the given text string based on the given regular expression pattern.

blankline_tokenize(text)

wordpunct_tokenize(text)

word_tokenize(text)
Function Details

regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)


Split the given text string based on the given regular expression pattern. See the documentation for RegexpTokenizer.tokenize() for descriptions of the arguments. The default flags value of 56 is the integer form of re.UNICODE | re.MULTILINE | re.DOTALL.
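A minimal sketch of the gaps semantics, using the stdlib re module rather than NLTK itself (the helper name regexp_tokenize_sketch and its internals are assumptions for illustration): with gaps=False the pattern matches the tokens, with gaps=True it matches the separators between them.

```python
import re

def regexp_tokenize_sketch(text, pattern, gaps=False, discard_empty=True,
                           flags=re.UNICODE | re.MULTILINE | re.DOTALL):
    # gaps=False: the pattern describes the tokens themselves.
    # gaps=True:  the pattern describes the separators, so we split on it.
    if gaps:
        toks = re.split(pattern, text, flags=flags)
        return [t for t in toks if t] if discard_empty else toks
    return re.findall(pattern, text, flags=flags)

s = "alpha, beta,  gamma"
toks_match = regexp_tokenize_sketch(s, r"\w+")              # match tokens
toks_gaps = regexp_tokenize_sketch(s, r",\s*", gaps=True)   # split on separators
# both yield ['alpha', 'beta', 'gamma']
```

The same token list can often be produced either way; gaps=True is the natural choice when the separator (a comma, a blank line) is easier to describe than the token.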