Module nltk.tokenize.regexp
Class RegexpTokenizer

    object --+
             |
api.TokenizerI --+
                 |
                RegexpTokenizer
A tokenizer that splits a string into substrings using a regular expression. The regular expression can be specified to match either tokens or separators between tokens.

Unlike re.findall() and re.split(), RegexpTokenizer does not treat regular expressions that contain grouping parentheses specially.
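For contrast, the special treatment that the standard library gives to grouping parentheses can be seen with re alone. The sketch below uses only the standard re module, not RegexpTokenizer; the sample text and patterns are made up for illustration.

```python
import re

text = "ab-cd-ef"

# Without a group, findall returns the whole matches.
whole = re.findall(r"\w+", text)

# With a capturing group, findall returns only the group's contents.
grouped = re.findall(r"(\w)\w", text)

# With a capturing group, split also returns the matched separators.
split_grouped = re.split(r"(-)", text)

# A non-capturing group (?:...) avoids the special treatment in plain re.
split_plain = re.split(r"(?:-)", text)

print(whole)          # ['ab', 'cd', 'ef']
print(grouped)        # ['a', 'c', 'e']
print(split_grouped)  # ['ab', '-', 'cd', '-', 'ef']
print(split_plain)    # ['ab', 'cd', 'ef']
```

This is the behavior that the class description says RegexpTokenizer does not share.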

__init__(self, pattern, gaps=False, discard_empty=True, flags=56)
Construct a new tokenizer that splits strings using the given regular expression pattern.
tokenize(self, text)
Divide the given string into a list of substrings.
Inherited from api.TokenizerI: batch_tokenize

Instance variables:

  • _pattern (str) - The pattern used to build this tokenizer.
  • _gaps (bool) - True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
  • _discard_empty (bool) - True if any empty tokens ('') generated by the tokenizer should be discarded.
  • _flags (int) - The flags used to compile this tokenizer's pattern.
  • _regexp - The compiled regular expression used to tokenize texts.
__init__(self, pattern, gaps=False, discard_empty=True, flags=56)

Construct a new tokenizer that splits strings using the given regular expression pattern. By default, pattern is used to find tokens; if gaps is set to True, pattern is used to find separators between tokens instead.

  • pattern (str) - The pattern used to build this tokenizer. This pattern may safely contain grouping parentheses.
  • gaps (bool) - True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
  • discard_empty (bool) - True if any empty tokens ('') generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps is true.
  • flags (int) - The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
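The way these parameters interact can be illustrated with a minimal sketch built on the standard re module. This is not NLTK's implementation: sketch_tokenize and DEFAULT_FLAGS are hypothetical names introduced here, and the sketch assumes the pattern passed with gaps=True contains no capturing groups (re.split would otherwise return them as well).

```python
import re

# The documented default flag combination; as an integer it equals 56.
DEFAULT_FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL


def sketch_tokenize(text, pattern, gaps=False, discard_empty=True,
                    flags=DEFAULT_FLAGS):
    """Hypothetical helper mimicking the documented parameter semantics."""
    regexp = re.compile(pattern, flags)
    if gaps:
        # The pattern matches separators; tokens are the text between them.
        tokens = regexp.split(text)
        if discard_empty:
            # Adjacent or leading/trailing separators yield '' tokens.
            tokens = [tok for tok in tokens if tok]
        return tokens
    # The pattern matches the tokens themselves; group(0) is the whole match.
    return [match.group(0) for match in regexp.finditer(text)]


text = " one  two "
as_tokens = sketch_tokenize(text, r"\w+")
as_gaps = sketch_tokenize(text, r"\s", gaps=True)
with_empty = sketch_tokenize(text, r"\s", gaps=True, discard_empty=False)

print(as_tokens)   # ['one', 'two']
print(as_gaps)     # ['one', 'two']
print(with_empty)  # ['', 'one', '', 'two', '']
```

Empty strings appear only in the gaps=True case, which matches the note on discard_empty above.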
tokenize(self, text)

Divide the given string into a list of substrings.

Returns: list of str
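A short usage sketch, assuming an installed NLTK in which RegexpTokenizer can be imported from nltk.tokenize; the sample sentence and variable names are illustrative only.

```python
from nltk.tokenize import RegexpTokenizer

text = "Good muffins cost $3.88 in New York."

# Default mode (gaps=False): the pattern describes the tokens.
word_tokenizer = RegexpTokenizer(r"\w+")
words = word_tokenizer.tokenize(text)

# gaps=True: the pattern describes the separators between tokens.
gap_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
chunks = gap_tokenizer.tokenize(text)

print(words)   # ['Good', 'muffins', 'cost', '3', '88', 'in', 'New', 'York']
print(chunks)  # ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
```

Matching tokens with \w+ drops punctuation, while splitting on whitespace keeps it attached to the neighboring word.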
__repr__(self)
(Representation operator)
