nltk.tokenize.regexp.RegexpTokenizer

A tokenizer that splits a string into substrings using a regular expression. The regular expression can be specified to match either tokens or separators between tokens.

Unlike re.findall() and re.split(), RegexpTokenizer does not treat regular expressions that contain grouping parentheses specially.
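
For context, the special treatment being contrasted here is standard-library behaviour: re.findall() returns only the group text when the pattern has a capturing group, and re.split() keeps captured separators in its result. The short sketch below uses only the re module; the sample strings and variable names are illustrative assumptions and are not taken from NLTK.

```python
import re

# re.findall(): with a capturing group, only the group's text is returned.
numbers = "10.5 kg, 3 kg"
with_group = re.findall(r"\d+(\.\d+)?", numbers)       # group text only
without_group = re.findall(r"\d+(?:\.\d+)?", numbers)  # whole matches

# re.split(): with a capturing group, the separators are kept in the result.
fields = "a,b"
split_with_group = re.split(r"(,)", fields)
split_without_group = re.split(r",", fields)

print(with_group)           # ['.5', '']
print(without_group)        # ['10.5', '3']
print(split_with_group)     # ['a', ',', 'b']
print(split_without_group)  # ['a', 'b']
```

A pattern written with non-capturing groups, (?:...), sidesteps both effects in plain re code.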

__init__(self, pattern, gaps=False, discard_empty=True, flags=56)
(Constructor)

Construct a new tokenizer that splits strings using the given regular expression pattern. By default, pattern is used to find tokens; if gaps is set to True, pattern is used to find the separators between tokens instead.

Parameters:

pattern (str) - The pattern used to build this tokenizer. This pattern may safely contain grouping parentheses.
gaps (bool) - True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
discard_empty (bool) - True if any empty tokens ('') generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps is True.
flags (int) - The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
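
The parameters above can be illustrated with a minimal standard-library sketch. The helper sketch_tokenize and the constant DEFAULT_FLAGS are hypothetical names introduced here; this is not NLTK's implementation, and it ignores the handling of grouping parentheses. With NLTK itself, the corresponding call goes through the constructor and tokenize(self, text) documented on this page.

```python
import re

# The documented default, flags=56, is the bitwise OR of these three flags.
DEFAULT_FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL


def sketch_tokenize(text, pattern, gaps=False, discard_empty=True,
                    flags=DEFAULT_FLAGS):
    """Hypothetical stand-in that mirrors the documented parameters."""
    regexp = re.compile(pattern, flags)
    if gaps:
        # The pattern describes separators, so split on it.
        tokens = regexp.split(text)
        if discard_empty:
            tokens = [tok for tok in tokens if tok]
        return tokens
    # The pattern describes the tokens themselves.
    return regexp.findall(text)


print(int(DEFAULT_FLAGS))                                # 56
print(sketch_tokenize("Hello, world!", r"\w+"))          # ['Hello', 'world']
print(sketch_tokenize("Hello, world!", r"\s+", gaps=True))
# ['Hello,', 'world!']
print(sketch_tokenize("a,,b", r",", gaps=True))          # ['a', 'b']
print(sketch_tokenize("a,,b", r",", gaps=True, discard_empty=False))
# ['a', '', 'b']
```

The last two calls show why discard_empty only matters when gaps is True: splitting on adjacent separators is what produces empty strings.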

Overrides: object.__init__

Class RegexpTokenizer

__init__(self, pattern, gaps=False, discard_empty=True, flags=56)
(Constructor)

tokenize(self, text)

__repr__(self)
(Representation operator)

_discard_empty
