Package nltk :: Package tokenize :: Module regexp :: Class RegexpTokenizer

Class RegexpTokenizer

    object --+    
             |    
api.TokenizerI --+
                 |
                RegexpTokenizer
Known Subclasses:

A tokenizer that splits a string into substrings using a regular expression. The regular expression can be specified to match either tokens or separators between tokens.

Unlike re.findall() and re.split(), RegexpTokenizer does not treat regular expressions that contain grouping parentheses specially.
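
The special treatment that this class avoids can be seen with the standard re module alone. The following sketch uses only re, not RegexpTokenizer itself; the sample text and variable names are illustrative assumptions.

```python
import re

text = "ab12 cd34"

# With a capturing group, re.findall() returns the group's contents
# instead of the whole match.
findall_grouped = re.findall(r"([a-z]+)\d+", text)
# With a non-capturing group, it returns the whole match.
findall_plain = re.findall(r"(?:[a-z]+)\d+", text)

# With a capturing group, re.split() keeps the separators in its result.
split_grouped = re.split(r"(\s+)", text)
# With a non-capturing group, the separators are dropped.
split_plain = re.split(r"(?:\s+)", text)

print(findall_grouped)  # ['ab', 'cd']
print(findall_plain)    # ['ab12', 'cd34']
print(split_grouped)    # ['ab12', ' ', 'cd34']
print(split_plain)      # ['ab12', 'cd34']
```

Going by the description above, RegexpTokenizer is expected to behave like the non-capturing variants even when the pattern is written with plain grouping parentheses.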

Instance Methods
 
__init__(self, pattern, gaps=False, discard_empty=True, flags=56)
Construct a new tokenizer that splits strings using the given regular expression pattern.
 
tokenize(self, text)
Divide the given string into a list of substrings.
 
__repr__(self)
repr(x)

Inherited from api.TokenizerI: batch_tokenize

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Instance Variables
  _pattern
The pattern used to build this tokenizer.
  _gaps
True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
  _discard_empty
True if any empty tokens ('') generated by the tokenizer should be discarded.
  _flags
The flags used to compile this tokenizer's pattern.
  _regexp
The compiled regular expression used to tokenize texts.
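
The default flags=56 in the constructor signature is the integer form of re.UNICODE | re.MULTILINE | re.DOTALL. A quick check with the standard re module (the variable name is illustrative):

```python
import re

# Combine the three flags named as the constructor's defaults.
default_flags = re.UNICODE | re.MULTILINE | re.DOTALL

print(int(default_flags))  # 56
# The individual flag values that add up to the default.
print(int(re.UNICODE), int(re.MULTILINE), int(re.DOTALL))  # 32 8 16
```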
Properties

Inherited from object: __class__

Method Details

__init__(self, pattern, gaps=False, discard_empty=True, flags=56)
(Constructor)

Construct a new tokenizer that splits strings using the given regular expression pattern. By default, pattern is used to find tokens; if gaps is set to True, pattern is used to find the separators between tokens instead.

Parameters:
  • pattern (str) - The pattern used to build this tokenizer. This pattern may safely contain grouping parentheses.
  • gaps (bool) - True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
  • discard_empty (bool) - True if any empty tokens ('') generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps is true.
  • flags (int) - The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
Overrides: object.__init__
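
The two modes selected by gaps can be sketched with the standard re module. This is a minimal illustration of the documented behavior, not NLTK's implementation; sketch_tokenize and the sample text are hypothetical names introduced here.

```python
import re


def sketch_tokenize(text, pattern, gaps=False, discard_empty=True,
                    flags=re.UNICODE | re.MULTILINE | re.DOTALL):
    """Hypothetical stand-in for the documented behavior (not NLTK's code)."""
    regexp = re.compile(pattern, flags)
    if not gaps:
        # Pattern matches tokens: keep each whole match (group 0),
        # so grouping parentheses do not change the result.
        return [m.group(0) for m in regexp.finditer(text)]
    # Pattern matches separators: keep the text between matches.
    tokens, start = [], 0
    for m in regexp.finditer(text):
        tokens.append(text[start:m.start()])
        start = m.end()
    tokens.append(text[start:])
    if discard_empty:
        tokens = [t for t in tokens if t]
    return tokens


text = "Good muffins cost $3.88"
by_tokens = sketch_tokenize(text, r"\w+|\$[\d.]+")
by_gaps = sketch_tokenize(text, r"\s+", gaps=True)

print(by_tokens)  # ['Good', 'muffins', 'cost', '$3.88']
print(by_gaps)    # ['Good', 'muffins', 'cost', '$3.88']
```

Both calls give the same tokens here: the first pattern describes what a token looks like, while the second describes the whitespace that separates tokens.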

tokenize(self, text)

Divide the given string into a list of substrings.

Returns:
list of str
Overrides: api.TokenizerI.tokenize
(inherited documentation)

__repr__(self)
(Representation operator)

repr(x)

Overrides: object.__repr__
(inherited documentation)

Instance Variable Details

_discard_empty

True if any empty tokens ('') generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps is true.
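
To see where empty tokens come from in gap mode, consider adjacent or trailing separators. The sketch below uses re.split() from the standard library as a stand-in for gap-mode tokenizing; it illustrates the idea rather than the tokenizer's own code, and the sample string is made up.

```python
import re

text = "a,,b,"

# Splitting on each comma yields '' between adjacent separators
# and after the trailing one.
raw_tokens = re.split(r",", text)
# Drop the empty strings, as discard_empty=True is described as doing.
kept_tokens = [t for t in raw_tokens if t]

print(raw_tokens)   # ['a', '', 'b', '']
print(kept_tokens)  # ['a', 'b']
```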