Package nltk :: Package tokenize :: Module regexp :: Class WordTokenizer

Class WordTokenizer


    object --+        
             |        
api.TokenizerI --+    
                 |    
   RegexpTokenizer --+
                     |
                    WordTokenizer

A tokenizer that divides a text into sequences of alphabetic characters. Any non-alphabetic characters are discarded. E.g.:

>>> WordTokenizer().tokenize("She said 'hello'.")
['She', 'said', 'hello']
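
The same behaviour can be reproduced with RegexpTokenizer directly. The exact pattern WordTokenizer uses internally is not shown on this page, so the alphabetic pattern below is an assumption that matches the documented behaviour:

>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> # Assumed pattern: maximal runs of alphabetic characters.
>>> RegexpTokenizer(r'[a-zA-Z]+').tokenize("She said 'hello'.")
['She', 'said', 'hello']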
Instance Methods
 
__init__(self)
Construct a new tokenizer that splits strings using the given regular expression pattern.

Inherited from RegexpTokenizer: __repr__, tokenize

Inherited from api.TokenizerI: batch_tokenize

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__
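
As a rough sketch of the inherited batch_tokenize method, which per the TokenizerI interface applies tokenize to each string in a list (the second sentence below is an illustrative assumption):

>>> texts = ["She said 'hello'.", "Then she left."]
>>> WordTokenizer().batch_tokenize(texts)
[['She', 'said', 'hello'], ['Then', 'she', 'left']]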

Instance Variables

Inherited from RegexpTokenizer (private): _discard_empty, _flags, _gaps, _pattern, _regexp

Properties

Inherited from object: __class__

Method Details

__init__(self) (Constructor)

Construct a new tokenizer that splits strings using the given regular expression pattern. By default, the pattern will be used to find tokens; but if gaps is set to True, then the pattern will be used to find separators between tokens instead (see the sketch below the parameter list).

Parameters:
  • pattern - The pattern used to build this tokenizer. This pattern may safely contain grouping parentheses.
  • gaps - True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
  • discard_empty - True if any empty tokens ('') generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps is True.
  • flags - The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
Overrides: RegexpTokenizer.__init__
(inherited documentation)
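
A minimal sketch of the gaps and discard_empty parameters, using RegexpTokenizer directly (the patterns below are illustrative choices, not the ones WordTokenizer uses):

>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> # gaps=False (the default): the pattern matches the tokens themselves.
>>> RegexpTokenizer(r'[a-zA-Z]+', gaps=False).tokenize("She said 'hello'.")
['She', 'said', 'hello']
>>> # gaps=True: the pattern matches separators between tokens; with
>>> # discard_empty=True (the default), any empty tokens are dropped.
>>> RegexpTokenizer(r'\s+', gaps=True).tokenize("She said 'hello'.")
['She', 'said', "'hello'."]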