object --+
         |
         api.TokenizerI --+
                          |
                          RegexpTokenizer --+
                                            |
                                            WordPunctTokenizer
A tokenizer that divides a text into sequences of alphabetic and
non-alphabetic characters. E.g.:

>>> WordPunctTokenizer().tokenize("She said 'hello'.")
['She', 'said', "'", 'hello', "'."]
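
As a sketch of what this class does, it behaves like a RegexpTokenizer
built from a word/punctuation pattern. The pattern r"\w+|[^\w\s]+" below
is an assumption (it is not stated on this page), but it reproduces the
example output above: runs of word characters, or runs of characters that
are neither word characters nor whitespace.

>>> from nltk.tokenize import RegexpTokenizer
>>> RegexpTokenizer(r"\w+|[^\w\s]+").tokenize("She said 'hello'.")
['She', 'said', "'", 'hello', "'."]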
Construct a new tokenizer that splits strings using the given regular
expression. (The constructor is inherited from RegexpTokenizer.)
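
A minimal sketch of that inherited constructor with a custom pattern; the
pattern r"\w+" here is illustrative only, keeping runs of word characters
and discarding punctuation:

>>> from nltk.tokenize import RegexpTokenizer
>>> RegexpTokenizer(r"\w+").tokenize("She said 'hello'.")
['She', 'said', 'hello']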