Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the list of sentences or words in a string. Tokenizers are implemented in NLTK as subclasses of the nltk.tokenize.TokenizerI interface, which defines the tokenize() method. Here's an example of their use:
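A minimal sketch of the interface in use, assuming NLTK is installed (TreebankWordTokenizer is just one convenient TokenizerI subclass):

```python
from nltk.tokenize import TreebankWordTokenizer

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks."
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(s)
# Punctuation and currency symbols are split into their own tokens:
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
#  'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
```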
Many of the tokenization methods are also available as simple functions, which save you the trouble of creating a tokenizer object every time you want to tokenize a string. For example:
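For instance, wordpunct_tokenize() is a purely regexp-based function, so it needs no tokenizer object or trained model (a sketch, assuming NLTK is installed):

```python
from nltk.tokenize import wordpunct_tokenize

# Runs of alphanumeric characters and runs of punctuation become tokens:
tokens = wordpunct_tokenize("Good muffins cost $3.88 in New York.")
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']
```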
The following tokenizers, defined in nltk.tokenize.simple, just divide the string using the string split() method.
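A sketch of the three simple tokenizer classes, assuming NLTK is installed:

```python
from nltk.tokenize import SpaceTokenizer, TabTokenizer, LineTokenizer

# SpaceTokenizer: equivalent to s.split(' ')
space_toks = SpaceTokenizer().tokenize("Good muffins cost $3.88\nin New York.")
# ['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.']

# TabTokenizer: equivalent to s.split('\t')
tab_toks = TabTokenizer().tokenize("a\tb c\td")
# ['a', 'b c', 'd']

# LineTokenizer: splits on newlines; blank lines are discarded by default
line_toks = LineTokenizer(blanklines='discard').tokenize("line one\n\nline two\n")
# ['line one', 'line two']
```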
The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:
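For example, using str.split() directly:

```python
s = "Good muffins cost $3.88\nin New York."

words = s.split()       # split on runs of any whitespace
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
spaces = s.split(' ')   # split on single space characters
# ['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.']
lines = s.split('\n')   # split on newlines
# ['Good muffins cost $3.88', 'in New York.']
```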
The simple tokenizers are mainly useful because they follow the standard TokenizerI interface, and so can be used with any code that expects a tokenizer. For example, these tokenizers can be used to specify the tokenization conventions when building a CorpusReader.
RegexpTokenizer splits a string into substrings using a regular expression. By default, any substrings matched by this regexp will be returned as tokens. For example, the following tokenizer selects out only capitalized words, and throws everything else away:
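A sketch of such a tokenizer, assuming NLTK is installed:

```python
from nltk.tokenize import RegexpTokenizer

# Each match of the pattern becomes a token; everything else is dropped.
capword_tokenizer = RegexpTokenizer(r'[A-Z]\w+')
caps = capword_tokenizer.tokenize(
    "Good muffins cost $3.88 in New York.  Please buy me two.")
# ['Good', 'New', 'York', 'Please']
```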
The following tokenizer forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences:
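A sketch of that pattern, assuming NLTK is installed:

```python
from nltk.tokenize import RegexpTokenizer

# Alternatives are tried left-to-right: word characters first, then
# money expressions, then any remaining non-whitespace run.
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
toks = tokenizer.tokenize(
    "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.")
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
#  'Please', 'buy', 'me', 'two', 'of', 'them', '.']
```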
RegexpTokenizers can be told to use their regexp pattern to match separators between tokens, using gaps=True:
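A sketch of the gaps=True behavior, assuming NLTK is installed:

```python
from nltk.tokenize import RegexpTokenizer

# With gaps=True the pattern matches the *separators*, and the text
# between matches becomes the tokens (empty tokens are discarded).
gap_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
toks = gap_tokenizer.tokenize("Good muffins cost $3.88\nin New York.")
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
```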
The nltk.tokenize.regexp module contains several subclasses of RegexpTokenizer that use pre-defined regular expressions:
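A sketch of three of these subclasses, assuming NLTK is installed:

```python
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, BlanklineTokenizer

s = "Good muffins cost $3.88\nin New York.\n\nThanks."

# WhitespaceTokenizer: maximal runs of non-whitespace (like s.split())
ws = WhitespaceTokenizer().tokenize(s)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Thanks.']

# WordPunctTokenizer: alphanumeric runs and punctuation runs
wp = WordPunctTokenizer().tokenize(s)
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Thanks', '.']

# BlanklineTokenizer: paragraphs separated by blank lines
bl = BlanklineTokenizer().tokenize(s)
# ['Good muffins cost $3.88\nin New York.', 'Thanks.']
```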
All of the regular expression tokenizers are also available as simple functions:
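For example, regexp_tokenize() gives the same results as RegexpTokenizer without constructing an object (a sketch, assuming NLTK is installed):

```python
from nltk.tokenize import regexp_tokenize

# Note the argument order: text first, then the pattern.
toks = regexp_tokenize("Good muffins cost $3.88 in New York.",
                       pattern=r'\w+|\$[\d\.]+|\S+')
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
```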
Warning
The function regexp_tokenize() takes the text as its first argument, and the regular expression pattern as its second argument. This differs from the conventions used by Python's re functions, where the pattern is always the first argument. But regexp_tokenize() is primarily a tokenization function, so we chose to follow the convention among other tokenization functions that the text should always be the first argument.
SExprTokenizer is used to find parenthesized expressions in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions), or other whitespace-separated tokens.
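A sketch, assuming NLTK is installed:

```python
from nltk.tokenize import SExprTokenizer

# Each balanced parenthesized expression is one token; anything outside
# parentheses is split on whitespace.
toks = SExprTokenizer().tokenize('(a b (c d)) e f (g)')
# ['(a b (c d))', 'e', 'f', '(g)']
```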
By default, SExprTokenizer will raise a ValueError exception if used to tokenize an expression with non-matching parentheses:
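A sketch of that failure mode, assuming NLTK is installed:

```python
from nltk.tokenize import SExprTokenizer

try:
    SExprTokenizer().tokenize('c) d) e (f (g')
except ValueError as err:
    # The unmatched close paren at position 1 triggers the error.
    error = err
```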
But the strict argument can be set to False to allow for non-matching parentheses. Any unmatched close parentheses will be listed as their own s-expressions, and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:
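A sketch, assuming NLTK is installed:

```python
from nltk.tokenize import SExprTokenizer

# Unmatched close parens become their own tokens; the trailing partial
# s-expression with unmatched open parens is kept as one token.
toks = SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
# ['c', ')', 'd', ')', 'e', '(f (g']
```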
The characters used for open and close parentheses may be customized using the parens argument to the SExprTokenizer constructor:
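A sketch using curly braces as the paren characters, assuming NLTK is installed:

```python
from nltk.tokenize import SExprTokenizer

# parens is a two-character string: open paren, then close paren.
toks = SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
# ['{a b {c d}}', 'e', 'f', '{g}']
```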
The s-expression tokenizer is also available as a function:
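A sketch of the function form, assuming NLTK is installed:

```python
from nltk.tokenize.sexpr import sexpr_tokenize

# sexpr_tokenize uses a default SExprTokenizer (round parens, strict mode).
toks = sexpr_tokenize('(a b (c d)) e f (g)')
# ['(a b (c d))', 'e', 'f', '(g)']
```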
The PunktSentenceTokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. The algorithm for this tokenizer is described in Kiss & Strunk (2006):
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
The NLTK data package includes a pre-trained Punkt tokenizer for English.
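For real use you would load that pre-trained model with nltk.data.load('tokenizers/punkt/english.pickle'), which requires the "punkt" data package to be downloaded. As a self-contained sketch, an untrained PunktSentenceTokenizer falls back on the default boundary heuristics (no learned abbreviation list), which is enough to split simple text:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained tokenizer: default boundary heuristics only, used here so the
# example does not depend on the downloaded punkt model.
tokenizer = PunktSentenceTokenizer()
sents = tokenizer.tokenize("Hello world. This is a test. Final sentence here.")
# Three sentences, each ending with its period.
```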
(Note that whitespace from the original text, including newlines, is retained in the output.)
The nltk.tokenize.punkt module also defines PunktWordTokenizer, which uses a series of regular expressions to divide a text into word tokens:
This word tokenizer is also available as a function:
Some additional test strings.
Make sure that grouping parentheses don't confuse the tokenizer:
Make sure that named groups don't confuse the tokenizer:
Make sure that nested groups don't confuse the tokenizer:
The tokenizer should reject any patterns with backreferences:
A simple sentence tokenizer: '\.(\s+|$)'
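A sketch of that pattern in use with gaps=True. The group is written here as non-capturing, (?:\s+|$), an assumption made so that the splitting regexp returns only the text between separators:

```python
from nltk.tokenize import RegexpTokenizer

# Separators are periods followed by whitespace or end-of-string, so the
# period inside $3.88 does not end a sentence.
sent_tokenizer = RegexpTokenizer(r'\.(?:\s+|$)', gaps=True)
sents = sent_tokenizer.tokenize(
    "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.")
# ['Good muffins cost $3.88 in New York', 'Please buy me two of them', 'Thanks']
```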