Package nltk :: Package tokenize :: Module sexpr

Module sexpr

A tokenizer that divides strings into s-expressions. E.g.:

>>> sexpr_tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
Classes
  SExprTokenizer
A tokenizer that divides strings into s-expressions.
Functions
 
sexpr_tokenize(text)
Tokenize the text into s-expressions.
 
demo()
Function Details

sexpr_tokenize(text)

Tokenize the text into s-expressions. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. In particular, no special processing is done to exclude parentheses that occur inside strings or that follow backslash characters.
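
To illustrate the point about strings, here is a minimal sketch (not NLTK's code) that tracks nesting depth the way the description above implies, counting every parenthesis regardless of quoting or escaping. The helper name paren_depths is hypothetical and exists only for this illustration.

```python
def paren_depths(text):
    """Return the nesting depth after each character of text.

    Hypothetical helper for illustration only: every '(' and ')' counts,
    including ones inside quotes or after a backslash.
    """
    depths = []
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        depths.append(depth)
    return depths


# The quoted ')' already brings the depth back to 0, so a depth-based
# tokenizer would treat the s-expression as closed at that point.
print(paren_depths('(")")'))  # [1, 1, 0, 0, -1]
```

Input that carries parentheses inside string literals would therefore need to be pre-processed by the caller before tokenizing.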

If the given expression contains non-matching parentheses, the behavior of the tokenizer depends on the strict parameter passed to the SExprTokenizer constructor. If strict is True, a ValueError is raised. If strict is False, each unmatched close parenthesis is listed as its own s-expression, and the last partial s-expression with unmatched open parentheses is listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
Parameters:
  • text (string or iter(string)) - the string to be tokenized
Returns:
An iterator over tokens (each of which is an s-expression)
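
The behavior described above can be approximated with a depth counter. The following is a minimal sketch under stated assumptions, not the NLTK implementation: the function name sketch_sexpr_tokenize and its strict keyword are hypothetical stand-ins for the strict constructor parameter, and it returns a list (as in the example output shown) rather than an iterator.

```python
import re


def sketch_sexpr_tokenize(text, strict=True):
    """Split text into s-expression tokens (illustrative sketch only)."""
    tokens = []
    pos = 0    # start of the text that has not been emitted yet
    depth = 0  # current parenthesis nesting depth
    for match in re.finditer(r"[()]", text):
        if depth == 0:
            # Outside any s-expression: whitespace-separated words are tokens.
            tokens.extend(text[pos:match.start()].split())
            pos = match.start()
        if match.group() == "(":
            depth += 1
        elif depth == 0:
            # A ')' with nothing open.
            if strict:
                raise ValueError(
                    "unmatched close parenthesis at index %d" % match.start()
                )
            tokens.append(")")  # non-strict: stray ')' is its own token
            pos = match.end()
        else:
            depth -= 1
            if depth == 0:
                # A top-level s-expression just closed.
                tokens.append(text[pos:match.end()])
                pos = match.end()
    if depth > 0:
        if strict:
            raise ValueError("unmatched open parenthesis at index %d" % pos)
        tokens.append(text[pos:])  # non-strict: keep the partial s-expression
    else:
        tokens.extend(text[pos:].split())
    return tokens


print(sketch_sexpr_tokenize('(a b (c d)) e f (g)'))
print(sketch_sexpr_tokenize('c) d) e (f (g', strict=False))
```

On the two inputs used in the examples above, this sketch yields the same token lists as documented. With strict=True, the second input raises ValueError at the first unmatched close parenthesis.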