nltk.tokenize.sexpr.SExprTokenizer

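A tokenizer that divides strings into s-expressions. An s-expression can be either:

- a parenthesized expression, including any nested parenthesized expressions, or
- a sequence of non-whitespace, non-parenthesis characters.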

For example, the string '(a (b c)) d e (f)' consists of four s-expressions: '(a (b c))', 'd', 'e', and '(f)'.

__init__(self, parens='()', strict=True)
(Constructor)

Construct a new SExpr tokenizer. By default, the characters '(' and ')' are treated as open and close parentheses, but alternative strings may be specified.

Parameters:

parens (str or list) - A two-element sequence specifying the open and close parentheses that should be used to find s-expressions. This will typically be either a two-character string or a list of two strings.
strict - If true, raise an exception when tokenizing an ill-formed s-expression.

Overrides: object.__init__
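
As an illustration of the parens parameter, the following short example builds a tokenizer that uses curly braces as delimiters. It assumes NLTK is installed and that SExprTokenizer is importable from nltk.tokenize; the variable names are chosen for this example only.

```python
from nltk.tokenize import SExprTokenizer

# Treat '{' and '}' as the open and close delimiters instead of '(' and ')'.
brace_tokenizer = SExprTokenizer(parens='{}')

tokens = brace_tokenizer.tokenize('{a b {c d}} e f {g}')
print(tokens)  # ['{a b {c d}}', 'e', 'f', '{g}']
```

With this setting, ordinary parentheses in the input are no longer delimiters and are treated like any other character.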

tokenize(self, text)

Tokenize the text into s-expressions. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. In particular, no special processing is done to exclude parentheses that occur inside strings or that follow backslash characters.

If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, a ValueError is raised. If strict is False, any unmatched close parentheses are listed as their own s-expressions, and the last partial s-expression with unmatched open parentheses is listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
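
The behavior described above can be sketched in plain Python as a single scan that tracks nesting depth. This is a minimal sketch written for illustration, not NLTK's own source: the function name split_sexprs and its internals are hypothetical, and it only aims to reproduce the two examples shown in this section.

```python
import re


def split_sexprs(text, parens="()", strict=True):
    """Hypothetical helper: split text into s-expressions by tracking depth."""
    open_paren, close_paren = parens  # works for a 2-char string or a 2-item list
    paren_re = re.compile(
        "|".join(re.escape(p) for p in (open_paren, close_paren))
    )
    tokens = []
    pos = 0    # start of the text that has not been emitted yet
    depth = 0  # current nesting depth
    for match in paren_re.finditer(text):
        if depth == 0:
            # Outside any parentheses: each bare word is its own token.
            tokens.extend(text[pos:match.start()].split())
            pos = match.start()
        if match.group() == open_paren:
            depth += 1
        elif depth == 0:
            # Close delimiter with nothing open.
            if strict:
                raise ValueError("unmatched close paren at char %d" % match.start())
            tokens.append(match.group())
            pos = match.end()
        else:
            depth -= 1
            if depth == 0:
                # A complete top-level parenthesized expression.
                tokens.append(text[pos:match.end()])
                pos = match.end()
    if depth > 0:
        # Input ended while a parenthesized expression was still open.
        if strict:
            raise ValueError("unmatched open paren at char %d" % pos)
        tokens.append(text[pos:])
    else:
        tokens.extend(text[pos:].split())
    return tokens


well_formed = split_sexprs('(a b (c d)) e f (g)')
lenient = split_sexprs('c) d) e (f (g', strict=False)
print(well_formed)  # ['(a b (c d))', 'e', 'f', '(g)']
print(lenient)      # ['c', ')', 'd', ')', 'e', '(f (g']
```

Because the scan only looks at the delimiter characters, a parenthesis inside a quoted string or after a backslash changes the depth like any other, which matches the caveat stated earlier.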

Parameters:

text (string or iter(string)) - the string to be tokenized

Returns:

An iterator over tokens (each of which is an s-expression)

Overrides: api.TokenizerI.tokenize

Class SExprTokenizer

__init__(self, parens='()', strict=True) (Constructor)

tokenize(self, text)
