Class SExprTokenizer

        object --+
                 |
    api.TokenizerI --+
                     |
                    SExprTokenizer

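A tokenizer that divides strings into s-expressions. An s-expression can be either:

  • a parenthesized expression, including any nested parenthesized expressions, or
  • a sequence of non-whitespace, non-parenthesis characters.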

For example, the string '(a (b c)) d e (f)' consists of four s-expressions: '(a (b c))', 'd', 'e', and '(f)'.

__init__(self, parens='()', strict=True)
Construct a new SExpr tokenizer.
tokenize(self, text)
Tokenize the text into s-expressions.
Inherited from api.TokenizerI: batch_tokenize





__init__(self, parens='()', strict=True)

Construct a new SExpr tokenizer. By default, the characters '(' and ')' are treated as open and close parentheses, but alternative strings may be specified.

  • parens (str or list) - A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string or a list of two strings.
  • strict - If true, raise a ValueError when tokenizing an ill-formed sexpr.
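
The two accepted forms of parens can be illustrated with a minimal sketch. The helper name paren_regex is hypothetical and is not part of NLTK; the sketch only assumes that the two delimiters are matched literally, which is one plausible way to locate them in the text.

```python
import re


def paren_regex(parens="()"):
    """Build a regex matching either delimiter (hypothetical helper, not NLTK API)."""
    if len(parens) != 2:
        raise ValueError("parens must contain exactly two strings")
    # Unpacking works for a two-character string and for a list of two strings.
    open_paren, close_paren = parens
    # Escape the delimiters so characters such as '(' are matched literally.
    return re.compile("%s|%s" % (re.escape(open_paren), re.escape(close_paren)))


# A two-character string: each character is one delimiter.
print(paren_regex("()").findall("(a (b c)) d"))           # ['(', '(', ')', ')']
# A list of two strings: delimiters may be longer than one character.
print(paren_regex(["<<", ">>"]).findall("<<a <<b>> >> c"))  # ['<<', '<<', '>>', '>>']
```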


tokenize(self, text)

Tokenize the text into s-expressions. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark sexprs. In particular, no special processing is done to exclude parentheses that occur inside strings or that follow backslash characters.

If the given expression contains non-matching parentheses, the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, the tokenizer raises a ValueError. If strict is False, any unmatched close parentheses are listed as their own s-expressions, and the last partial sexpr with unmatched open parentheses is listed as its own sexpr:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']

  • text (string or iter(string)) - the string to be tokenized

Returns: an iterator over tokens (each of which is an s-expression)
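
The behavior described for tokenize can be approximated by a depth-counting scan over the delimiters. The following is a sketch under stated assumptions, not the library's implementation: the function name sexpr_tokenize is hypothetical, it accepts a single string only, it returns a list rather than an iterator, and the error messages are illustrative.

```python
import re


def sexpr_tokenize(text, parens="()", strict=True):
    """Split text into s-expressions (illustrative sketch, not NLTK's code)."""
    open_paren, close_paren = parens
    paren_re = re.compile("%s|%s" % (re.escape(open_paren), re.escape(close_paren)))
    result = []
    pos = 0    # start of the text that has not been emitted yet
    depth = 0  # current nesting depth
    for m in paren_re.finditer(text):
        if depth == 0:
            # Outside any sexpr: whitespace-separated words are tokens.
            result += text[pos:m.start()].split()
            pos = m.start()
        if m.group() == open_paren:
            depth += 1
        else:
            if strict and depth == 0:
                raise ValueError("unmatched close paren at char %d" % m.start())
            depth = max(0, depth - 1)
            if depth == 0:
                # Either a complete sexpr or a lone unmatched close paren.
                result.append(text[pos:m.end()])
                pos = m.end()
    if depth > 0:
        if strict:
            raise ValueError("unmatched open paren at char %d" % pos)
        # Non-strict mode: the trailing partial sexpr becomes one token.
        result.append(text[pos:])
    else:
        result += text[pos:].split()
    return result


print(sexpr_tokenize('(a b (c d)) e f (g)'))
# ['(a b (c d))', 'e', 'f', '(g)']
print(sexpr_tokenize('c) d) e (f (g', strict=False))
# ['c', ')', 'd', ')', 'e', '(f (g']
# Parentheses inside quotes are not treated specially:
print(sexpr_tokenize('(a ")") b', strict=False))
# ['(a ")', '"', ')', 'b']
```

The last call shows the consequence of treating every parenthesis as structural: the quoted ')' closes the sexpr early, and the real close parenthesis is then reported as unmatched.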
