Package nltk :: Package tokenize :: Module sexpr :: Class SExprTokenizer

Class SExprTokenizer

    object --+    
             |    
api.TokenizerI --+
                 |
                SExprTokenizer

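A tokenizer that divides strings into s-expressions. An s-expression can be either:

  • a parenthesized expression, including any nested parenthesized expressions, or
  • a sequence of non-whitespace, non-parenthesis characters.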

For example, the string '(a (b c)) d e (f)' consists of four s-expressions: '(a (b c))', 'd', 'e', and '(f)'.

Instance Methods
 
__init__(self, parens='()', strict=True)
Construct a new SExpr tokenizer.
 
tokenize(self, text)
Tokenize the text into s-expressions.

Inherited from api.TokenizerI: batch_tokenize

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Properties

Inherited from object: __class__

Method Details

__init__(self, parens='()', strict=True)
(Constructor)

Construct a new SExpr tokenizer. By default, the characters '(' and ')' are treated as open and close parentheses, but alternative strings may be specified.

Parameters:
  • parens (str or list) - A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string or a list of two strings.
  • strict - If true, then raise an exception when tokenizing an ill-formed sexpr.
Overrides: object.__init__
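
As an illustration of how a two-element parens argument might be turned into a delimiter matcher, here is a minimal sketch. The helper name build_paren_regexp is hypothetical and is not part of NLTK; the sketch only assumes that the two elements are taken as literal strings.

```python
import re


def build_paren_regexp(parens="()"):
    """Return a compiled pattern matching either delimiter.

    Hypothetical helper for illustration only; not NLTK's own code.
    """
    if len(parens) != 2:
        raise ValueError("parens must contain exactly two elements")
    open_paren, close_paren = parens[0], parens[1]
    # re.escape keeps characters such as '(' or '[' from being read as
    # regular-expression syntax.
    return re.compile(
        "%s|%s" % (re.escape(open_paren), re.escape(close_paren))
    )


# A two-character string and a list of two strings are handled alike,
# because both support indexing by position.
curly = build_paren_regexp("{}")
angle = build_paren_regexp(["<<", ">>"])
print(curly.findall("{a {b}} c"))
print(angle.findall("<<a>> b"))
```

Indexing rather than unpacking by type is what lets multi-character delimiters be passed as a list of two strings.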

tokenize(self, text)

Tokenize the text into s-expressions. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark sexprs. In particular, no special processing is done to exclude parentheses that occur inside strings, or following backslash characters.

If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, then tokenize raises a ValueError. If strict is False, then any unmatched close parentheses will be listed as their own s-expressions, and the last partial sexpr with unmatched open parentheses will be listed as its own sexpr:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
Parameters:
  • text (string or iter(string)) - the string to be tokenized
Returns:
An iterator over tokens (each of which is an s-expression)
Overrides: api.TokenizerI.tokenize
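
The behavior described for tokenize can be approximated by tracking nesting depth while scanning for delimiters. The following is a minimal standalone sketch, not NLTK's source: the function name tokenize_sexprs and the error messages are assumptions made for illustration, and it returns a list rather than an iterator.

```python
import re


def tokenize_sexprs(text, parens="()", strict=True):
    """Split text into s-expressions (illustrative sketch, not NLTK's code)."""
    open_paren, close_paren = parens[0], parens[1]
    paren_re = re.compile(
        "%s|%s" % (re.escape(open_paren), re.escape(close_paren))
    )
    result = []
    pos = 0    # start of the text that has not been emitted yet
    depth = 0  # current nesting depth
    for match in paren_re.finditer(text):
        if depth == 0:
            # Outside any parenthesized expression, tokens are separated
            # by whitespace.
            result.extend(text[pos:match.start()].split())
            pos = match.start()
        if match.group() == open_paren:
            depth += 1
        else:
            if strict and depth == 0:
                raise ValueError(
                    "Unmatched close paren at char %d" % match.start()
                )
            depth = max(0, depth - 1)
            if depth == 0:
                # A top-level expression (or a stray close paren) ends here.
                result.append(text[pos:match.end()])
                pos = match.end()
    if depth > 0:
        if strict:
            raise ValueError("Unmatched open paren at char %d" % pos)
        # Non-strict mode: keep the trailing partial sexpr as one token.
        result.append(text[pos:])
    else:
        result.extend(text[pos:].split())
    return result


print(tokenize_sexprs("(a b (c d)) e f (g)"))
# ['(a b (c d))', 'e', 'f', '(g)']
print(tokenize_sexprs("c) d) e (f (g", strict=False))
# ['c', ')', 'd', ')', 'e', '(f (g']
```

Because every delimiter match is counted, parentheses inside quoted strings or after backslashes affect the depth just like any other, which mirrors the limitation noted above.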