Module internals

Exception raised by parse_* functions when they fail.
A base class used to mark deprecated classes.
A counter that auto-increments each time its value is read.
A wrapper around ElementTree Element objects whose main purpose is to provide nicer __repr__ and __str__ methods.
Convert all grouping parentheses in the given regexp pattern to non-grouping parentheses, and return the result.
config_java(bin=None, options=None)
Configure nltk's java interface, by letting nltk know where it can find the java binary, and what extra options (if any) should be passed to java when it is run.
java(cmd, classpath=None, stdin=None, stdout=None, stderr=None, blocking=True)
Execute the given java command, by opening a subprocess that calls java.
parse_str(s, start_position)
If a Python string literal begins at the specified position in the given string, then return a tuple (val, end_position) containing the value of the string literal and the position where it ends.
parse_int(s, start_position)
If an integer begins at the specified position in the given string, then return a tuple (val, end_position) containing the value of the integer and the position where it ends.
parse_number(s, start_position)
If an integer or float begins at the specified position in the given string, then return a tuple (val, end_position) containing the value of the number and the position where it ends.
Returns: True if method overrides some method with the same name in a base class.
Return the method resolution order for cls -- i.e., a list containing cls and all its base classes, in the order in which they would be checked by getattr.
_add_epytext_field(obj, field, message)
Add an epytext @field to a given object's docstring.
A decorator used to mark functions as deprecated.
find_binary(name, path_to_bin=None, env_vars=(), searchpath=(), binary_names=None, url=None, verbose=True)
Search for the binary for a program that is used by nltk.
When python is run from within the nltk/ directory tree, the current directory is included at the beginning of the search path.
A decorator used to mark methods as abstract.
slice_bounds(sequence, slice_obj)
Given a slice, return the corresponding (start, stop) bounds, taking into account None indices and negative indices.
  _java_bin = None
  _java_options = []
  NLTK_JAR = '/Volumes/Data/nltk/trunk/nltk/nltk/nltk.jar'
The location of the NLTK jar file, which is used to communicate with external Java packages (such as Mallet) that do not have a sufficiently powerful native command-line interface.
  _STRING_START_RE = re.compile(r'[uU]?[rR]?("""|\'\'\'|"|\')')
  _PARSE_INT_RE = re.compile(r'-?\d+')
  _PARSE_NUMBER_VALUE = re.compile(r'-?(\d*)(\.?\d*)?')
convert_regexp_to_nongrouping(pattern)

Convert all grouping parentheses in the given regexp pattern to non-grouping parentheses, and return the result. E.g.:

>>> convert_regexp_to_nongrouping('ab(c(x+)(z*))?d')
'ab(?:c(?:x+)(?:z*))?d'
  • pattern (str)
Returns: str
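The conversion described above can be sketched with a single regular-expression substitution. This is a minimal illustration, not NLTK's implementation: the name to_nongrouping is hypothetical, and the sketch assumes the pattern contains no named groups, no back-references, and no parentheses inside character classes.

```python
import re

# Alternatives are tried left to right, so escaped characters and
# extension groups are consumed before a bare "(" can match.
_GROUP_RE = re.compile(r"""
    \\.     |   # an escaped character: leave untouched
    \(\?    |   # start of an extension group such as (?: or (?= : leave untouched
    \(          # a plain grouping parenthesis: rewrite
""", re.VERBOSE | re.DOTALL)


def to_nongrouping(pattern):
    """Return pattern with every plain '(' replaced by '(?:' (hypothetical helper)."""
    def fix(match):
        text = match.group()
        return "(?:" if text == "(" else text
    return _GROUP_RE.sub(fix, pattern)


print(to_nongrouping('ab(c(x+)(z*))?d'))  # ab(?:c(?:x+)(?:z*))?d
```

Matching the escaped and extension forms first is what keeps the substitution from rewriting parentheses that are not grouping parentheses.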

config_java(bin=None, options=None)

Configure nltk's java interface, by letting nltk know where it can find the java binary, and what extra options (if any) should be passed to java when it is run.

  • bin (string) - The full path to the java binary. If not specified, then nltk will search the system for a java binary; and if one is not found, it will raise a LookupError exception.
  • options (list of string) - A list of options that should be passed to the java binary when it is called. A common value is ['-Xmx512m'], which tells the java binary to increase the maximum heap size to 512 megabytes. If no options are specified, then do not modify the options list.

java(cmd, classpath=None, stdin=None, stdout=None, stderr=None, blocking=True)

Execute the given java command, by opening a subprocess that calls java. If java has not yet been configured, it will be configured by calling config_java() with no arguments.

  • cmd (list of string) - The java command that should be called, formatted as a list of strings. Typically, the first string will be the name of the java class; and the remaining strings will be arguments for that java class.
  • classpath (string) - A ':'-separated list of directories, JAR archives, and ZIP archives to search for class files.
  • stdin, stdout, stderr - Specify the executed program's standard input, standard output and standard error file handles, respectively. Valid values are subprocess.PIPE, an existing file descriptor (a positive integer), an existing file object, and None. subprocess.PIPE indicates that a new pipe to the child should be created. With None, no redirection will occur; the child's file handles will be inherited from the parent. Additionally, stderr can be subprocess.STDOUT, which indicates that the stderr data from the application should be captured into the same file handle as for stdout.
  • blocking - If false, then return immediately after spawning the subprocess. In this case, the return value is the Popen object, and not a (stdout, stderr) tuple.
Returns: If blocking=True, a tuple (stdout, stderr) containing the stdout and stderr outputs generated by the java command if the stdout and stderr parameters were set to subprocess.PIPE, or None otherwise. If blocking=False, a subprocess.Popen object.
Raises:
  • OSError - If the java command returns a nonzero return code.
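The blocking and non-blocking return conventions described above can be illustrated with a small wrapper around subprocess.Popen. This is a sketch under stated assumptions: run_command is a hypothetical name, the Python interpreter stands in for the java binary so that the example runs anywhere, and classpath handling is omitted.

```python
import subprocess
import sys


def run_command(cmd, stdin=None, stdout=None, stderr=None, blocking=True):
    """Hypothetical stand-in showing the return convention described for java()."""
    process = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr)
    if not blocking:
        # Non-blocking: hand the Popen object back to the caller immediately.
        return process
    # Blocking: wait for the child and collect whatever was piped.
    out, err = process.communicate()
    if process.returncode != 0:
        raise OSError("command failed: " + " ".join(cmd))
    return out, err


# Only stdout is piped here, so the second element of the tuple is None.
out, err = run_command([sys.executable, "-c", "print('hello')"],
                       stdout=subprocess.PIPE)
```

Streams that are not set to subprocess.PIPE are inherited from the parent, so their slot in the returned tuple is None.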

parse_str(s, start_position)

If a Python string literal begins at the specified position in the given string, then return a tuple (val, end_position) containing the value of the string literal and the position where it ends. Otherwise, raise a ParseError.

parse_int(s, start_position)

If an integer begins at the specified position in the given string, then return a tuple (val, end_position) containing the value of the integer and the position where it ends. Otherwise, raise a ParseError.
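The contract above can be sketched using the _PARSE_INT_RE pattern listed among the module variables. The ParseError class below is a local stand-in whose constructor is an assumption, and the function body is an illustration rather than NLTK's code.

```python
import re


class ParseError(ValueError):
    """Local stand-in for the module's ParseError (constructor is assumed)."""


_PARSE_INT_RE = re.compile(r'-?\d+')


def parse_int(s, start_position):
    """Return (value, end_position) for the integer starting at start_position."""
    # Pattern.match with a position argument anchors the match at that offset.
    match = _PARSE_INT_RE.match(s, start_position)
    if match is None:
        raise ParseError("expected an integer at position %d" % start_position)
    return int(match.group()), match.end()


print(parse_int("x = 42;", 4))  # (42, 6)
```

Returning the end position lets a caller chain parse calls through a longer string without slicing it.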

parse_number(s, start_position)

If an integer or float begins at the specified position in the given string, then return a tuple (val, end_position) containing the value of the number and the position where it ends. Otherwise, raise a ParseError.


overridden(method)

  • method (instance method)
Returns: True if method overrides some method with the same name in a base class. This is typically used when defining abstract base classes or interfaces, to allow subclasses to define either of two related methods:
>>> class EaterI:
...     '''Subclass must define eat() or batch_eat().'''
...     def eat(self, food):
...         if overridden(self.batch_eat):
...             return self.batch_eat([food])[0]
...         else:
...             raise NotImplementedError()
...     def batch_eat(self, foods):
...         return [self.eat(food) for food in foods]
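To make the pattern above concrete, here is a runnable sketch for Python 3. The body of overridden is an assumption written for illustration (it compares the bound method's function against definitions found in base classes); it is not taken from NLTK. The subclasses SingleEater and BatchEater are hypothetical.

```python
def overridden(method):
    """Sketch: True if the bound method's class chain redefines a base-class method."""
    cls = type(method.__self__)
    name = method.__name__
    func = method.__func__
    # Look for a definition of the same name, further up the MRO, that is
    # a different function from the one actually bound to the instance.
    for base in cls.__mro__[1:]:
        if name in vars(base) and vars(base)[name] is not func:
            return True
    return False


class EaterI:
    '''Subclass must define eat() or batch_eat().'''
    def eat(self, food):
        if overridden(self.batch_eat):
            return self.batch_eat([food])[0]
        raise NotImplementedError()

    def batch_eat(self, foods):
        return [self.eat(food) for food in foods]


class SingleEater(EaterI):       # defines only eat()
    def eat(self, food):
        return food.upper()


class BatchEater(EaterI):        # defines only batch_eat()
    def batch_eat(self, foods):
        return [food[::-1] for food in foods]


print(SingleEater().batch_eat(["ab", "cd"]))  # ['AB', 'CD']
print(BatchEater().eat("ab"))                 # ba
```

A subclass that defines neither method still raises NotImplementedError from eat(), because the check on batch_eat returns False.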


Return the method resolution order for cls -- i.e., a list containing cls and all its base classes, in the order in which they would be checked by getattr. For new-style classes, this is just cls.__mro__. For classic classes, this can be obtained by a depth-first left-to-right traversal of __bases__.
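The two orderings mentioned above can differ for diamond-shaped hierarchies. The sketch below contrasts cls.__mro__ with a depth-first, left-to-right traversal of __bases__; depth_first_order is a hypothetical helper, and because every Python 3 class is new-style, the traversal here also visits object.

```python
def depth_first_order(cls):
    """Depth-first, left-to-right traversal of __bases__, keeping first visits."""
    order = []

    def visit(c):
        if c not in order:
            order.append(c)
            for base in c.__bases__:
                visit(base)

    visit(cls)
    return order


class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass


mro_names = [c.__name__ for c in D.__mro__]
dfs_names = [c.__name__ for c in depth_first_order(D)]
print(mro_names)  # ['D', 'B', 'C', 'A', 'object']
print(dfs_names)  # ['D', 'B', 'A', 'object', 'C']
```

The depth-first order reaches A before C, which is why the two results disagree for this hierarchy.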


A decorator used to mark functions as deprecated. This will cause a warning to be printed when the function is used. Usage:

>>> @deprecated('Use foo() instead')
... def bar(x):
...     print(x/10)
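A decorator with this usage could be written roughly as follows. This is a sketch only: it assumes the warning is issued through the standard warnings module with the DeprecationWarning category, which the text above does not state, and the message format is made up for illustration.

```python
import functools
import warnings


def deprecated(message):
    """Sketch of a decorator factory that warns whenever the function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller, not the wrapper.
            warnings.warn("%s is deprecated. %s" % (func.__name__, message),
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated('Use foo() instead')
def bar(x):
    return x / 10
```

functools.wraps keeps the original function's name and docstring on the wrapper, so documentation tools still see bar rather than wrapper.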

find_binary(name, path_to_bin=None, env_vars=(), searchpath=(), binary_names=None, url=None, verbose=True)

Search for the binary for a program that is used by nltk.

  • name - The name of the program.
  • path_to_bin - The user-supplied binary location, or None.
  • env_vars - A list of environment variable names to check.
  • binary_names - A list of alternative binary names to check.
  • searchpath - A list of directories to search.
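One plausible way to combine these parameters is sketched below. The lookup order (user-supplied path, then environment variables, then search path), the LookupError on failure, and the name find_binary_sketch are all assumptions made for illustration; the url and verbose parameters are left out.

```python
import os


def find_binary_sketch(name, path_to_bin=None, env_vars=(), searchpath=(),
                       binary_names=None):
    """Hypothetical search: return the first existing candidate file."""
    names = list(binary_names or [name])
    candidates = []
    if path_to_bin is not None:
        candidates.append(path_to_bin)
    for var in env_vars:
        if var in os.environ:
            candidates.append(os.environ[var])
    candidates.extend(searchpath)

    for candidate in candidates:
        # A candidate may be the binary itself or a directory containing it.
        if os.path.isfile(candidate):
            return candidate
        for binary_name in names:
            full_path = os.path.join(candidate, binary_name)
            if os.path.isfile(full_path):
                return full_path
    raise LookupError("could not find the %s binary" % name)
```

Treating each candidate as either a file or a directory lets an environment variable point at an installation directory as well as at the executable.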


When python is run from within the nltk/ directory tree, the current directory is included at the beginning of the search path. Unfortunately, that means that modules within nltk can sometimes shadow standard library modules. As an example, the stdlib 'inspect' module will attempt to import the stdlib 'tokenize' module, but will end up importing NLTK's 'tokenize' module instead (causing the import to fail).


A decorator used to mark methods as abstract. I.e., methods that are marked by this decorator must be overridden by subclasses. If an abstract method is called (either in the base class or in a subclass that does not override the base class method), it will raise NotImplementedError.
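The behaviour described above can be sketched as a decorator that replaces the method body with one that always raises. The name abstract and the error message are assumptions for illustration, as are the Shape and Square classes.

```python
import functools


def abstract(func):
    """Sketch: any call that reaches the decorated method raises NotImplementedError."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        raise NotImplementedError("%s is an abstract method" % func.__name__)
    return wrapper


class Shape:
    @abstract
    def area(self):
        """Return the area of the shape."""


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


print(Square(3).area())  # 9
```

Unlike abc.abstractmethod from the standard library, this approach reports the problem when the method is called rather than when the class is instantiated.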

slice_bounds(sequence, slice_obj)

Given a slice, return the corresponding (start, stop) bounds, taking into account None indices and negative indices. The following guarantees are made for the returned start and stop values:

  • 0 <= start <= len(sequence)
  • 0 <= stop <= len(sequence)
  • start <= stop
Raises:
  • ValueError - If slice_obj.step is not None.
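The guarantees above can be met with the built-in slice.indices method plus one extra clamp. This is a sketch rather than NLTK's implementation; in particular, resolving start > stop by lowering start to stop is an assumption.

```python
def slice_bounds(sequence, slice_obj):
    """Return (start, stop) with 0 <= start <= stop <= len(sequence)."""
    if slice_obj.step is not None:
        raise ValueError("slices with steps are not supported")
    # slice.indices resolves None and negative indices and clips both
    # values to the range [0, len(sequence)].
    start, stop, _ = slice_obj.indices(len(sequence))
    # An "inverted" slice such as [4:2] is empty; make that explicit.
    start = min(start, stop)
    return start, stop


items = ["a", "b", "c", "d", "e"]
print(slice_bounds(items, slice(1, -1)))       # (1, 4)
print(slice_bounds(items, slice(None, None)))  # (0, 5)
print(slice_bounds(items, slice(4, 2)))        # (2, 2)
```

With these bounds, sequence[start:stop] selects the same elements as sequence[slice_obj].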