
Module data

Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects. Resource files are identified using URLs, such as "nltk:corpora/abc/rural.txt" or "". The following URL protocols are supported:

If no protocol is specified, then the default protocol "nltk:" will be used.

This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource and adds it to a resource cache, and retrieve() copies a given resource to a local file.

Classes
An abstract base class for 'path pointers,' used by NLTK's data package to identify specific paths.
A path pointer that identifies a file which can be accessed directly via a given absolute path.
A subclass of FileSystemPathPointer that identifies a gzip-compressed file located at a given absolute path.
A path pointer that identifies a file contained within a zipfile, which can be accessed by reading that zipfile.
A subclass of zipfile.ZipFile that closes its file pointer whenever it is not using it; and re-opens it when it needs to read data from the zipfile.
    Seekable Unicode Stream Reader
A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly.
Functions
Find the given resource from the NLTK data package, and return a corresponding path name.
retrieve(resource_url, filename=None, verbose=True)
Copy the given resource to a local file.
load(resource_url, format='auto', cache=True, verbose=False)
Load a given resource from the NLTK data package.
show_cfg(resource_url, escape='##')
Write out a grammar file, ignoring escaped and empty lines
Remove all objects from the resource cache.
Helper function that returns an open file object for a resource, given its resource URL.
Variables
  path = ['/users/sb/nltk/data', '/Users/sb/nltk_data', '/usr/sh...
A list of directories where the NLTK data package might reside.
  _resource_cache = <WeakValueDictionary at 8144776>
A weakref dictionary used to cache resources so that they won't need to be loaded more than once.
  FORMATS = {'cfg': 'A context free grammar, parsed by nltk.cfg....
A dictionary describing the formats that are supported by NLTK's load() method.
  AUTO_FORMATS = {'cfg': 'cfg', 'fcfg': 'fcfg', 'fol': 'fol', 'p...
A dictionary mapping from file extensions to format names, used by load() when format="auto" to decide the format for a given resource url.
  d = '/users/sb/nltk/data'
Function Details


Find the given resource from the NLTK data package, and return a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK data package.

  • resource_name (str) - The name of the resource to search for. Resource names are posix-style relative path names, such as 'corpora/brown'. In particular, directory names should always be separated by the '/' character, which will be automatically converted to a platform-appropriate path separator.
Returns: str
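
The lookup described above can be sketched with the standard library alone. This is a minimal illustration, not NLTK's implementation: the function name find_resource and its search_path argument are hypothetical, and the error message is only a placeholder for the pointer to the installation instructions.

```python
import os


def find_resource(resource_name, search_path):
    """Return the first existing path for a posix-style resource name.

    Hypothetical helper that mirrors the behaviour described in the text.
    """
    # Resource names always use '/', so split on it and let os.path.join
    # insert the platform-appropriate separator.
    parts = resource_name.split("/")
    for directory in search_path:
        candidate = os.path.join(directory, *parts)
        if os.path.exists(candidate):
            return candidate
    # Nothing matched in any directory: report what was searched.
    raise LookupError(
        "Resource %r not found. Searched in: %s"
        % (resource_name, ", ".join(search_path))
    )
```

Because the directories are tried in order, a copy of a resource placed in an earlier directory takes precedence over one in a later directory.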

retrieve(resource_url, filename=None, verbose=True)

Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named filename, then raise a ValueError.

  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
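
The filename rule and the overwrite check can be illustrated as follows. This is a sketch under stated assumptions: save_resource is a hypothetical helper that receives the resource contents as bytes instead of fetching them, and its directory argument exists only to make the example easy to try in a scratch directory.

```python
import os
import posixpath


def save_resource(resource_url, data, filename=None, directory="."):
    """Write ``data`` (bytes) to a local file named after the URL.

    Hypothetical helper; it does not fetch anything itself.
    """
    if filename is None:
        # Drop the protocol, if any, and keep the last path component.
        path = resource_url.split(":", 1)[-1]
        filename = posixpath.basename(path)
    target = os.path.join(directory, filename)
    # Refuse to overwrite an existing file, as described above.
    if os.path.exists(target):
        raise ValueError("File %r already exists" % target)
    with open(target, "wb") as out:
        out.write(data)
    return target
```

With the URL "nltk:corpora/abc/rural.txt" and no explicit filename, the sketch writes to rural.txt; a second call with the same arguments raises ValueError.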

load(resource_url, format='auto', cache=True, verbose=False)

Load a given resource from the NLTK data package. The following resource formats are currently supported:

  • 'pickle'
  • 'yaml'
  • 'cfg' (context free grammars)
  • 'pcfg' (probabilistic CFGs)
  • 'fcfg' (feature-based CFGs)
  • 'fol' (formulas of First Order Logic)
  • 'val' (valuation of First Order Logic model)
  • 'raw'

If no format is specified, load() will attempt to determine a format based on the resource name's file extension. If that fails, load() will raise a ValueError exception.

  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
  • cache (bool) - If true, add this resource to a cache. If load() finds a resource in its cache, it returns the cached object rather than loading it again. The cache uses weak references, so a resource will automatically be expunged from the cache when no more objects are using it.
  • verbose (bool) - If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.

show_cfg(resource_url, escape='##')

Write out a grammar file, ignoring escaped and empty lines.

  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
  • escape (str) - Prefix string that marks a line to be ignored.
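
The filtering step can be sketched as below. The helper name visible_grammar_lines is hypothetical, it works on text that has already been read, and it treats whitespace-only lines as empty, which is an assumption rather than something the description states.

```python
def visible_grammar_lines(text, escape="##"):
    """Return the lines that are neither escaped nor empty.

    Hypothetical helper illustrating the filtering described above.
    """
    kept = []
    for line in text.splitlines():
        # Lines starting with the escape prefix are skipped.
        if line.startswith(escape):
            continue
        # Blank lines (assumed to include whitespace-only lines) are skipped.
        if line.strip() == "":
            continue
        kept.append(line)
    return kept
```

For a grammar text containing a '##' comment line, two rules and a blank line between them, only the two rule lines are returned.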


Remove all objects from the resource cache.

See Also: load()


Helper function that returns an open file object for a resource, given its resource URL. If the resource URL uses the 'nltk' protocol, or uses no protocol, then find its path in the NLTK data package and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to urllib2.urlopen.

  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
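
The protocol dispatch can be sketched without opening anything. Both function names below are hypothetical, and describe_open only reports which branch would be taken. In Python 3 the counterpart of urllib2.urlopen is urllib.request.urlopen. The simple split on ':' is an assumption that would misread paths containing a drive letter.

```python
def split_resource_url(resource_url):
    """Split a resource URL into (protocol, path).

    Hypothetical helper; a URL without a protocol defaults to 'nltk'.
    """
    if ":" in resource_url:
        protocol, path = resource_url.split(":", 1)
    else:
        protocol, path = "nltk", resource_url
    return protocol, path


def describe_open(resource_url):
    """Report which of the three branches described above applies."""
    protocol, path = split_resource_url(resource_url)
    if protocol == "nltk":
        return "search the data package for %r" % path
    if protocol == "file":
        return "open the local file %r" % path
    # Any other protocol is handed to a general URL opener.
    return "delegate %r to a URL opener" % resource_url
```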

Variables Details


A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk/data).



A dictionary describing the formats that are supported by NLTK's load() method. Keys are format names, and values are format descriptions.

{'cfg': 'A context free grammar, parsed by nltk.cfg.parse_cfg().',
 'fcfg': 'A feature CFG, parsed by nltk.cfg.parse_fcfg().',
 'fol': 'A list of first order logic expressions, parsed by nltk.sem.p\
 'pcfg': 'A probabilistic CFG, parsed by nltk.cfg.parse_pcfg().',
 'pickle': 'A serialized python object, stored using the pickle module\
 'raw': 'The raw (byte string) contents of a file.',


A dictionary mapping from file extensions to format names, used by load() when format="auto" to decide the format for a given resource url.

{'cfg': 'cfg',
 'fcfg': 'fcfg',
 'fol': 'fol',
 'pcfg': 'pcfg',
 'pickle': 'pickle',
 'val': 'val',
 'yaml': 'yaml'}
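
How such a mapping can drive format detection is sketched below. The dictionary is copied from the listing above; the helper guess_format is hypothetical and only illustrates the extension lookup and the ValueError raised when no format can be determined.

```python
import posixpath

# Copied from the listing above.
AUTO_FORMATS = {
    "cfg": "cfg",
    "fcfg": "fcfg",
    "fol": "fol",
    "pcfg": "pcfg",
    "pickle": "pickle",
    "val": "val",
    "yaml": "yaml",
}


def guess_format(resource_url):
    """Map the file extension of a resource URL to a format name.

    Hypothetical helper; raises ValueError for unknown extensions.
    """
    extension = posixpath.splitext(resource_url)[1].lstrip(".")
    try:
        return AUTO_FORMATS[extension]
    except KeyError:
        raise ValueError(
            "Could not determine format for %r; specify it explicitly"
            % resource_url
        ) from None
```

Under this sketch a URL ending in .cfg maps to 'cfg', while "nltk:corpora/abc/rural.txt" raises ValueError because 'txt' is not a key in the mapping, so the caller would need to pass a format such as 'raw' explicitly.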