Loading Resources From the Data Package

 
>>> import nltk.data
>>> import os

1   Overview

The nltk.data module contains functions that can be used to load NLTK resource files, such as corpora, grammars, and saved processing objects.

2   Loading Data Files

Resources are loaded using the function nltk.data.load(), which takes as its first argument a URL specifying what file should be loaded. The nltk: protocol loads files from the NLTK data distribution:

 
>>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize('Hello.  This is a test.  It works!')
['Hello.', 'This is a test.', 'It works!']

The nltk: protocol is used by default if no protocol is specified:

 
>>> nltk.data.load('tokenizers/punkt/english.pickle') 
<nltk.tokenize.punkt.PunktSentenceTokenizer object at ...>

But it is also possible to load resources from http:, ftp:, and file: URLs:

 
>>> # Load a grammar from the NLTK webpage.
>>> cfg = nltk.data.load('http://nltk.org/examples/parse/toy.cfg')
>>> print cfg 
Grammar with 14 productions (start state = S)
    S -> NP VP
    PP -> P NP
    ...
    P -> 'on'
    P -> 'in'
 
>>> # Load a grammar using an absolute path.
>>> url = 'file:%s' % nltk.data.find('grammars/toy.cfg')
>>> url 
'file:/.../grammars/toy.cfg'
>>> print nltk.data.load(url) 
Grammar with 14 productions (start state = S)
    S -> NP VP
    PP -> P NP
    ...
    P -> 'on'
    P -> 'in'
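The protocol dispatch shown above can be sketched in plain Python. The helper below is illustrative only (the name split_resource_url and its exact behavior are assumptions, not NLTK's implementation), and it ignores complications such as Windows drive letters:

```python
def split_resource_url(resource_url):
    """Split a resource URL into (protocol, path).

    Hypothetical sketch of the dispatch nltk.data.load() performs:
    if no protocol prefix is present, default to the 'nltk' protocol.
    """
    if ':' not in resource_url:
        return 'nltk', resource_url
    # Split on the first colon only, so the path may itself contain colons.
    protocol, _, path = resource_url.partition(':')
    return protocol, path
```

With this sketch, 'grammars/toy.cfg' and 'nltk:grammars/toy.cfg' both resolve to the nltk protocol, while http:, ftp:, and file: URLs keep their explicit protocols.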

The second argument to the nltk.data.load() function specifies the file format, which determines how the file's contents are processed before they are returned by load(). The formats that are currently supported by the data module are described by the dictionary nltk.data.FORMATS:

 
>>> for format, descr in sorted(nltk.data.FORMATS.items()):
...     print '%-7s %s' % (format, descr) 
cfg     A context free grammar, parsed by nltk.cfg.parse_cfg().
fcfg    A feature CFG, parsed by nltk.cfg.parse_fcfg().
fol     A list of first order logic expressions, parsed by
          nltk.sem.parse_fol().
pcfg    A probabilistic CFG, parsed by nltk.cfg.parse_pcfg().
pickle  A serialized python object, stored using the pickle module.
raw     The raw (byte string) contents of a file.
val     A semantic valuation, parsed by nltk.sem.parse_valuation().
yaml    A serialized python object, stored using the yaml module.

nltk.data.load() will raise a ValueError if a bad format name is specified:

 
>>> nltk.data.load('grammars/toy.cfg', 'bar')
Traceback (most recent call last):
  ...
ValueError: Unknown format type!

By default, the "auto" format is used, which chooses a format based on the filename's extension. The mapping from file extensions to format names is specified by nltk.data.AUTO_FORMATS:

 
>>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()):
...     print '.%-7s -> %s' % (ext, format)
.cfg     -> cfg
.fcfg    -> fcfg
.fol     -> fol
.pcfg    -> pcfg
.pickle  -> pickle
.val     -> val
.yaml    -> yaml

If nltk.data.load() is unable to determine the format based on the filename's extension, it will raise a ValueError:

 
>>> nltk.data.load('foo.bar')
Traceback (most recent call last):
  ...
ValueError: Could not determine format for foo.bar based on its file
extension; use the "format" argument to specify the format explicitly.
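The extension-based lookup can be sketched as follows. The AUTO_FORMATS dictionary here simply copies the listing above, and guess_format is a hypothetical helper, not NLTK's actual code:

```python
import os

# Extension -> format mapping, copied from the AUTO_FORMATS listing above.
AUTO_FORMATS = {'cfg': 'cfg', 'fcfg': 'fcfg', 'fol': 'fol', 'pcfg': 'pcfg',
                'pickle': 'pickle', 'val': 'val', 'yaml': 'yaml'}

def guess_format(resource_name):
    """Pick a format from the file extension, as format="auto" does."""
    ext = os.path.splitext(resource_name)[1].lstrip('.')
    if ext not in AUTO_FORMATS:
        raise ValueError('Could not determine format for %s based on its '
                         'file extension; use the "format" argument to '
                         'specify the format explicitly.' % resource_name)
    return AUTO_FORMATS[ext]
```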

Note that by explicitly specifying the format argument, you can override the load method's default processing behavior. For example, to get the raw contents of any file, simply use format="raw":

 
>>> nltk.data.load('grammars/toy.cfg', 'raw') 
"S -> NP VP\nPP -> P NP\nNP -> Det N | NP PP\nVP -> V NP | VP PP\n..."

3   Making Local Copies

The function nltk.data.retrieve() copies a given resource to a local file. This can be useful, for example, if you want to edit one of the sample grammars.

 
>>> nltk.data.retrieve('grammars/toy.cfg')
Retrieving 'grammars/toy.cfg', saving to 'toy.cfg'
 
>>> # Simulate editing the grammar.
>>> s = open('toy.cfg').read().replace('NP', 'DP')
>>> out = open('toy.cfg', 'w'); out.write(s); out.close()
 
>>> # Load the edited grammar, and display it.
>>> cfg = nltk.data.load('file:toy.cfg')
>>> print cfg 
Grammar with 14 productions (start state = S)
    S -> DP VP
    PP -> P DP
    ...
    P -> 'on'
    P -> 'in'

The second argument to nltk.data.retrieve() specifies the filename for the new copy of the file. By default, the source file's filename is used.

 
>>> nltk.data.retrieve('grammars/toy.cfg', 'mytoy.cfg')
Retrieving 'grammars/toy.cfg', saving to 'mytoy.cfg'
>>> os.path.isfile('./mytoy.cfg')
True
>>> nltk.data.retrieve('grammars/np.fcfg')
Retrieving 'grammars/np.fcfg', saving to 'np.fcfg'
>>> os.path.isfile('./np.fcfg')
True

If a file with the specified (or default) filename already exists in the current directory, then nltk.data.retrieve() will raise a ValueError exception. It will not overwrite the file:

 
>>> os.path.isfile('./toy.cfg')
True
>>> nltk.data.retrieve('grammars/toy.cfg') 
Traceback (most recent call last):
  ...
ValueError: File '.../toy.cfg' already exists!

4   Finding Files in the NLTK Data Package

The nltk.data.find() function searches the NLTK data package for a given file, and returns a pointer to that file. This pointer can either be a FileSystemPathPointer (whose path attribute gives the absolute path of the file); or a ZipFilePathPointer, specifying a zipfile and the name of an entry within that zipfile. Both pointer types define the open() method, which can be used to read the string contents of the file.

 
>>> path = nltk.data.find('corpora/abc/rural.txt')
>>> path 
FileSystemPathPointer('.../corpora/abc/rural.txt')
>>> path.open().read(60)
'PM denies knowledge of AWB kickbacks\nThe Prime Minister has '

Alternatively, the nltk.data.load() function can be used with the keyword argument format="raw":

 
>>> nltk.data.load('corpora/abc/rural.txt', format='raw')[:60]
'PM denies knowledge of AWB kickbacks\nThe Prime Minister has '
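The two pointer types can be sketched as minimal classes. These are simplified stand-ins for illustration only, not NLTK's actual implementations:

```python
import io
import os
import zipfile

class FileSystemPathPointer:
    """Points at an ordinary file on disk; `path` is its absolute location."""
    def __init__(self, path):
        self.path = os.path.abspath(path)

    def open(self):
        # Return a readable stream over the file's raw bytes.
        return open(self.path, 'rb')

class ZipFilePathPointer:
    """Points at a named entry inside a zipfile."""
    def __init__(self, zipfile_path, entry):
        self.zipfile_path = zipfile_path
        self.entry = entry

    def open(self):
        # Read the entry's bytes and wrap them in a file-like object.
        with zipfile.ZipFile(self.zipfile_path) as zf:
            return io.BytesIO(zf.read(self.entry))
```

Either pointer's open() yields a stream whose read() returns the file's contents, which is all that a caller such as the example above relies on.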

5   Resource Caching

NLTK uses a weakref dictionary to maintain a cache of resources that have been loaded. If you load a resource that is already stored in the cache, then the cached copy is returned. This behavior can be seen in the trace output generated when verbose=True:

 
>>> feat0 = nltk.data.load('grammars/feat0.fcfg', verbose=True)
<<Loading grammars/feat0.fcfg>>
>>> feat0 = nltk.data.load('grammars/feat0.fcfg', verbose=True)
<<Using cached copy of grammars/feat0.fcfg>>

If you wish to load a resource from its source, bypassing the cache, use the cache=False argument to nltk.data.load(). This can be useful, for example, if the resource is loaded from a local file, and you are actively editing that file:

 
>>> feat0 = nltk.data.load('grammars/feat0.fcfg', cache=False, verbose=True)
<<Loading grammars/feat0.fcfg>>

The cache uses weak references, so a resource will automatically be expunged from the cache when no more objects are using it. In the following example, when we clear the variable feat0, the reference count for the feature grammar object drops to zero. This causes it to be garbage collected and removed from the cache. Thus, when we load it again, it is no longer in the cache, so nltk.data.load() must load it from the source file.

 
>>> del feat0
>>> feat0 = nltk.data.load('grammars/feat0.fcfg', verbose=True)
<<Loading grammars/feat0.fcfg>>

You can clear the entire contents of the cache, using nltk.data.clear_cache():

 
>>> nltk.data.clear_cache()

6   Retrieving Other Data Sources

Some formats yield a collection of objects rather than a single value; for example, the fol format returns a list of first order logic expressions:
 
>>> formulas = nltk.data.load('grammars/background1.fol')
>>> for f in formulas: print str(f)
all x.(boxerdog(x) -> dog(x))
all x.(boxer(x) -> person(x))
all x.-(dog(x) & person(x))
all x.(married(x) <-> exists y.marry(x,y))
all x.(bark(x) -> dog(x))
all x.all y.(marry(x,y) -> (person(x) & person(y)))
-(Vincent = Mia)
-(Vincent = Fido)
-(Mia = Fido)

7   Regression Tests

Create a temp dir for tests that write files:

 
>>> import tempfile, os
>>> tempdir = tempfile.mkdtemp()
>>> old_dir = os.path.abspath('.')
>>> os.chdir(tempdir)

The retrieve() function accepts all URL types:

 
>>> urls = ['http://nltk.org/examples/parse/toy.cfg',
...         'file:%s' % nltk.data.find('grammars/toy.cfg'),
...         'nltk:grammars/toy.cfg',
...         'grammars/toy.cfg']
>>> for i, url in enumerate(urls):
...     nltk.data.retrieve(url, 'toy-%d.cfg' % i) 
Retrieving 'http://nltk.org/examples/parse/toy.cfg', saving to 'toy-0.cfg'
Retrieving 'file:/.../data/grammars/toy.cfg', saving to 'toy-1.cfg'
Retrieving 'nltk:grammars/toy.cfg', saving to 'toy-2.cfg'
Retrieving 'grammars/toy.cfg', saving to 'toy-3.cfg'

Clean up the temp dir:

 
>>> os.chdir(old_dir)
>>> for f in os.listdir(tempdir):
...     os.remove(os.path.join(tempdir, f))
>>> os.rmdir(tempdir)

Note

load(..., format='yaml') is not currently tested, because we do not have any yaml files in our data distribution yet. We should probably have a trained Brill tagger there.

7.1   Lazy Loader

A lazy loader is a wrapper object that defers loading a resource until it is accessed or used in any way. This is mainly intended for internal use by NLTK's corpus readers.

 
>>> # Create a lazy loader for toy.cfg.
>>> ll = nltk.data.LazyLoader('grammars/toy.cfg')
 
>>> # Show that it's not loaded yet:
>>> object.__repr__(ll) 
'<nltk.data.LazyLoader object at ...>'
 
>>> # printing it is enough to cause it to be loaded:
>>> print ll
<Grammar with 14 productions>
 
>>> # Show that it's now been loaded:
>>> object.__repr__(ll) 
'<nltk.cfg.Grammar object at ...>'
 
>>> # Test that accessing an attribute also loads it:
>>> ll = nltk.data.LazyLoader('grammars/toy.cfg')
>>> ll.start()
<S>
>>> object.__repr__(ll) 
'<nltk.cfg.Grammar object at ...>'
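The same deferred-loading trick can be sketched without NLTK. The class below takes any zero-argument callable in place of a resource URL; it is an illustrative approximation, not nltk.data.LazyLoader itself:

```python
class LazyLoader:
    """Defer loading until the wrapped object is first accessed.

    On first attribute access (or repr), the wrapper loads the real
    object and then *becomes* it by adopting its class and state, so
    every later access goes straight to the loaded resource.
    """
    def __init__(self, loader):
        # Store via __dict__ directly so we never trip __getattr__.
        self.__dict__['_loader'] = loader

    def _load(self):
        resource = self.__dict__['_loader']()
        # Replace this wrapper's state and class with the real object's.
        self.__dict__ = resource.__dict__
        self.__class__ = resource.__class__

    def __getattr__(self, attr):
        # Only called for attributes the wrapper itself lacks.
        self._load()
        return getattr(self, attr)

    def __repr__(self):
        self._load()
        # After _load(), repr dispatches to the real object's class.
        return repr(self)
```

As in the doctest above, object.__repr__() reveals whether the wrapper has been replaced yet, while an ordinary repr() or attribute access forces the load.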