nltk Package¶
nltk
Package¶
The Natural Language Toolkit (NLTK) is an open source Python library for Natural Language Processing. A free online book is available. (If you use the library for academic research, please cite the book.)
Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. http://nltk.org/book
@version: 3.2.4
collocations
Module¶
Tools to identify collocations — words that often appear consecutively — within corpora. They may also be used to find other associations between word occurrences. See Manning and Schutze ch. 5 at http://nlp.stanford.edu/fsnlp/promo/colloc.pdf and the Text::NSP Perl package at http://ngram.sourceforge.net
Finding collocations requires first calculating the frequencies of words and their appearance in the context of other words. Often the collection of words will then require filtering to retain only useful content terms. Each ngram of words may then be scored according to some association measure, in order to determine the relative likelihood of each ngram being a collocation.
The BigramCollocationFinder and TrigramCollocationFinder classes provide this functionality, provided they are given a function that scores an ngram from the appropriate frequency counts. A number of standard association measures are provided in bigram_measures and trigram_measures.
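The count, filter and score workflow described above can be sketched as follows. This is a minimal illustration, not a recommended configuration: the token list and the frequency threshold of 2 are assumptions made for the example, and PMI is only one of the available association measures.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# A tiny, made-up token sequence (an assumption for illustration only).
tokens = ("new york is big . new york is busy . "
          "i like new york").split()

# Step 1: count words and bigrams.
finder = BigramCollocationFinder.from_words(tokens)

# Step 2: drop bigrams that occur fewer than twice.
finder.apply_freq_filter(2)

# Step 3: rank the surviving bigrams with an association measure (PMI here).
bigram_measures = BigramAssocMeasures()
top_bigrams = finder.nbest(bigram_measures.pmi, 5)
print(top_bigrams)
```

On real corpora the frequency filter usually matters more than it does here, because PMI tends to favour very rare pairs.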
-
class
nltk.collocations.
BigramCollocationFinder
(word_fd, bigram_fd, window_size=2)[source]¶ Bases:
nltk.collocations.AbstractCollocationFinder
A tool for the finding and ranking of bigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.
-
default_ws
= 2¶
-
-
class
nltk.collocations.
TrigramCollocationFinder
(word_fd, bigram_fd, wildcard_fd, trigram_fd)[source]¶ Bases:
nltk.collocations.AbstractCollocationFinder
A tool for the finding and ranking of trigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.
-
bigram_finder
()[source]¶ Constructs a bigram collocation finder with the bigram and unigram data from this finder. Note that this does not include any filtering applied to this finder.
-
default_ws
= 3¶
-
-
class
nltk.collocations.
QuadgramCollocationFinder
(word_fd, quadgram_fd, ii, iii, ixi, ixxi, iixi, ixii)[source]¶ Bases:
nltk.collocations.AbstractCollocationFinder
A tool for the finding and ranking of quadgram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.
-
default_ws
= 4¶
-
data
Module¶
Functions to find and load NLTK resource files, such as corpora,
grammars, and saved processing objects. Resource files are identified
using URLs, such as nltk:corpora/abc/rural.txt
or
http://nltk.org/sample/toy.cfg
. The following URL protocols are
supported:
file:path
: Specifies the file whose path is path. Both relative and absolute paths may be used.
http://host/path
: Specifies the file stored on the web server host at path path.
nltk:path
: Specifies the file stored in the NLTK data package at path. NLTK will search for these files in the directories specified by nltk.data.path.
If no protocol is specified, then the default protocol nltk:
will
be used.
This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource, and adds it to a resource cache; and retrieve() copies a given resource to a local file.
-
nltk.data.
path
= ['/Users/sb/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']¶ A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).
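As a rough sketch of how the search path can be extended, the example below creates a temporary data directory, puts it at the front of nltk.data.path, and looks up a resource in it. The directory layout and the file name are hypothetical and exist only for this example.

```python
import os
import tempfile

import nltk.data

# Create a throwaway data directory containing one resource file.
# The layout "corpora/demo/hello.txt" is hypothetical.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, "corpora", "demo"))
with open(os.path.join(data_dir, "corpora", "demo", "hello.txt"), "w",
          encoding="utf-8") as handle:
    handle.write("hello")

# Directories earlier in the list are checked first.
nltk.data.path.insert(0, data_dir)

pointer = nltk.data.find("corpora/demo/hello.txt")
print(pointer.path)

# A resource that is in none of the listed directories raises LookupError.
try:
    nltk.data.find("corpora/demo/missing.txt")
    found_missing = True
except LookupError:
    found_missing = False
print(found_missing)
```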
-
class
nltk.data.
PathPointer
[source]¶ Bases:
object
An abstract base class for ‘path pointers,’ used by NLTK’s data package to identify specific paths. Two subclasses exist:
- FileSystemPathPointer identifies a file that can be accessed directly via a given absolute path.
- ZipFilePathPointer identifies a file contained within a zipfile, that can be accessed by reading that zipfile.
-
file_size
()[source]¶ Return the size of the file pointed to by this path pointer, in bytes.
Raises: IOError – If the path specified by this pointer does not contain a readable file.
-
-
class
nltk.data.
FileSystemPathPointer
(_path)[source]¶ Bases:
nltk.data.PathPointer
,str
A path pointer that identifies a file which can be accessed directly via a given absolute path.
-
path
¶ The absolute path identified by this path pointer.
-
-
class
nltk.data.
BufferedGzipFile
(filename=None, mode=None, compresslevel=9, fileobj=None, **kwargs)[source]¶ Bases:
gzip.GzipFile
A GzipFile subclass that buffers calls to read() and write(). This allows faster reads and writes of data to and from gzip-compressed files at the cost of using more memory.
The default buffer size is 2MB.
BufferedGzipFile is useful for loading large gzipped pickle objects as well as writing large encoded feature files for classifier training.
-
MB
= 1048576¶
-
SIZE
= 2097152¶
-
-
class
nltk.data.
GzipFileSystemPathPointer
(_path)[source]¶ Bases:
nltk.data.FileSystemPathPointer
A subclass of FileSystemPathPointer that identifies a gzip-compressed file located at a given absolute path.
GzipFileSystemPathPointer is appropriate for loading large gzip-compressed pickle objects efficiently.
-
open
(encoding=None)[source]
-
-
nltk.data.
find
(resource_name, paths=None)[source]¶ Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK downloader.
Zip File Handling:
- If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.
- If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.
- If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.
- When using find() to locate a directory contained in a zipfile, the resource name must end with the forward slash character. Otherwise, find() will not locate the directory.
Parameters: resource_name (str or unicode) – The name of the resource to search for. Resource names are posix-style relative path names, such as corpora/brown. Directory names will be automatically converted to a platform-appropriate path separator.
Return type: str
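The "replace each component p with p.zip/p" rule can be illustrated with a few lines of plain Python. The helper zip_candidates below is hypothetical and is not part of NLTK; it only enumerates the alternative names that such a second attempt could try, without touching the file system.

```python
def zip_candidates(resource_name):
    """Yield alternative names in which one path component p becomes p.zip/p.

    Hypothetical helper for illustration; it is not NLTK's implementation.
    """
    pieces = resource_name.split("/")
    for i, piece in enumerate(pieces):
        yield "/".join(pieces[:i] + [piece + ".zip"] + pieces[i:])


candidates = list(zip_candidates("corpora/chat80/cities.pl"))
for name in candidates:
    print(name)
```

The second candidate, corpora/chat80.zip/chat80/cities.pl, is the mapping mentioned in the text above.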
-
nltk.data.
retrieve
(resource_url, filename=None, verbose=True)[source]¶ Copy the given resource to a local file. If no filename is specified, then use the URL’s filename. If there is already a file named filename, then raise a ValueError.
Parameters: resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
-
nltk.data.
FORMATS
= {'fcfg': 'A feature CFG.', 'pickle': 'A serialized python object, stored using the pickle module.', 'json': 'A serialized python object, stored using the json module.', 'cfg': 'A context free grammar.', 'val': 'A semantic valuation, parsed by nltk.sem.Valuation.fromstring.', 'logic': 'A list of first order logic expressions, parsed with nltk.sem.logic.LogicParser. Requires an additional logic_parser parameter', 'yaml': 'A serialized python object, stored using the yaml module.', 'fol': 'A list of first order logic expressions, parsed with nltk.sem.logic.Expression.fromstring.', 'text': 'The raw (unicode string) contents of a file. ', 'raw': 'The raw (byte string) contents of a file.', 'pcfg': 'A probabilistic CFG.'}¶ A dictionary describing the formats that are supported by NLTK’s load() method. Keys are format names, and values are format descriptions.
-
nltk.data.
AUTO_FORMATS
= {'fcfg': 'fcfg', 'pickle': 'pickle', 'txt': 'text', 'json': 'json', 'cfg': 'cfg', 'val': 'val', 'logic': 'logic', 'yaml': 'yaml', 'fol': 'fol', 'text': 'text', 'pcfg': 'pcfg'}¶ A dictionary mapping from file extensions to format names, used by load() when format=”auto” to decide the format for a given resource url.
-
nltk.data.
load
(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)[source]¶ Load a given resource from the NLTK data package. The following resource formats are currently supported:
- pickle
- json
- yaml
- cfg (context free grammars)
- pcfg (probabilistic CFGs)
- fcfg (feature-based CFGs)
- fol (formulas of First Order Logic)
- logic (Logical formulas to be parsed by the given logic_parser)
- val (valuation of First Order Logic model)
- text (the file contents as a unicode string)
- raw (the raw file contents as a byte string)
If no format is specified, load() will attempt to determine a format based on the resource name’s file extension. If that fails, load() will raise a ValueError exception.
For all text formats (everything except pickle, json, yaml and raw), it tries to decode the raw contents using UTF-8, and if that doesn’t work, it tries with ISO-8859-1 (Latin-1), unless the encoding is specified.
Parameters: - resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
- cache (bool) – If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it. The cache uses weak references, so a resource will automatically be expunged from the cache when no more objects are using it.
- verbose (bool) – If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.
- logic_parser (LogicParser) – The parser that will be used to parse logical expressions.
- fstruct_reader (FeatStructReader) – The parser that will be used to parse the feature structure of an fcfg.
- encoding (str) – the encoding of the input; only used for text formats.
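The effect of the format and encoding parameters can be seen with a small text file. The sketch below is illustrative only: the temporary directory and the resource name misc/note.txt are hypothetical, and caching is switched off so that each call actually reads the file.

```python
import os
import tempfile

import nltk.data

# Hypothetical resource: a small UTF-8 text file in a temporary data directory.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, "misc"))
with open(os.path.join(data_dir, "misc", "note.txt"), "wb") as handle:
    handle.write("caf\u00e9".encode("utf-8"))
nltk.data.path.insert(0, data_dir)

# format="auto" (the default) maps the .txt extension to the "text" format.
as_text = nltk.data.load("misc/note.txt", cache=False)

# Requesting "raw" skips decoding and returns the bytes unchanged.
as_bytes = nltk.data.load("misc/note.txt", format="raw", cache=False)

# An explicit encoding replaces the UTF-8 / Latin-1 fallback.
as_latin1 = nltk.data.load("misc/note.txt", format="text",
                           encoding="latin-1", cache=False)

print(repr(as_text), repr(as_bytes), repr(as_latin1))
```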
-
nltk.data.
show_cfg
(resource_url, escape='##')[source]¶ Write out a grammar file, ignoring escaped and empty lines.
Parameters:
-
class
nltk.data.
OpenOnDemandZipFile
(filename)[source]¶ Bases:
zipfile.ZipFile
A subclass of zipfile.ZipFile that closes its file pointer whenever it is not using it; and re-opens it when it needs to read data from the zipfile. This is useful for reducing the number of open file handles when many zip files are being accessed at once.
OpenOnDemandZipFile must be constructed from a filename, not a file-like object (to allow re-opening).
OpenOnDemandZipFile is read-only (i.e. write() and writestr() are disabled).
-
class
nltk.data.
SeekableUnicodeStreamReader
(stream, encoding, errors='strict')[source]¶ Bases:
object
A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader); but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides broken seek() and tell() methods.
This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.
Note: this class requires stateless decoders. To my knowledge, this shouldn’t cause a problem with any of python’s builtin unicode encodings.
-
DEBUG
= True¶
-
bytebuffer
= None¶ A buffer used to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.
-
closed
¶ True if the underlying stream is closed.
-
decode
= None¶ The function that is used to decode byte strings into unicode strings.
-
encoding
= None¶ The name of the encoding that should be used to encode the underlying stream.
-
errors
= None¶ The error mode that should be used when decoding data from the underlying stream. Can be ‘strict’, ‘ignore’, or ‘replace’.
-
linebuffer
= None¶ A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes the tell() operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.
-
mode
¶ The mode of the underlying stream.
-
name
¶ The name of the underlying stream.
-
read
(size=None)[source]¶ Read up to size bytes, decode them using this reader’s encoding, and return the resulting unicode string.
Parameters: size (int) – The maximum number of bytes to read. If not specified, then read as many bytes as possible.
Return type: unicode
-
readline
(size=None)[source]¶ Read a line of text, decode it using this reader’s encoding, and return the resulting unicode string.
Parameters: size (int) – The maximum number of bytes to read. If no newline is encountered before size bytes have been read, then the returned value may not be a complete line of text.
-
readlines
(sizehint=None, keepends=True)[source]¶ Read this file’s contents, decode them using this reader’s encoding, and return it as a list of unicode lines.
Return type: list(unicode)
Parameters: - sizehint – Ignored.
- keepends – If false, then strip newlines.
-
seek
(offset, whence=0)[source]¶ Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.
Parameters: - offset – A byte count offset.
- whence – If 0, then the offset is from the start of the file (offset should be positive), if 1, then the offset is from the current position (offset may be positive or negative); and if 2, then the offset is from the end of the file (offset should typically be negative).
-
stream
= None¶ The underlying stream.
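A short sketch of the byte-accurate tell() and seek() behaviour described for this class, using an in-memory byte stream. The sample text is an assumption chosen so that the first line contains a multi-byte UTF-8 character, which makes character and byte offsets differ.

```python
from io import BytesIO

from nltk.data import SeekableUnicodeStreamReader

# Two lines, each containing one two-byte UTF-8 character
# ("na\u00efve" and "caf\u00e9"); sample data for illustration only.
data = "na\u00efve\ncaf\u00e9\n".encode("utf-8")
reader = SeekableUnicodeStreamReader(BytesIO(data), "utf-8")

first = reader.readline()
# tell() reports a position in the underlying byte stream, not a
# character count, even though the reader has buffered ahead.
position = reader.tell()
second = reader.readline()

# Seeking back to the remembered byte offset replays the second line.
reader.seek(position)
again = reader.readline()
print(repr(first), position, repr(again))
```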
-
downloader
Module¶
The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.
Downloading Packages¶
If called with no arguments, download()
will display an interactive
interface which can be used to download and install new packages.
If Tkinter is available, then a graphical interface will be shown,
otherwise a simple text interface will be provided.
Individual packages can be downloaded by calling the download()
function with a single argument, giving the package identifier for the
package that should be downloaded:
>>> download('treebank')
[nltk_data] Downloading package 'treebank'...
[nltk_data] Unzipping corpora/treebank.zip.
NLTK also provides a number of “package collections”, consisting of a group of related packages. To download all packages in a collection, simply call download() with the collection’s identifier:
>>> download('all-corpora')
[nltk_data] Downloading package 'abc'...
[nltk_data] Unzipping corpora/abc.zip.
[nltk_data] Downloading package 'alpino'...
[nltk_data] Unzipping corpora/alpino.zip.
...
[nltk_data] Downloading package 'words'...
[nltk_data] Unzipping corpora/words.zip.
Download Directory¶
By default, packages are installed in either a system-wide directory
(if Python has sufficient access to write to it); or in the current
user’s home directory. However, the download_dir
argument may be
used to specify a different installation target, if desired.
See Downloader.default_download_dir() for a more detailed description of how the default download directory is chosen.
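One common pattern is to download a package only when its data cannot already be found. The helper below is a hypothetical sketch, not part of NLTK: ensure_resource and its arguments are names invented for this example, and the download step needs network access to the NLTK download server.

```python
import nltk
import nltk.data


def ensure_resource(resource_name, package_id, download_dir=None):
    """Download package_id only if resource_name cannot be found locally.

    Hypothetical helper (not part of NLTK). Returns True if the resource
    is available afterwards, False otherwise.
    """
    # Make sure a custom download directory is also searched by find().
    if download_dir is not None and download_dir not in nltk.data.path:
        nltk.data.path.append(download_dir)
    try:
        nltk.data.find(resource_name)
        return True
    except LookupError:
        pass
    # Contacts the NLTK download server; quiet=True suppresses progress output.
    nltk.download(package_id, download_dir=download_dir, quiet=True)
    try:
        nltk.data.find(resource_name)
        return True
    except LookupError:
        return False


# Example call (requires network access if the corpus is not installed):
# ensure_resource("corpora/treebank", "treebank")
```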
NLTK Download Server¶
Before downloading any packages, the corpus and module downloader
contacts the NLTK download server, to retrieve an index file
describing the available packages. By default, this index file is
loaded from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
.
If necessary, it is possible to create a new Downloader
object,
specifying a different URL for the package index file.
Usage:
python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
or:
python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
-
class
nltk.downloader.
Collection
(id, children, name=None, **kw)[source]¶ Bases:
object
A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by Downloader.
-
children
= None¶ A list of the Collections or Packages directly contained by this collection.
-
id
= None¶ A unique identifier for this collection.
-
name
= None¶ A string name for this collection.
-
packages
= None¶ A list of
Packages
contained by this collection or any collections it recursively contains.
-
unicode_repr
()¶
-
-
class
nltk.downloader.
Downloader
(server_index_url=None, download_dir=None)[source]¶ Bases:
object
A class used to access the NLTK data server, which can be used to download corpora and other data packages.
-
DEFAULT_URL
= 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml'¶ The default URL for the NLTK data server’s index. An alternative URL can be specified when creating a new
Downloader
object.
-
INDEX_TIMEOUT
= 3600¶ The amount of time after which the cached copy of the data server index will be considered ‘stale,’ and will be re-downloaded.
-
INSTALLED
= 'installed'¶ A status string indicating that a package or collection is installed and up-to-date.
-
NOT_INSTALLED
= 'not installed'¶ A status string indicating that a package or collection is not installed.
-
PARTIAL
= 'partial'¶ A status string indicating that a collection is partially installed (i.e., only some of its packages are installed.)
-
STALE
= 'out of date'¶ A status string indicating that a package or collection is corrupt or out-of-date.
-
default_download_dir
()[source]¶ Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().
On Windows, the default download directory is PYTHONHOME/lib/nltk, where PYTHONHOME is the directory containing Python, e.g. C:\Python25.
On all other platforms, the default directory is the first of the following which exists or which can be created with write permission: /usr/share/nltk_data, /usr/local/share/nltk_data, /usr/lib/nltk_data, /usr/local/lib/nltk_data, ~/nltk_data.
-
download
(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False)[source]¶
-
download_dir
¶ The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().
-
index
()[source]¶ Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.
-
list
(download_dir=None, show_packages=True, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]¶
-
status
(info_or_id, download_dir=None)[source]¶ Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.
-
url
¶ The URL for the data server’s index file.
-
-
class
nltk.downloader.
DownloaderGUI
(dataserver, use_threads=True)[source]¶ Bases:
object
Graphical interface for downloading packages from the NLTK data server.
-
COLUMNS
= ['', 'Identifier', 'Name', 'Size', 'Status', 'Unzipped Size', 'Copyright', 'Contact', 'License', 'Author', 'Subdir', 'Checksum']¶ A list of the names of columns. This controls the order in which the columns will appear. If this is edited, then
_package_to_columns()
may need to be edited to match.
-
COLUMN_WEIGHTS
= {'': 0, 'Size': 0, 'Status': 0, 'Name': 5}¶ A dictionary specifying how columns should be resized when the table is resized. Columns with weight 0 will not be resized at all; and columns with high weight will be resized more. Default weight (for columns not explicitly listed) is 1.
-
COLUMN_WIDTHS
= {'': 1, 'Size': 10, 'Identifier': 20, 'Status': 12, 'Unzipped Size': 10, 'Name': 45}¶ A dictionary specifying how wide each column should be, in characters. The default width (for columns not explicitly listed) is specified by
DEFAULT_COLUMN_WIDTH
.
-
DEFAULT_COLUMN_WIDTH
= 30¶ The default width for columns that are not explicitly listed in
COLUMN_WIDTHS
.
-
HELP
= 'This tool can be used to download a variety of corpora and models\nthat can be used with NLTK. Each corpus or model is distributed\nin a single zip file, known as a "package file." You can\ndownload packages individually, or you can download pre-defined\ncollections of packages.\n\nWhen you download a package, it will be saved to the "download\ndirectory." A default download directory is chosen when you run\n\nthe downloader; but you may also select a different download\ndirectory. On Windows, the default download directory is\n\n\n"package."\n\nThe NLTK downloader can be used to download a variety of corpora,\nmodels, and other data packages.\n\nKeyboard shortcuts::\n [return]\t Download\n [up]\t Select previous package\n [down]\t Select next package\n [left]\t Select previous tab\n [right]\t Select next tab\n'¶
-
INITIAL_COLUMNS
= ['', 'Identifier', 'Name', 'Size', 'Status']¶ The set of columns that should be displayed by default.
-
c
= 'Status'¶
-
-
class
nltk.downloader.
DownloaderMessage
[source]¶ Bases:
object
A status message object, used by
incr_download
to communicate its progress.
-
class
nltk.downloader.
ErrorMessage
(package, message)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server encountered an error
-
class
nltk.downloader.
FinishCollectionMessage
(collection)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished working on a collection of packages.
-
class
nltk.downloader.
FinishDownloadMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished downloading a package.
-
class
nltk.downloader.
FinishPackageMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished working on a package.
-
class
nltk.downloader.
FinishUnzipMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished unzipping a package.
-
class
nltk.downloader.
Package
(id, url, name=None, subdir='', size=None, unzipped_size=None, checksum=None, svn_revision=None, copyright='Unknown', contact='Unknown', license='Unknown', author='Unknown', unzip=True, **kw)[source]¶ Bases:
object
A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by Downloader. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.
-
author
= None¶ Author of this package.
-
checksum
= None¶ The MD-5 checksum of the package file.
-
contact
= None¶ Name & email of the person who should be contacted with questions about this package.
-
copyright
= None¶ Copyright holder for this package.
-
filename
= None¶ The filename that should be used for this package’s file. It is formed by joining self.subdir with self.id, and using the same extension as url.
-
id
= None¶ A unique identifier for this package.
-
license
= None¶ License information for this package.
-
name
= None¶ A string name for this package.
-
size
= None¶ The filesize (in bytes) of the package file.
-
subdir
= None¶ The subdirectory where this package should be installed. E.g., 'corpora' or 'taggers'.
-
svn_revision
= None¶ A subversion revision number for this package.
-
unicode_repr
()¶
-
unzip
= None¶ A flag indicating whether this corpus should be unzipped by default.
-
unzipped_size
= None¶ The total filesize of the files contained in the package’s zipfile.
-
url
= None¶ A URL that can be used to download this package’s file.
-
class
nltk.downloader.
ProgressMessage
(progress)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Indicates how much progress the data server has made
-
class
nltk.downloader.
SelectDownloadDirMessage
(download_dir)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Indicates what download directory the data server is using
-
class
nltk.downloader.
StaleMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
The package download file is out-of-date or corrupt
-
class
nltk.downloader.
StartCollectionMessage
(collection)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started working on a collection of packages.
-
class
nltk.downloader.
StartDownloadMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started downloading a package.
-
class
nltk.downloader.
StartPackageMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started working on a package.
-
class
nltk.downloader.
StartUnzipMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started unzipping a package.
-
class
nltk.downloader.
UpToDateMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
The package download file is already up-to-date
-
nltk.downloader.
build_index
(root, base_url)[source]¶ Create a new data.xml index file, by combining the xml description files for various packages and collections. root should be the path to a directory containing the package xml and zip files; and the collection xml files. The root directory is expected to have the following subdirectories:
root/
  packages/ .................. subdirectory for packages
    corpora/ ................. zip & xml files for corpora
    grammars/ ................ zip & xml files for grammars
    taggers/ ................. zip & xml files for taggers
    tokenizers/ .............. zip & xml files for tokenizers
    etc.
  collections/ ............... xml files for collections
For each package, there should be two files: package.zip (where package is the package name) which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.
For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.
All identifiers (for both packages and collections) must be unique.
-
nltk.downloader.
md5_hexdigest
(file)[source]¶ Calculate and return the MD5 checksum for a given file.
file
may either be a filename or an open stream.
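For readers who want to see what such a checksum helper involves, here is a minimal standard-library sketch. It is not NLTK's implementation: the function name md5_of_file and the block_size parameter are assumptions made for this example.

```python
import hashlib
import io


def md5_of_file(file, block_size=65536):
    """Return the MD5 hex digest of a filename or an open binary stream.

    Illustrative stand-in only; name and block_size are assumptions.
    """
    if isinstance(file, str):
        with open(file, "rb") as stream:
            return md5_of_file(stream, block_size)
    digest = hashlib.md5()
    # Read in fixed-size blocks so large package files need not fit in memory.
    for block in iter(lambda: file.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


print(md5_of_file(io.BytesIO(b"nltk")))
```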
featstruct
Module¶
Basic data classes for representing feature structures, and for
performing basic operations on those feature structures. A feature
structure is a mapping from feature identifiers to feature values,
where each feature value is either a basic value (such as a string or
an integer), or a nested feature structure. There are two types of
feature structure, implemented by two subclasses of FeatStruct:
- feature dictionaries, implemented by FeatDict, act like Python dictionaries. Feature identifiers may be strings or instances of the Feature class.
- feature lists, implemented by FeatList, act like Python lists. Feature identifiers are integers.
Feature structures are typically used to represent partial information about objects. A feature identifier that is not mapped to a value stands for a feature whose value is unknown (not a feature without a value). Two feature structures that represent (potentially overlapping) information about the same object can be combined by unification. When two inconsistent feature structures are unified, the unification fails and returns None.
Features can be specified using “feature paths”, or tuples of feature identifiers that specify a path through the nested feature structures to a value. Feature structures may contain reentrant feature values. A “reentrant feature value” is a single feature value that can be accessed via multiple feature paths. Unification preserves the reentrance relations imposed by both of the unified feature structures. In the feature structure resulting from unification, any modifications to a reentrant feature value will be visible using any of its feature paths.
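The following sketch shows successful and failed unification, plus access through a feature path. The feature names (number, person, agr) are made up for the example.

```python
from nltk.featstruct import FeatStruct

# Two partial descriptions of the same (hypothetical) noun phrase.
number_info = FeatStruct(number="sg")
person_info = FeatStruct(person=3)

# Compatible information is merged into a new feature structure.
merged = number_info.unify(person_info)
print(merged)

# Conflicting values for the same feature make unification fail.
clash = number_info.unify(FeatStruct(number="pl"))
print(clash)

# A feature path is a tuple of identifiers leading to a nested value.
nested = FeatStruct(agr=merged)
print(nested[("agr", "number")])
```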
Feature structure variables are encoded using the nltk.sem.Variable
class. The variables’ values are tracked using a bindings
dictionary, which maps variables to their values. When two feature
structures are unified, a fresh bindings dictionary is created to
track their values; and before unification completes, all bound
variables are replaced by their values. Thus, the bindings
dictionaries are usually strictly internal to the unification process.
However, it is possible to track the bindings of variables if you
choose to, by supplying your own initial bindings dictionary to the
unify()
function.
When unbound variables are unified with one another, they become aliased. This is encoded by binding one variable to the other.
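Supplying your own bindings dictionary can be sketched as below. The feature names and the value 12 are arbitrary; the point is only that the dictionary passed in is filled during unification and can be reused afterwards.

```python
from nltk.featstruct import FeatStruct

# An initially empty bindings dictionary that we keep a reference to.
bindings = {}

pattern = FeatStruct("[a=?x]")   # contains the variable ?x
value = FeatStruct("[a=12]")

result = pattern.unify(value, bindings=bindings)
print(result)
print(bindings)

# The recorded binding can then be applied to another feature structure.
later = FeatStruct("[b=?x]").substitute_bindings(bindings)
print(later)
```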
Lightweight Feature Structures¶
Many of the functions defined by nltk.featstruct
can be applied
directly to simple Python dictionaries and lists, rather than to
full-fledged FeatDict
and FeatList
objects. In other words,
Python dicts
and lists
can be used as “light-weight” feature
structures.
>>> from nltk.featstruct import unify
>>> unify(dict(x=1, y=dict()), dict(a='a', y=dict(b='b')))
{'y': {'b': 'b'}, 'x': 1, 'a': 'a'}
However, you should keep in mind the following caveats:
- Python dictionaries & lists ignore reentrance when checking for equality between values. But two FeatStructs with different reentrances are considered nonequal, even if all their base values are equal.
- FeatStructs can be easily frozen, allowing them to be used as keys in hash tables. Python dictionaries and lists can not.
- FeatStructs display reentrance in their string representations; Python dictionaries and lists do not.
- FeatStructs may not be mixed with Python dictionaries and lists (e.g., when performing unification).
- FeatStructs provide a number of useful methods, such as walk() and cyclic(), which are not available for Python dicts and lists.
In general, if your feature structures will contain any reentrances,
or if you plan to use them as dictionary keys, it is strongly
recommended that you use full-fledged FeatStruct
objects.
-
class
nltk.featstruct.
FeatStruct
[source]¶ Bases:
nltk.sem.logic.SubstituteBindingsI
A mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure. There are two types of feature structure:
- feature dictionaries, implemented by FeatDict, act like Python dictionaries. Feature identifiers may be strings or instances of the Feature class.
- feature lists, implemented by FeatList, act like Python lists. Feature identifiers are integers.
Feature structures may be indexed using either simple feature identifiers or ‘feature paths.’ A feature path is a sequence of feature identifiers that stand for a corresponding sequence of indexing operations. In particular, fstruct[(f1,f2,...,fn)] is equivalent to fstruct[f1][f2]...[fn].
Feature structures may contain reentrant feature structures. A “reentrant feature structure” is a single feature structure object that can be accessed via multiple feature paths. Feature structures may also be cyclic. A feature structure is “cyclic” if there is any feature path from the feature structure to itself.
Two feature structures are considered equal if they assign the same values to all features, and have the same reentrancies.
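The feature-path equivalence can be sketched over plain dicts and lists. The helper `follow_path` below is a hypothetical name, not part of NLTK; it simply applies the indexing operations in order.

```python
from functools import reduce
from operator import getitem


def follow_path(fstruct, path):
    """Apply each identifier in ``path`` as one indexing step."""
    return reduce(getitem, path, fstruct)


# A light-weight feature structure mixing a dict-like and a list-like part.
fs = {"agr": {"num": "sg", "per": 3}, "args": [{"cat": "NP"}]}

num = follow_path(fs, ("agr", "num"))        # same as fs["agr"]["num"]
first_cat = follow_path(fs, ("args", 0, "cat"))
whole = follow_path(fs, ())                  # the empty path is the structure itself
```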
By default, feature structures are mutable. They may be made immutable with the freeze() method. Once they have been frozen, they may be hashed, and thus used as dictionary keys.
-
copy
(deep=True)[source]¶ Return a new copy of self. The new copy will not be frozen.
Parameters: deep – If true, create a deep copy; if false, create a shallow copy.
-
equal_values
(other, check_reentrance=False)[source]¶ Return True if self and other assign the same value to every feature. In particular, return true if self[p]==other[p] for every feature path p such that self[p] or other[p] is a base value (i.e., not a nested feature structure).
Parameters: check_reentrance – If True, then also return False if there is any difference between the reentrances of self and other.
Note: the == operator is equivalent to equal_values() with check_reentrance=True.
-
freeze
()[source]¶ Make this feature structure, and any feature structures it contains, immutable. Note: this method does not attempt to ‘freeze’ any feature value that is not a FeatStruct; it is recommended that you use only immutable feature values.
-
frozen
()[source]¶ Return True if this feature structure is immutable. Feature structures can be made immutable with the freeze() method. Immutable feature structures may not be made mutable again, but new mutable copies can be produced with the copy() method.
-
remove_variables
()[source]¶ Return the feature structure that is obtained by deleting any feature whose value is a Variable.
Return type: FeatStruct
-
rename_variables
(vars=None, used_vars=(), new_vars=None)[source]¶ See: nltk.featstruct.rename_variables()
-
class
nltk.featstruct.
FeatDict
(features=None, **morefeatures)[source]¶ Bases:
nltk.featstruct.FeatStruct
,dict
A feature structure that acts like a Python dictionary, i.e., a mapping from feature identifiers to feature values, where a feature identifier can be a string or a Feature; and where a feature value can be either a basic value (such as a string or an integer), or a nested feature structure. A feature identifier for a FeatDict is sometimes called a “feature name”.
Two feature dicts are considered equal if they assign the same values to all features, and have the same reentrances.
See: FeatStruct for information about feature paths, reentrance, cyclic feature structures, mutability, freezing, and hashing.
-
clear
() → None. Remove all items from D.¶ If self is frozen, raise ValueError.
-
get
(name_or_path, default=None)[source]¶ If the feature with the given name or path exists, return its value; otherwise, return
default
.
-
pop
(k[, d]) → v, remove specified key and return the corresponding value.¶ If key is not found, d is returned if given; otherwise KeyError is raised. If self is frozen, raise ValueError.
-
popitem
() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.¶ If self is frozen, raise ValueError.
-
setdefault
(k[, d]) → D.get(k,d), also set D[k]=d if k not in D¶ If self is frozen, raise ValueError.
-
unicode_repr
()¶ Display a single-line representation of this feature structure, suitable for embedding in other representations.
-
-
class
nltk.featstruct.
FeatList
(features=())[source]¶ Bases:
nltk.featstruct.FeatStruct
,list
A list of feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure.
Feature lists may contain reentrant feature values. A “reentrant feature value” is a single feature value that can be accessed via multiple feature paths. Feature lists may also be cyclic.
Two feature lists are considered equal if they assign the same values to all features, and have the same reentrances.
See: FeatStruct for information about feature paths, reentrance, cyclic feature structures, mutability, freezing, and hashing.
-
append
(object) → None -- append object to end¶ If self is frozen, raise ValueError.
-
extend
(iterable) → None -- extend list by appending elements from the iterable¶ If self is frozen, raise ValueError.
-
insert
(*args, **kwargs)¶ L.insert(index, object) – insert object before index. If self is frozen, raise ValueError.
-
pop
([index]) → item -- remove and return item at index (default last).¶ Raises IndexError if list is empty or index is out of range. If self is frozen, raise ValueError.
-
remove
(value) → None -- remove first occurrence of value.¶ Raises ValueError if the value is not present. If self is frozen, raise ValueError.
-
reverse
(*args, **kwargs)¶ L.reverse() – reverse IN PLACE. If self is frozen, raise ValueError.
-
sort
(key=None, reverse=False) → None -- stable sort *IN PLACE*¶ If self is frozen, raise ValueError.
-
-
nltk.featstruct.
unify
(fstruct1, fstruct2, bindings=None, trace=False, fail=None, rename_vars=True, fs_class='default')[source]¶ Unify fstruct1 with fstruct2, and return the resulting feature structure. This unified feature structure is the minimal feature structure that contains all feature value assignments from both fstruct1 and fstruct2, and that preserves all reentrancies.
If no such feature structure exists (because fstruct1 and fstruct2 specify incompatible values for some feature), then unification fails, and unify returns None.
Bound variables are replaced by their values. Aliased variables are replaced by their representative variable (if unbound) or the value of their representative variable (if bound). I.e., if variable v is in bindings, then v is replaced by bindings[v]. This will be repeated until the variable is replaced by an unbound variable or a non-variable value.
Unbound variables are bound when they are unified with values; and aliased when they are unified with variables. I.e., if variable v is not in bindings, and is unified with a variable or value x, then bindings[v] is set to x.
If bindings is unspecified, then all variables are assumed to be unbound. I.e., bindings defaults to an empty dict.
>>> from nltk.featstruct import FeatStruct
>>> FeatStruct('[a=?x]').unify(FeatStruct('[b=?x]'))
[a=?x, b=?x2]
Parameters: - bindings (dict(Variable -> any)) – A set of variable bindings to be used and updated during unification.
- trace (bool) – If true, generate trace output.
- rename_vars (bool) – If True, then rename any variables in fstruct2 that are also used in fstruct1, in order to avoid collisions on variable names.
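To make the core idea concrete, here is a deliberately simplified sketch of unification over nested Python dicts. It is not NLTK's implementation: it ignores variables, bindings, reentrance, and feature lists, and the names `unify_dicts` and `subsumes_dicts` are hypothetical.

```python
def unify_dicts(fs1, fs2):
    """Merge two nested dicts; return None if any feature conflicts."""
    result = dict(fs1)  # shallow copy so fs1 itself is left untouched
    for name, value2 in fs2.items():
        if name not in result:
            result[name] = value2
            continue
        value1 = result[name]
        if isinstance(value1, dict) and isinstance(value2, dict):
            merged = unify_dicts(value1, value2)
            if merged is None:
                return None          # a nested conflict fails the whole unification
            result[name] = merged
        elif value1 != value2:
            return None              # incompatible base values
    return result


def subsumes_dicts(fs1, fs2):
    """fs1 subsumes fs2 if unifying them adds nothing to fs2."""
    return unify_dicts(fs1, fs2) == fs2


merged = unify_dicts(dict(x=1, y=dict()), dict(a="a", y=dict(b="b")))
failed = unify_dicts({"num": "sg"}, {"num": "pl"})
```

The first call produces the same value assignments as the light-weight `unify()` example earlier in this module; the second shows the failure case, where the result is None.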
-
nltk.featstruct.
subsumes
(fstruct1, fstruct2)[source]¶ Return True if fstruct1 subsumes fstruct2. I.e., return true if unifying fstruct1 with fstruct2 would result in a feature structure equal to fstruct2.
Return type: bool
-
nltk.featstruct.
conflicts
(fstruct1, fstruct2, trace=0)[source]¶ Return a list of the feature paths of all features which are assigned incompatible values by fstruct1 and fstruct2.
Return type: list(tuple)
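As an illustration of what such a list of conflicting feature paths looks like, the following sketch walks two nested dicts in parallel. It is a plain-Python approximation under the assumption that feature structures are ordinary dicts; `conflicting_paths` is a hypothetical helper, not the NLTK function.

```python
def conflicting_paths(fs1, fs2, prefix=()):
    """Collect the paths at which two nested dicts disagree."""
    paths = []
    for name, value1 in fs1.items():
        if name not in fs2:
            continue                      # a feature missing on one side is no conflict
        value2 = fs2[name]
        path = prefix + (name,)
        if isinstance(value1, dict) and isinstance(value2, dict):
            paths.extend(conflicting_paths(value1, value2, path))
        elif value1 != value2:
            paths.append(path)
    return paths


fs_a = {"cat": "NP", "agr": {"num": "sg", "per": 3}}
fs_b = {"cat": "VP", "agr": {"num": "pl", "per": 3}}
clashes = conflicting_paths(fs_a, fs_b)
```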
-
class
nltk.featstruct.
Feature
(name, default=None, display=None)[source]¶ Bases:
object
A feature identifier that’s specialized to put additional constraints, default values, etc.
-
default
¶ Default value for this feature.
-
display
¶ Custom display location: can be prefix, or slash.
-
name
¶ The name of this feature.
-
unicode_repr
()¶
-
-
class
nltk.featstruct.
SlashFeature
(name, default=None, display=None)[source]¶ Bases:
nltk.featstruct.Feature
-
class
nltk.featstruct.
RangeFeature
(name, default=None, display=None)[source]¶ Bases:
nltk.featstruct.Feature
-
RANGE_RE
= re.compile('(-?\\d+):(-?\\d+)')¶
-
-
class
nltk.featstruct.
FeatStructReader
(features=(*slash*, *type*), fdict_class=<class 'nltk.featstruct.FeatStruct'>, flist_class=<class 'nltk.featstruct.FeatList'>, logic_parser=None)[source]¶ Bases:
object
-
VALUE_HANDLERS
= [('read_fstruct_value', re.compile('\\s*(?:\\((\\d+)\\)\\s*)?(\\??[\\w-]+)?(\\[)')), ('read_var_value', re.compile('\\?[a-zA-Z_][a-zA-Z0-9_]*')), ('read_str_value', re.compile('[uU]?[rR]?([\'"])')), ('read_int_value', re.compile('-?\\d+')), ('read_sym_value', re.compile('[a-zA-Z_][a-zA-Z0-9_]*')), ('read_app_value', re.compile('<(app)\\((\\?[a-z][a-z]*)\\s*,\\s*(\\?[a-z][a-z]*)\\)>')), ('read_logic_value', re.compile('<(.*?)(?<!-)>')), ('read_set_value', re.compile('{')), ('read_tuple_value', re.compile('\\('))]¶ A table indicating how feature values should be processed. Each entry in the table is a pair (handler, regexp). The first entry with a matching regexp will have its handler called. Handlers should have the following signature:
def handler(s, position, reentrances, match): ...
and should return a tuple (value, position), where position is the string position where the value ended. (n.b.: order is important here!)
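The dispatch scheme described above, an ordered table in which the first matching regexp wins, can be sketched as follows. This is an illustrative reduction with only two handlers; the table here stores the handler functions directly rather than handler names, and `read_value` and `TOY_HANDLERS` are hypothetical names.

```python
import re


def read_int_value(s, position, reentrances, match):
    return int(match.group()), match.end()


def read_sym_value(s, position, reentrances, match):
    return match.group(), match.end()


# Order matters: the integer pattern is tried before the symbol pattern.
TOY_HANDLERS = [
    (read_int_value, re.compile(r"-?\d+")),
    (read_sym_value, re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
]


def read_value(s, position=0, reentrances=None):
    """Call the first handler whose regexp matches at ``position``."""
    reentrances = {} if reentrances is None else reentrances
    for handler, regexp in TOY_HANDLERS:
        match = regexp.match(s, position)
        if match:
            return handler(s, position, reentrances, match)
    raise ValueError("no handler matches at position %d" % position)
```

For example, `read_value("x=-12", 2)` returns the parsed integer together with the position just past it.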
-
fromstring
(s, fstruct=None)[source]¶ Convert a string representation of a feature structure (as displayed by repr) into a FeatStruct. This process imposes the following restrictions on the string representation:
- Feature names cannot contain any of the following: whitespace, parentheses, quote marks, equals signs, dashes, commas, and square brackets. Feature names may not begin with plus signs or minus signs.
- Only the following basic feature values are supported: strings, integers, variables, None, and unquoted alphanumeric strings.
- For reentrant values, the first mention must specify a reentrance identifier and a value; and any subsequent mentions must use arrows ('->') to reference the reentrance identifier.
-
read_partial
(s, position=0, reentrances=None, fstruct=None)[source]¶ Helper function that reads in a feature structure.
Parameters: - s – The string to read.
- position – The position in the string to start parsing.
- reentrances – A dictionary from reentrance ids to values. Defaults to an empty dictionary.
Returns: A tuple (val, pos) of the feature structure created by parsing and the position where the parsed feature structure ends.
Return type: tuple
-
grammar
Module¶
Basic data classes for representing context free grammars. A “grammar” specifies which trees can represent the structure of a given text. Each of these trees is called a “parse tree” for the text (or simply a “parse”). In a “context free” grammar, the set of parse trees for any piece of a text can depend only on that piece, and not on the rest of the text (i.e., the piece’s context). Context free grammars are often used to find possible syntactic structures for sentences. In this context, the leaves of a parse tree are word tokens; and the node values are phrasal categories, such as NP and VP.
The CFG class is used to encode context free grammars. Each CFG consists of a start symbol and a set of productions. The “start symbol” specifies the root node value for parse trees. For example, the start symbol for syntactic parsing is usually S. Start symbols are encoded using the Nonterminal class, which is discussed below.
A Grammar’s “productions” specify what parent-child relationships a parse tree can contain. Each production specifies that a particular node can be the parent of a particular set of children. For example, the production <S> -> <NP> <VP> specifies that an S node can be the parent of an NP node and a VP node.
Grammar productions are implemented by the Production class. Each Production consists of a left hand side and a right hand side. The “left hand side” is a Nonterminal that specifies the node type for a potential parent; and the “right hand side” is a list that specifies allowable children for that parent. This list consists of Nonterminals and text types: each Nonterminal indicates that the corresponding child may be a TreeToken with the specified node type; and each text type indicates that the corresponding child may be a Token with that type.
The Nonterminal class is used to distinguish node values from leaf values. This prevents the grammar from accidentally using a leaf value (such as the English word “A”) as the node of a subtree. Within a CFG, all node values are wrapped in the Nonterminal class. Note, however, that the trees that are specified by the grammar do not include these Nonterminal wrappers.
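Grammars can also be given a more procedural interpretation. According to this interpretation, a Grammar specifies any tree structure (tree) that can be produced by the following procedure:
- Set tree to the start symbol.
- Repeat until tree contains no more nonterminal leaves:
- Choose a production prod whose left hand side lhs is a nonterminal leaf of tree.
- Replace the nonterminal leaf with a subtree, whose node value is the value wrapped by the nonterminal lhs, and whose children are the right hand side of prod.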
The operation of replacing the left hand side (lhs) of a production with the right hand side (rhs) in a tree (tree) is known as “expanding” lhs to rhs in tree.
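The procedural reading of a grammar can be sketched as a recursive expansion from the start symbol. This is a toy illustration, not NLTK code: productions are stored in a plain dict, any symbol that is a key of that dict is treated as a nonterminal (an assumption standing in for the Nonterminal wrapper), and `expand` and its `choose` parameter are hypothetical names.

```python
def expand(symbol, productions, choose=lambda options: options[0]):
    """Expand ``symbol`` until only terminal leaves remain."""
    if symbol not in productions:
        return symbol                       # terminal leaf: nothing to expand
    rhs = choose(productions[symbol])       # pick one production for this lhs
    children = [expand(child, productions, choose) for child in rhs]
    return (symbol, children)               # node value plus its subtrees


toy_productions = {
    "S": [["NP", "VP"]],
    "NP": [["the", "dog"]],
    "VP": [["barks"]],
}
tree = expand("S", toy_productions)
```

With the default `choose`, the first production is always taken, so a recursive grammar whose first alternative refers back to itself would not terminate; a random or depth-limited chooser would be needed in that case.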
-
class
nltk.grammar.
Nonterminal
(symbol)[source]¶ Bases:
object
A non-terminal symbol for a context free grammar. Nonterminal is a wrapper class for node values; it is used by Production objects to distinguish node values from leaf values. The node value that is wrapped by a Nonterminal is known as its “symbol”. Symbols are typically strings representing phrasal categories (such as "NP" or "VP"). However, more complex symbol types are sometimes used (e.g., for lexicalized grammars). Since symbols are node values, they must be immutable and hashable. Two Nonterminals are considered equal if their symbols are equal.
See: CFG, Production
Variables: _symbol – The node value corresponding to this Nonterminal. This value must be immutable and hashable.
-
nltk.grammar.
nonterminals
(symbols)[source]¶ Given a string containing a list of symbol names, return a list of Nonterminals constructed from those symbols.
Parameters: symbols (str) – The symbol name string. This string can be delimited by either spaces or commas.
Returns: A list of Nonterminals constructed from the symbol names given in symbols. The Nonterminals are sorted in the same order as the symbol names.
Return type: list(Nonterminal)
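The wrapper idea and the delimiter handling can be sketched with a small frozen dataclass. `Symbol` and `parse_symbols` below are hypothetical stand-ins used only for illustration; they are not the NLTK classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Immutable wrapper; equality and hashing follow the wrapped name."""
    name: str


def parse_symbols(symbols):
    """Split on commas if any are present, otherwise on whitespace."""
    parts = symbols.split(",") if "," in symbols else symbols.split()
    return [Symbol(part.strip()) for part in parts]


by_commas = parse_symbols("S, NP, VP")
by_spaces = parse_symbols("S NP VP")
```

Because the wrapper is frozen, two wrappers around the same name compare equal and hash alike, so they can be used in sets and as dictionary keys.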
-
class
nltk.grammar.
CFG
(start, productions, calculate_leftcorners=True)[source]¶ Bases:
object
A context-free grammar. A grammar consists of a start state and a set of productions. The set of terminals and nonterminals is implicitly specified by the productions.
If you need efficient key-based access to productions, you can use a subclass to implement it.
-
check_coverage
(tokens)[source]¶ Check whether the grammar rules cover the given list of tokens. If not, then raise an exception.
-
classmethod
fromstring
(input, encoding=None)[source]¶ Return the CFG corresponding to the input string(s).
Parameters: input – a grammar, either in the form of a string or as a list of strings.
-
is_binarised
()[source]¶ Return True if all productions are at most binary. Note that there can still be empty and unary productions.
-
is_chomsky_normal_form
()[source]¶ Return True if the grammar is of Chomsky Normal Form, i.e. all productions are of the form A -> B C, or A -> “s”.
-
is_flexible_chomsky_normal_form
()[source]¶ Return True if all productions are of the forms A -> B C, A -> B, or A -> “s”.
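The two normal-form conditions can be checked directly from their definitions. The sketch below assumes productions are `(lhs, rhs)` tuples and that nonterminals are identified by membership in an explicit set; both function names are hypothetical and this is not how NLTK computes these properties internally.

```python
def is_cnf(productions, nonterminals):
    """Every production is A -> B C (nonterminals) or A -> 's' (one terminal)."""
    for _lhs, rhs in productions:
        binary = len(rhs) == 2 and all(sym in nonterminals for sym in rhs)
        lexical = len(rhs) == 1 and rhs[0] not in nonterminals
        if not (binary or lexical):
            return False
    return True


def is_flexible_cnf(productions, nonterminals):
    """Like is_cnf, but unary rules A -> B are also allowed."""
    for _lhs, rhs in productions:
        if len(rhs) == 2 and all(sym in nonterminals for sym in rhs):
            continue
        if len(rhs) == 1:
            continue                  # either A -> B or A -> 's'
        return False
    return True


symbols = {"S", "NP", "VP", "V"}
strict = [("S", ("NP", "VP")), ("NP", ("dog",)), ("VP", ("barks",))]
with_unary = strict[:2] + [("VP", ("V",)), ("V", ("barks",))]
```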
-
is_leftcorner
(cat, left)[source]¶ True if left is a leftcorner of cat, where left can be a terminal or a nonterminal.
Parameters: - cat (Nonterminal) – the parent of the leftcorner
- left (Terminal or Nonterminal) – the suggested leftcorner
Return type: bool
-
is_nonlexical
()[source]¶ Return True if all lexical rules are “preterminals”, that is, unary rules which can be separated in a preprocessing step.
This means that all productions are of the forms A -> B1 ... Bn (n>=0), or A -> “s”.
Note: is_lexical() and is_nonlexical() are not opposites. There are grammars which are neither, and grammars which are both.
-
leftcorner_parents
(cat)[source]¶ Return the set of all nonterminals for which the given category is a left corner. This is the inverse of the leftcorner relation.
Parameters: cat (Nonterminal) – the suggested leftcorner
Returns: the set of all parents to the leftcorner
Return type: set(Nonterminal)
-
leftcorners
(cat)[source]¶ Return the set of all nonterminals that the given nonterminal can start with, including itself.
This is the reflexive, transitive closure of the immediate leftcorner relation: (A > B) iff (A -> B beta)
Parameters: cat (Nonterminal) – the parent of the leftcorners
Returns: the set of all leftcorners
Return type: set(Nonterminal)
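The reflexive, transitive closure of the immediate leftcorner relation can be computed with a simple graph traversal. The sketch below is an assumption-laden illustration: productions are `(lhs, rhs)` tuples, a symbol counts as a nonterminal only if it appears as some left-hand side, and `leftcorner_closure` is a hypothetical name.

```python
def leftcorner_closure(productions):
    """Map each nonterminal to the nonterminals it can start with, itself included."""
    immediate = {}
    for lhs, rhs in productions:
        immediate.setdefault(lhs, set())
        if rhs:
            immediate[lhs].add(rhs[0])      # A > B iff A -> B beta
    closure = {}
    for cat in immediate:
        seen = {cat}                        # reflexive: every category starts with itself
        stack = [cat]
        while stack:                        # transitive: follow first symbols repeatedly
            current = stack.pop()
            for left in immediate.get(current, ()):
                if left not in seen:
                    seen.add(left)
                    stack.append(left)
        closure[cat] = seen & set(immediate)  # keep nonterminals only
    return closure


toy = [
    ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("NP", ("NP", "PP")),
    ("Det", ("the",)), ("N", ("dog",)), ("VP", ("V",)),
    ("V", ("barks",)), ("PP", ("P", "NP")), ("P", ("in",)),
]
corners = leftcorner_closure(toy)
```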
-
productions
(lhs=None, rhs=None, empty=False)[source]¶ Return the grammar productions, filtered by the left-hand side or the first item in the right-hand side.
Parameters: - lhs – Only return productions with the given left-hand side.
- rhs – Only return productions with the given first item in the right-hand side.
- empty – Only return productions with an empty right-hand side.
Returns: A list of productions matching the given constraints.
Return type:
-
start
()[source]¶ Return the start symbol of the grammar
Return type: Nonterminal
-
unicode_repr
()¶
-
-
class
nltk.grammar.
Production
(lhs, rhs)[source]¶ Bases:
object
A grammar production. Each production maps a single symbol on the “left-hand side” to a sequence of symbols on the “right-hand side”. (In the case of context-free productions, the left-hand side must be a Nonterminal, and the right-hand side is a sequence of terminals and Nonterminals.) “Terminals” can be any immutable hashable object that is not a Nonterminal. Typically, terminals are strings representing words, such as "dog" or "under".
See: CFG
See: DependencyGrammar
See: Nonterminal
Variables: - _lhs – The left-hand side of the production.
- _rhs – The right-hand side of the production.
-
is_lexical
()[source]¶ Return True if the right-hand side contains at least one terminal token.
Return type: bool
-
is_nonlexical
()[source]¶ Return True if the right-hand side only contains Nonterminals.
Return type: bool
-
lhs
()[source]¶ Return the left-hand side of this Production.
Return type: Nonterminal
-
class
nltk.grammar.
PCFG
(start, productions, calculate_leftcorners=True)[source]¶ Bases:
nltk.grammar.CFG
A probabilistic context-free grammar. A PCFG consists of a start state and a set of productions with probabilities. The set of terminals and nonterminals is implicitly specified by the productions.
PCFG productions use the ProbabilisticProduction class. PCFGs impose the constraint that the set of productions with any given left-hand-side must have probabilities that sum to 1 (allowing for a small margin of error).
If you need efficient key-based access to productions, you can use a subclass to implement it.
Variables: EPSILON – The acceptable margin of error for checking that productions with a given left-hand side have probabilities that sum to 1.
-
EPSILON
= 0.01¶
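The sum-to-one constraint with its margin of error can be sketched as a per-left-hand-side check. This is an illustration only: productions are assumed to be `(lhs, rhs, prob)` tuples and `probabilities_are_consistent` is a hypothetical helper, not the check NLTK performs in its constructor.

```python
from collections import defaultdict

EPSILON = 0.01  # same margin as the class attribute documented above


def probabilities_are_consistent(weighted_productions, epsilon=EPSILON):
    """True if, for every lhs, the probabilities sum to 1 within epsilon."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in weighted_productions:
        totals[lhs] += prob
    return all(abs(total - 1.0) < epsilon for total in totals.values())


good = [("S", ("NP", "VP"), 1.0), ("NP", ("dog",), 0.6), ("NP", ("cat",), 0.4)]
bad = [("S", ("NP", "VP"), 1.0), ("NP", ("dog",), 0.6), ("NP", ("cat",), 0.3)]
nearly = [("NP", ("dog",), 0.5), ("NP", ("cat",), 0.495)]
```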
-
-
class
nltk.grammar.
ProbabilisticProduction
(lhs, rhs, **prob)[source]¶ Bases:
nltk.grammar.Production
,nltk.probability.ImmutableProbabilisticMixIn
A probabilistic context free grammar production. A PCFG ProbabilisticProduction is essentially just a Production that has an associated probability, which represents how likely it is that this production will be used. In particular, the probability of a ProbabilisticProduction records the likelihood that its right-hand side is the correct instantiation for any given occurrence of its left-hand side.
See: Production
-
class
nltk.grammar.
DependencyGrammar
(productions)[source]¶ Bases:
object
A dependency grammar. A DependencyGrammar consists of a set of productions. Each production specifies a head/modifier relationship between a pair of words.
-
contains
(head, mod)[source]¶ Returns: True if this DependencyGrammar contains a DependencyProduction mapping ‘head’ to ‘mod’.
Return type: bool
-
unicode_repr
()¶ Return a concise string representation of the DependencyGrammar.
-
-
class
nltk.grammar.
DependencyProduction
(lhs, rhs)[source]¶ Bases:
nltk.grammar.Production
A dependency grammar production. Each production maps a single head word to an unordered list of one or more modifier words.
-
class
nltk.grammar.
ProbabilisticDependencyGrammar
(productions, events, tags)[source]¶ Bases:
object
-
contains
(head, mod)[source]¶ Return True if this ProbabilisticDependencyGrammar contains a DependencyProduction mapping ‘head’ to ‘mod’.
Return type: bool
-
unicode_repr
()¶ Return a concise string representation of the ProbabilisticDependencyGrammar.
-
-
nltk.grammar.
induce_pcfg
(start, productions)[source]¶ Induce a PCFG grammar from a list of productions.
The probability of a production A -> B C in a PCFG is:
P(B, C | A) = count(A -> B C) / count(A -> *), where * is any right hand side
Parameters: - start (Nonterminal) – The start symbol
- productions (list(Production)) – The list of productions that defines the grammar
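The relative-frequency estimate above can be reproduced with two counters. This sketch assumes productions are hashable `(lhs, rhs)` tuples and uses the hypothetical name `induce_probabilities`; it returns a plain dict rather than a grammar object.

```python
from collections import Counter


def induce_probabilities(productions):
    """P(rhs | lhs) = count(lhs -> rhs) / count(lhs -> *)."""
    production_counts = Counter(productions)
    lhs_counts = Counter(lhs for lhs, _rhs in productions)
    return {
        production: count / lhs_counts[production[0]]
        for production, count in production_counts.items()
    }


observed = [("NP", ("Det", "N"))] * 3 + [("NP", ("N",))] + [("S", ("NP", "VP"))] * 2
probs = induce_probabilities(observed)
```

In this toy sample the NP rule with a determiner was seen three times out of four NP expansions, so it receives three quarters of the probability mass for NP.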
-
nltk.grammar.
read_grammar
(input, nonterm_parser, probabilistic=False, encoding=None)[source]¶ Return a pair consisting of a starting category and a list of Productions.
Parameters: - input – a grammar, either in the form of a string or else as a list of strings.
- nonterm_parser – a function for parsing nonterminals. It should take a (string, position) as argument and return a (nonterminal, position) as result.
- probabilistic (bool) – are the grammar rules probabilistic?
- encoding (str) – the encoding of the grammar, if it is a binary string
probability
Module¶
Classes for representing and processing probabilistic information.
The FreqDist class is used to encode “frequency distributions”, which count the number of times that each outcome of an experiment occurs.
The ProbDistI class defines a standard interface for “probability distributions”, which encode the probability of each outcome for an experiment. There are two types of probability distribution:
- “derived probability distributions” are created from frequency distributions. They attempt to model the probability distribution that generated the frequency distribution.
- “analytic probability distributions” are created directly from parameters (such as variance).
The ConditionalFreqDist class and ConditionalProbDistI interface are used to encode conditional distributions. Conditional probability distributions can be derived or analytic; but currently the only implementation of the ConditionalProbDistI interface is ConditionalProbDist, a derived distribution.
-
class
nltk.probability.
ConditionalFreqDist
(cond_samples=None)[source]¶ Bases:
collections.defaultdict
A collection of frequency distributions for a single experiment run under different conditions. Conditional frequency distributions are used to record the number of times each sample occurred, given the condition under which the experiment was run. For example, a conditional frequency distribution could be used to record the frequency of each word (type) in a document, given its length. Formally, a conditional frequency distribution can be defined as a function that maps from each condition to the FreqDist for the experiment under that condition.
Conditional frequency distributions are typically constructed by repeatedly running an experiment under a variety of conditions, and incrementing the sample outcome counts for the appropriate conditions. For example, the following code will produce a conditional frequency distribution that encodes how often each word type occurs, given the length of that word type:
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.tokenize import word_tokenize
>>> sent = "the the the dog dog some other words that we do not care about"
>>> cfdist = ConditionalFreqDist()
>>> for word in word_tokenize(sent):
...     condition = len(word)
...     cfdist[condition][word] += 1
An equivalent way to do this is with the initializer:
>>> cfdist = ConditionalFreqDist((len(word), word) for word in word_tokenize(sent))
The frequency distribution for each condition is accessed using the indexing operator:
>>> cfdist[3]
FreqDist({'the': 3, 'dog': 2, 'not': 1})
>>> cfdist[3].freq('the')
0.5
>>> cfdist[3]['dog']
2
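The underlying data structure can be approximated with the standard library alone, which may help when reasoning about what the class stores. This is a sketch, not the NLTK implementation: it uses `defaultdict(Counter)` and whitespace splitting in place of `ConditionalFreqDist` and `word_tokenize`.

```python
from collections import Counter, defaultdict

sent = "the the the dog dog some other words that we do not care about"

# One Counter per condition (here, the word length).
toy_cfdist = defaultdict(Counter)
for word in sent.split():
    toy_cfdist[len(word)][word] += 1

three_letter = toy_cfdist[3]
freq_the = three_letter["the"] / sum(three_letter.values())
total_outcomes = sum(sum(counter.values()) for counter in toy_cfdist.values())
```

As with the real class, indexing a condition that has not been seen yet creates an empty distribution for it.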
When the indexing operator is used to access the frequency distribution for a condition that has not been accessed before, ConditionalFreqDist creates a new empty FreqDist for that condition.
-
N
()[source]¶ Return the total number of sample outcomes that have been recorded by this ConditionalFreqDist.
Return type: int
-
conditions
()[source]¶ Return a list of the conditions that have been accessed for this ConditionalFreqDist. Use the indexing operator to access the frequency distribution for a given condition. Note that the frequency distributions for some conditions may contain zero sample outcomes.
Return type: list
-
plot
(*args, **kwargs)[source]¶ Plot the given samples from the conditional frequency distribution. For a cumulative plot, specify cumulative=True. (Requires Matplotlib to be installed.)
Parameters:
-
-
class
nltk.probability.
ConditionalProbDist
(cfdist, probdist_factory, *factory_args, **factory_kw_args)[source]¶ Bases:
nltk.probability.ConditionalProbDistI
A conditional probability distribution modeling the experiments that were used to generate a conditional frequency distribution. A ConditionalProbDist is constructed from a ConditionalFreqDist and a ProbDist factory:
- The ConditionalFreqDist specifies the frequency distribution for each condition.
- The ProbDist factory is a function that takes a condition’s frequency distribution, and returns its probability distribution. A ProbDist class’s name (such as MLEProbDist or HeldoutProbDist) can be used to specify that class’s constructor.
The first argument to the ProbDist factory is the frequency distribution that it should model; and the remaining arguments are specified by the factory_args parameter to the ConditionalProbDist constructor. For example, the following code constructs a ConditionalProbDist, where the probability distribution for each condition is an ELEProbDist with 10 bins:
>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.probability import ConditionalProbDist, ELEProbDist
>>> cfdist = ConditionalFreqDist(brown.tagged_words()[:5000])
>>> cpdist = ConditionalProbDist(cfdist, ELEProbDist, 10)
>>> cpdist['passed'].max()
'VBD'
>>> cpdist['passed'].prob('VBD')
0.423...
-
class
nltk.probability.
ConditionalProbDistI
[source]¶ Bases:
dict
A collection of probability distributions for a single experiment run under different conditions. Conditional probability distributions are used to estimate the likelihood of each sample, given the condition under which the experiment was run. For example, a conditional probability distribution could be used to estimate the probability of each word type in a document, given the length of the word type. Formally, a conditional probability distribution can be defined as a function that maps from each condition to the ProbDist for the experiment under that condition.
-
class
nltk.probability.
CrossValidationProbDist
(freqdists, bins)[source]¶ Bases:
nltk.probability.ProbDistI
The cross-validation estimate for the probability distribution of the experiment used to generate a set of frequency distributions. The “cross-validation estimate” for the probability of a sample is found by averaging the held-out estimates for the sample in each pair of frequency distributions.
-
SUM_TO_ONE
= False¶
-
-
class
nltk.probability.
DictionaryConditionalProbDist
(probdist_dict)[source]¶ Bases:
nltk.probability.ConditionalProbDistI
An alternative ConditionalProbDist that simply wraps a dictionary of ProbDists rather than creating these from FreqDists.
-
class
nltk.probability.
DictionaryProbDist
(prob_dict=None, log=False, normalize=False)[source]¶ Bases:
nltk.probability.ProbDistI
A probability distribution whose probabilities are directly specified by a given dictionary. The given dictionary maps samples to probabilities.
-
unicode_repr
()¶
-
-
class
nltk.probability.
ELEProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.LidstoneProbDist
The expected likelihood estimate for the probability distribution of the experiment used to generate a frequency distribution. The “expected likelihood estimate” approximates the probability of a sample with count c from an experiment with N outcomes and B bins as (c+0.5)/(N+B/2). This is equivalent to adding 0.5 to the count for each bin, and taking the maximum likelihood estimate of the resulting frequency distribution.
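The formula can be checked by hand on a small example. The sketch below is plain arithmetic over a dict of counts; `ele_prob` is a hypothetical helper and no NLTK class is used.

```python
def ele_prob(count, total, bins):
    """Expected likelihood estimate: (c + 0.5) / (N + B/2)."""
    return (count + 0.5) / (total + bins / 2)


counts = {"a": 3, "b": 1}     # observed outcomes
bins = 4                      # B: two seen samples plus two unseen ones
total = sum(counts.values())  # N

seen_probs = {sample: ele_prob(c, total, bins) for sample, c in counts.items()}
unseen_prob = ele_prob(0, total, bins)

# Adding 0.5 to every bin keeps the estimate a proper distribution over B bins.
mass = sum(seen_probs.values()) + (bins - len(counts)) * unseen_prob
```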
-
class
nltk.probability.
FreqDist
(samples=None)[source]¶ Bases:
collections.Counter
A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
Frequency distributions are generally constructed by running a number of experiments, and incrementing the count for a sample every time it is an outcome of an experiment. For example, the following code will produce a frequency distribution that encodes how often each word occurs in a text:
>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> sent = 'This is an example sentence'
>>> fdist = FreqDist()
>>> for word in word_tokenize(sent):
...     fdist[word.lower()] += 1
An equivalent way to do this is with the initializer:
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
-
B
()[source]¶ Return the total number of sample values (or “bins”) that have counts greater than zero. For the total number of sample outcomes recorded, use FreqDist.N(). (FreqDist.B() is the same as len(FreqDist).)
Return type: int
-
N
()[source]¶ Return the total number of sample outcomes that have been recorded by this FreqDist. For the number of unique sample values (or bins) with counts greater than zero, use FreqDist.B().
Return type: int
-
freq
(sample)[source]¶ Return the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this FreqDist. The count of a sample is defined as the number of times that sample outcome was recorded by this FreqDist. Frequencies are always real numbers in the range [0, 1].
Parameters: sample (any) – the sample whose frequency should be returned.
Return type: float
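Since FreqDist is based on collections.Counter, the quantities B, N, and the frequency of a sample can be illustrated with a bare Counter. This is a sketch of the definitions only; `toy_freq` is a hypothetical helper and the whitespace split stands in for a real tokenizer.

```python
from collections import Counter

toy_fdist = Counter(word.lower() for word in "The dog saw the other dog".split())

bins_seen = len(toy_fdist)           # B: distinct samples with a count above zero
outcomes = sum(toy_fdist.values())   # N: total number of recorded outcomes


def toy_freq(sample):
    """Count of the sample divided by N; 0 when nothing has been recorded."""
    return toy_fdist[sample] / outcomes if outcomes else 0


freq_the = toy_freq("the")
freq_cat = toy_freq("cat")   # unseen samples have frequency 0
```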
-
max
()[source]¶ Return the sample with the greatest number of outcomes in this frequency distribution. If two or more samples have the same number of outcomes, return one of them; which sample is returned is undefined. If no outcomes have occurred in this frequency distribution, return None.
Returns: The sample with the maximum number of outcomes in this frequency distribution.
Return type: any or None
-
pformat
(maxlen=10)[source]¶ Return a string representation of this FreqDist.
Parameters: maxlen (int) – The maximum number of items to display
Return type: string
-
plot
(*args, **kwargs)[source]¶ Plot samples from the frequency distribution displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted. For a cumulative plot, specify cumulative=True. (Requires Matplotlib to be installed.)
Parameters: - title (str) – The title for the graph
- cumulative – A flag to specify whether the plot is cumulative (default = False)
-
pprint
(maxlen=10, stream=None)[source]¶ Print a string representation of this FreqDist to ‘stream’
Parameters: - maxlen (int) – The maximum number of items to print
- stream – The stream to print to. stdout by default
-
r_Nr
(bins=None)[source]¶ Return the dictionary mapping r to Nr, the number of samples with frequency r, where Nr > 0.
Parameters: bins (int) – The number of possible sample outcomes. bins is used to calculate Nr(0). In particular, Nr(0) is bins-self.B(). If bins is not specified, it defaults to self.B() (so Nr(0) will be 0). Return type: int
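As a rough illustration of the mapping from r to Nr, the following sketch builds it from a collections.Counter. It does not call NLTK; the helper name r_nr_sketch and the sample data are assumptions made for the example.

```python
from collections import Counter

def r_nr_sketch(counts, bins=None):
    """Map each count r to Nr, the number of samples seen exactly r times."""
    seen = {s: c for s, c in counts.items() if c > 0}
    b = len(seen)                 # plays the role of self.B()
    if bins is None:
        bins = b                  # default described above, so Nr(0) becomes 0
    r_nr = Counter(seen.values()) # how many samples share each count
    r_nr[0] = bins - b            # bins that were never observed
    return dict(r_nr)

counts = Counter("a rose is a rose is a rose".split())
table = r_nr_sketch(counts, bins=5)
```

Here two samples occur three times, one occurs twice, and two of the five assumed bins are unseen.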
-
tabulate
(*args, **kwargs)[source]¶ Tabulate the given samples from the frequency distribution (optionally cumulative), displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been tabulated.
Parameters: - samples (list) – The samples to tabulate (default is all samples)
- cumulative – A flag to specify whether the freqs are cumulative (default = False)
-
unicode_repr
()¶ Return a string representation of this FreqDist.
Return type: string
-
-
class
nltk.probability.
SimpleGoodTuringProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
SimpleGoodTuringProbDist fits the relationship between a frequency and its frequency of frequency as a straight line in log space, using linear regression. Details of the Simple Good-Turing algorithm can be found in:
- “Good Turing smoothing without tears” (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2 pp. 217-237.
- “Speech and Language Processing” (Jurafsky & Martin), 2nd Edition, Chapter 4.5 p103 (log(Nc) = a + b*log(c))
- http://www.grsampson.net/RGoodTur.html
Given a set of pairs (xi, yi), where xi denotes the frequency and yi denotes the frequency of frequency, we want to minimize the sum of squared errors of the fitted line. E(x) and E(y) represent the means of xi and yi.
- slope: b = sigma((xi-E(x))(yi-E(y))) / sigma((xi-E(x))(xi-E(x)))
- intercept: a = E(y) - b*E(x)
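The slope and intercept formulas can be sketched as an ordinary least-squares fit in log space. This is a minimal illustration of the regression step only, not the class's actual code: the function name fit_log_log_line and the data are hypothetical, and the full Simple Good-Turing procedure involves further steps that are not shown.

```python
import math

def fit_log_log_line(r, nr):
    """Least-squares fit of log(nr) = a + b*log(r); returns (slope b, intercept a)."""
    xs = [math.log(ri) for ri in r]
    ys = [math.log(ni) for ni in nr]
    x_mean = sum(xs) / len(xs)                      # E(x)
    y_mean = sum(ys) / len(ys)                      # E(y)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical counts that follow Nr = 100 * r**-2 exactly
r = [1, 2, 4, 5]
nr = [100.0, 25.0, 6.25, 4.0]
slope, intercept = fit_log_log_line(r, nr)
```

Because the made-up data lie exactly on a power law, the fit recovers a slope of about -2 and an intercept of about log(100).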
-
SUM_TO_ONE
= False¶
-
discount
()[source]¶ Return the total probability mass transferred from the seen samples to the unseen samples.
-
find_best_fit
(r, nr)[source]¶ Use simple linear regression to tune parameters self._slope and self._intercept in the log-log space based on count and Nr(count) (Work in log space to avoid floating point underflow.)
-
prob
(sample)[source]¶ Return the sample’s probability.
Parameters: sample (str) – sample of the event Return type: float
-
class
nltk.probability.
HeldoutProbDist
(base_fdist, heldout_fdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The heldout estimate for the probability distribution of the experiment used to generate two frequency distributions. These two frequency distributions are called the “heldout frequency distribution” and the “base frequency distribution.” The “heldout estimate” uses the “heldout frequency distribution” to predict the probability of each sample, given its frequency in the “base frequency distribution”.
In particular, the heldout estimate approximates the probability for a sample that occurs r times in the base distribution as the average frequency in the heldout distribution of all samples that occur r times in the base distribution.
This average frequency is Tr[r]/(Nr[r].N), where:
- Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
- Nr[r] is the number of samples that occur r times in the base distribution.
- N is the number of outcomes recorded by the heldout frequency distribution.
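The quantity Tr[r]/(Nr[r].N) can be worked through with two small counters. The sketch below is plain Python and does not use NLTK; the function name heldout_estimates and the counts are assumptions for illustration, and the r = 0 case (which would need the number of bins) is left out.

```python
from collections import Counter, defaultdict

def heldout_estimates(base, heldout):
    """Return {r: Tr[r] / (Nr[r] * N)} for every r observed in the base counts."""
    n = sum(heldout.values())        # N: outcomes recorded in the heldout distribution
    tr = defaultdict(int)            # Tr[r]: heldout count mass of samples seen r times in base
    nr = defaultdict(int)            # Nr[r]: number of samples seen r times in base
    for sample, r in base.items():
        nr[r] += 1
        tr[r] += heldout[sample]     # Counter returns 0 for samples absent from heldout
    return {r: tr[r] / (nr[r] * n) for r in nr}

base = Counter({"a": 2, "b": 2, "c": 1})
heldout = Counter({"a": 3, "b": 1, "c": 1, "d": 3})
est = heldout_estimates(base, heldout)
```

With these counts, samples seen twice in the base data get 4/(2*8) and the sample seen once gets 1/(1*8).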
In order to increase the efficiency of the prob member function, Tr[r]/(Nr[r].N) is precomputed for each value of r when the HeldoutProbDist is created.
Variables: - _estimate – A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample. _estimate[r] is calculated by finding the average frequency in the heldout distribution of all samples that occur r times in the base distribution. In particular, _estimate[r] = Tr[r]/(Nr[r].N).
- _max_r – The maximum number of times that any sample occurs in the base distribution. _max_r is used to decide how large _estimate must be.
-
SUM_TO_ONE
= False¶
-
base_fdist
()[source]¶ Return the base frequency distribution that this probability distribution is based on.
Return type: FreqDist
-
class
nltk.probability.
LaplaceProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.LidstoneProbDist
The Laplace estimate for the probability distribution of the experiment used to generate a frequency distribution. The “Laplace estimate” approximates the probability of a sample with count c from an experiment with N outcomes and B bins as (c+1)/(N+B). This is equivalent to adding one to the count for each bin, and taking the maximum likelihood estimate of the resulting frequency distribution.
-
class
nltk.probability.
LidstoneProbDist
(freqdist, gamma, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The Lidstone estimate for the probability distribution of the experiment used to generate a frequency distribution. The “Lidstone estimate” is parameterized by a real number gamma, which typically ranges from 0 to 1. The Lidstone estimate approximates the probability of a sample with count c from an experiment with N outcomes and B bins as (c+gamma)/(N+B*gamma). This is equivalent to adding gamma to the count for each bin, and taking the maximum likelihood estimate of the resulting frequency distribution.
-
SUM_TO_ONE
= False¶
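The Lidstone formula, and the Laplace estimate as its gamma = 1 case, can be checked by hand with a short sketch. This is plain Python rather than a call into NLTK; lidstone_prob and the counts are hypothetical names and data chosen for the example.

```python
from collections import Counter

def lidstone_prob(counts, sample, gamma, bins):
    """(c + gamma) / (N + B*gamma) for one sample."""
    n = sum(counts.values())                      # N: total outcomes
    return (counts[sample] + gamma) / (n + bins * gamma)

counts = Counter({"x": 3, "y": 1})
p_seen = lidstone_prob(counts, "x", gamma=0.5, bins=4)    # (3 + 0.5) / (4 + 2)
p_unseen = lidstone_prob(counts, "z", gamma=0.5, bins=4)  # (0 + 0.5) / (4 + 2)
p_laplace = lidstone_prob(counts, "x", gamma=1, bins=4)   # Laplace case: (3 + 1) / (4 + 4)
```

Unseen samples receive a small nonzero probability, which is the point of adding gamma to every bin.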
-
-
class
nltk.probability.
MLEProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The maximum likelihood estimate for the probability distribution of the experiment used to generate a frequency distribution. The “maximum likelihood estimate” approximates the probability of each sample as the frequency of that sample in the frequency distribution.
-
class
nltk.probability.
MutableProbDist
(prob_dist, samples, store_logs=True)[source]¶ Bases:
nltk.probability.ProbDistI
A mutable probdist whose probabilities may be easily modified. This simply copies an existing probdist, storing the probability values in a mutable dictionary and providing an update method.
-
update
(sample, prob, log=True)[source]¶ Update the probability for the given sample. This may cause the object to stop being a valid probability distribution; the user must ensure that they update the sample probabilities such that all samples have probabilities between 0 and 1 and that all probabilities sum to one.
Parameters: - sample (any) – the sample for which to update the probability
- prob (float) – the new probability
- log (bool) – whether the given probability is already a log probability
-
-
class
nltk.probability.
KneserNeyProbDist
(freqdist, bins=None, discount=0.75)[source]¶ Bases:
nltk.probability.ProbDistI
Kneser-Ney estimate of a probability distribution. This is a version of back-off that estimates how likely an n-gram is, given that its (n-1)-gram has been seen in training. It extends the ProbDistI interface and requires a trigram FreqDist instance to train on. Optionally, a discount value other than the default can be specified. The default discount is 0.75.
-
discount
()[source]¶ Return the value by which counts are discounted. By default set to 0.75.
Return type: float
-
-
class
nltk.probability.
ProbDistI
[source]¶ Bases:
object
A probability distribution for the outcomes of an experiment. A probability distribution specifies how likely it is that an experiment will have any given outcome. For example, a probability distribution could be used to predict the probability that a token in a document will have a given type. Formally, a probability distribution can be defined as a function mapping from samples to nonnegative real numbers, such that the sum of every number in the function’s range is 1.0. A ProbDist is often used to model the probability distribution of the experiment used to generate a frequency distribution.
-
SUM_TO_ONE
= True¶ True if the probabilities of the samples in this probability distribution will always sum to one.
-
discount
()[source]¶ Return the ratio by which counts are discounted on average: c*/c
Return type: float
-
generate
()[source]¶ Return a randomly selected sample from this probability distribution. The probability of returning each sample samp is equal to self.prob(samp).
-
logprob
(sample)[source]¶ Return the base 2 logarithm of the probability for a given sample.
Parameters: sample (any) – The sample whose probability should be returned. Return type: float
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
Return type: any
-
-
class
nltk.probability.
ProbabilisticMixIn
(**kwargs)[source]¶ Bases:
object
A mix-in class to associate probabilities with other classes (trees, rules, etc.). To use the ProbabilisticMixIn class, define a new class that derives from an existing class and from ProbabilisticMixIn. You will need to define a new constructor for the new class, which explicitly calls the constructors of both its parent classes. For example:
>>> from nltk.probability import ProbabilisticMixIn
>>> class A:
...     def __init__(self, x, y): self.data = (x,y)
...
>>> class ProbabilisticA(A, ProbabilisticMixIn):
...     def __init__(self, x, y, **prob_kwarg):
...         A.__init__(self, x, y)
...         ProbabilisticMixIn.__init__(self, **prob_kwarg)
See the documentation for the ProbabilisticMixIn constructor for information about the arguments it expects.
You should generally also redefine the string representation methods, the comparison methods, and the hashing method.
-
logprob
()[source]¶ Return log(p), where p is the probability associated with this object.
Return type: float
-
-
class
nltk.probability.
UniformProbDist
(samples)[source]¶ Bases:
nltk.probability.ProbDistI
A probability distribution that assigns equal probability to each sample in a given set; and a zero probability to all other samples.
-
unicode_repr
()¶
-
-
class
nltk.probability.
WittenBellProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The Witten-Bell estimate of a probability distribution. This distribution allocates uniform probability mass to as yet unseen events by using the number of events that have only been seen once. The probability mass reserved for unseen events is equal to T / (N + T) where T is the number of observed event types and N is the total number of observed events. This equates to the maximum likelihood estimate of a new type event occurring. The remaining probability mass is discounted such that all probability estimates sum to one, yielding:
- p = T / (Z (N + T)), if count = 0
- p = c / (N + T), otherwise
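The two cases can be checked numerically with a small sketch. It is plain Python and not NLTK's code. The text above does not define Z; the sketch assumes Z is the number of bins with zero count (bins minus T), which is what makes the estimates sum to one. The name witten_bell_prob and the five-item sample space are hypothetical.

```python
from collections import Counter

def witten_bell_prob(counts, sample, bins):
    """Witten-Bell estimate following the two-case formula above."""
    n = sum(counts.values())        # N: total observed events
    t = len(counts)                 # T: observed event types
    z = bins - t                    # Z: assumed to be the number of unseen bins
    c = counts[sample]
    if c == 0:
        return t / (z * (n + t))    # share of the reserved mass T / (N + T)
    return c / (n + t)

counts = Counter({"a": 3, "b": 1})
vocab = ["a", "b", "c", "d", "e"]   # hypothetical sample space with 5 bins
total = sum(witten_bell_prob(counts, s, bins=len(vocab)) for s in vocab)
```

Under that assumption the seen samples keep N / (N + T) of the mass and the unseen bins split the remaining T / (N + T) evenly.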
text
Module¶
This module brings together a variety of NLTK functionality for text analysis, and provides simple, interactive interfaces. Functionality includes: concordancing, collocation discovery, regular expression search over tokenized strings, and distributional similarity.
-
class
nltk.text.
ContextIndex
(tokens, context_func=None, filter=None, key=<function ContextIndex.<lambda>>)[source]¶ Bases:
object
A bidirectional index between words and their ‘contexts’ in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word; but other definitions may also be used by providing a custom context function.
-
common_contexts
(words, fail_on_unknown=False)[source]¶ Find contexts where the specified words can all appear; and return a frequency distribution mapping each context to the number of times that context was used.
Parameters: - words (str) – The words used to seed the similarity search
- fail_on_unknown – If true, then raise a value error if any of the given words do not occur at all in the index.
-
-
class
nltk.text.
ConcordanceIndex
(tokens, key=<function ConcordanceIndex.<lambda>>)[source]¶ Bases:
object
An index that can be used to look up the offset locations at which a given word occurs in a document.
-
offsets
(word)[source]¶ Return type: list(int) Returns: A list of the offset positions at which the given word occurs. If a key function was specified for the index, then the given word’s key will be looked up.
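An offset index of this kind can be sketched with a dictionary of lists. The code below is a standalone illustration, not the ConcordanceIndex implementation; build_offsets and the token list are hypothetical.

```python
from collections import defaultdict

def build_offsets(tokens, key=lambda tok: tok):
    """Map key(token) to the list of positions where that token occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[key(token)].append(position)
    return index

tokens = "The cat saw the other cat".split()
index = build_offsets(tokens, key=str.lower)
the_offsets = index[str.lower("The")]   # the query word goes through the same key
cat_offsets = index["cat"]
```

Applying the key function to both the indexed tokens and the query word is what makes a lookup such as "The" find the lowercase occurrences too.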
-
print_concordance
(word, width=75, lines=25)[source]¶ Print a concordance for word with the specified context window.
Parameters: - word (str) – The target word
- width (int) – The width of each line, in characters (default=75)
- lines (int) – The number of lines to display (default=25)
-
tokens
()[source]¶ Return type: list(str) Returns: The document that this concordance index was created from.
-
unicode_repr
()¶
-
-
class
nltk.text.
TokenSearcher
(tokens)[source]¶ Bases:
object
A class that makes it easier to use regular expressions to search over tokenized strings. The tokenized string is converted to a string where tokens are marked with angle brackets – e.g., '<the><window><is><still><open>'. The regular expression passed to the findall() method is modified to treat angle brackets as non-capturing parentheses, in addition to matching the token boundaries; and to have '.' not match the angle brackets.
-
findall
(regexp)[source]¶ Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
>>> from nltk.text import TokenSearcher
>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good; mature; white; Cape; great; wise; wise; butterless; white; fiendish; pale; furious; better; certain; complete; dismasted; younger; brave; brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing that; that that thing; through these than through; them that the; through the thick; them that they; thought that the
Parameters: regexp (str) – A regular expression
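The pattern rewriting described for TokenSearcher can be approximated in a few lines. The sketch below is an assumption-laden illustration, not NLTK's implementation: token_findall is a hypothetical helper, it returns whole matches as lists of tokens rather than printing them, and it ignores capturing groups in the pattern.

```python
import re

def token_findall(tokens, pattern):
    """Search a token list with an angle-bracket pattern; return lists of tokens."""
    raw = "".join("<" + tok + ">" for tok in tokens)
    regexp = re.sub(r"\s", "", pattern)
    regexp = re.sub(r"<", "(?:<(?:", regexp)       # open a non-capturing token group
    regexp = re.sub(r">", ")>)", regexp)           # close it at the token boundary
    regexp = re.sub(r"(?<!\\)\.", "[^>]", regexp)  # '.' must not cross a '>'
    hits = [m.group(0) for m in re.finditer(regexp, raw)]
    return [hit[1:-1].split("><") for hit in hits]

tokens = ["the", "window", "is", "still", "open"]
after_the = token_findall(tokens, "<the><.*>")
before_is = token_findall(tokens, "<.*><is>")
```

Replacing '.' with a character class that excludes '>' is what keeps a wildcard such as `<.*>` confined to a single token.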
-
-
class
nltk.text.
Text
(tokens, name=None)[source]¶ Bases:
object
A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.
A Text is typically initialized from a given document or corpus. E.g.:
>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
-
collocations
(num=20, window_size=2)[source]¶ Print collocations derived from the text, ignoring stopwords.
Seealso: find_collocations
Parameters: - num (int) – The maximum number of collocations to print.
- window_size (int) – The number of tokens spanned by a collocation (default=2)
-
common_contexts
(words, num=20)[source]¶ Find contexts where the specified words appear; list most frequent common contexts first.
Parameters: - words (list(str)) – The words used to seed the similarity search
- num (int) – The number of words to generate (default=20)
Seealso: ContextIndex.common_contexts()
-
concordance
(word, width=79, lines=25)[source]¶ Print a concordance for word with the specified context window. Word matching is not case-sensitive.
Seealso: ConcordanceIndex
-
dispersion_plot
(words)[source]¶ Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.
Parameters: words (list(str)) – The words to be plotted Seealso: nltk.draw.dispersion_plot()
-
findall
(regexp)[source]¶ Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good; mature; white; Cape; great; wise; wise; butterless; white; fiendish; pale; furious; better; certain; complete; dismasted; younger; brave; brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing that; that that thing; through these than through; them that the; through the thick; them that they; thought that the
Parameters: regexp (str) – A regular expression
-
similar
(word, num=20)[source]¶ Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
Parameters: - word (str) – The word used to seed the similarity search
- num (int) – The number of words to generate (default=20)
Seealso: ContextIndex.similar_words()
-
unicode_repr
()¶
-
-
class
nltk.text.
TextCollection
(source)[source]¶ Bases:
nltk.text.Text
A collection of texts, which can be loaded with a list of texts, or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc. Initialize a TextCollection as follows:
>>> import nltk.corpus
>>> from nltk.text import TextCollection
>>> print('hack'); from nltk.book import text1, text2, text3
hack...
>>> gutenberg = TextCollection(nltk.corpus.gutenberg)
>>> mytexts = TextCollection([text1, text2, text3])
Iterating over a TextCollection produces all the tokens of all the texts in order.
toolbox
Module¶
Module for reading, writing and manipulating Toolbox databases and settings files.
-
class
nltk.toolbox.
StandardFormat
(filename=None, encoding=None)[source]¶ Bases:
object
Class for reading and processing standard format marker files and strings.
-
fields
(strip=True, unwrap=True, encoding=None, errors='strict', unicode_fields=None)[source]¶ Return an iterator that returns the next field in a (marker, value) tuple, where marker and value are unicode strings if an encoding was specified in the fields() method. Otherwise they are non-unicode strings.
Parameters: - strip (bool) – strip trailing whitespace from the last line of each field
- unwrap (bool) – Convert newlines in a field to spaces.
- encoding (str or None) – Name of an encoding to use. If it is specified then the fields() method returns unicode strings rather than non unicode strings.
- errors (str) – Error handling scheme for codec. Same as the decode() builtin string method.
- unicode_fields (sequence) – Set of marker names whose values are UTF-8 encoded. Ignored if encoding is None. If the whole file is UTF-8 encoded set encoding='utf8' and leave unicode_fields with its default value of None.
Return type:
-
open
(sfm_file)[source]¶ Open a standard format marker file for sequential reading.
Parameters: sfm_file (str) – name of the standard format marker input file
-
-
class
nltk.toolbox.
ToolboxData
(filename=None, encoding=None)[source]¶ Bases:
nltk.toolbox.StandardFormat
-
class
nltk.toolbox.
ToolboxSettings
[source]¶ Bases:
nltk.toolbox.StandardFormat
This class is the base class for settings files.
-
nltk.toolbox.
add_blank_lines
(tree, blanks_before, blanks_between)[source]¶ Add blank lines before all elements and subelements specified in blanks_before.
Parameters: - tree (ElementTree._ElementInterface) – toolbox data in an elementtree structure
- blanks_before (dict(tuple)) – elements and subelements to add blank lines before
-
nltk.toolbox.
add_default_fields
(elem, default_fields)[source]¶ Add blank elements and subelements specified in default_fields.
Parameters: - elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
- default_fields (dict(tuple)) – fields to add to each type of element and subelement
-
nltk.toolbox.
remove_blanks
(elem)[source]¶ Remove all elements and subelements with no text and no child elements.
Parameters: elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
-
nltk.toolbox.
sort_fields
(elem, field_orders)[source]¶ Sort the elements and subelements in order specified in field_orders.
Parameters: - elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
- field_orders (dict(tuple)) – order of fields for each type of element and subelement
translate
Module¶
Experimental features for machine translation. These interfaces are prone to change.
tree
Module¶
Class for representing hierarchical language structures, such as syntax trees and morphological trees.
-
class
nltk.tree.
ImmutableProbabilisticTree
(node, children=None, **prob_kwargs)[source]¶ Bases:
nltk.tree.ImmutableTree
,nltk.probability.ProbabilisticMixIn
-
unicode_repr
()¶
-
-
class
nltk.tree.
ImmutableTree
(node, children=None)[source]¶ Bases:
nltk.tree.Tree
-
class
nltk.tree.
ProbabilisticMixIn
(**kwargs)[source]¶ Bases:
object
A mix-in class to associate probabilities with other classes (trees, rules, etc.). To use the ProbabilisticMixIn class, define a new class that derives from an existing class and from ProbabilisticMixIn. You will need to define a new constructor for the new class, which explicitly calls the constructors of both its parent classes. For example:
>>> from nltk.probability import ProbabilisticMixIn
>>> class A:
...     def __init__(self, x, y): self.data = (x,y)
...
>>> class ProbabilisticA(A, ProbabilisticMixIn):
...     def __init__(self, x, y, **prob_kwarg):
...         A.__init__(self, x, y)
...         ProbabilisticMixIn.__init__(self, **prob_kwarg)
See the documentation for the ProbabilisticMixIn constructor for information about the arguments it expects.
You should generally also redefine the string representation methods, the comparison methods, and the hashing method.
-
logprob
()[source]¶ Return log(p), where p is the probability associated with this object.
Return type: float
-
-
class
nltk.tree.
ProbabilisticTree
(node, children=None, **prob_kwargs)[source]¶ Bases:
nltk.tree.Tree
,nltk.probability.ProbabilisticMixIn
-
unicode_repr
()¶
-
-
class
nltk.tree.
Tree
(node, children=None)[source]¶ Bases:
list
A Tree represents a hierarchical grouping of leaves and subtrees. For example, each constituent in a syntax tree is represented by a single Tree.
A tree’s children are encoded as a list of leaves and subtrees, where a leaf is a basic (non-tree) value; and a subtree is a nested Tree.
>>> from nltk.tree import Tree
>>> print(Tree(1, [2, Tree(3, [4]), 5]))
(1 2 (3 4) 5)
>>> vp = Tree('VP', [Tree('V', ['saw']),
...                  Tree('NP', ['him'])])
>>> s = Tree('S', [Tree('NP', ['I']), vp])
>>> print(s)
(S (NP I) (VP (V saw) (NP him)))
>>> print(s[1])
(VP (V saw) (NP him))
>>> print(s[1,1])
(NP him)
>>> t = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")
>>> s == t
True
>>> t[1][1].set_label('X')
>>> t[1][1].label()
'X'
>>> print(t)
(S (NP I) (VP (V saw) (X him)))
>>> t[0], t[1,1] = t[1,1], t[0]
>>> print(t)
(S (X him) (VP (V saw) (NP I)))
The length of a tree is the number of children it has.
>>> len(t)
2
The set_label() and label() methods allow individual constituents to be labeled. For example, syntax trees use this label to specify phrase tags, such as “NP” and “VP”.
Several Tree methods use “tree positions” to specify children or descendants of a tree. Tree positions are defined as follows:
- The tree position i specifies a Tree’s ith child.
- The tree position () specifies the Tree itself.
- If p is the tree position of descendant d, then p+i specifies the ith child of d.
I.e., every tree position is either a single index i, specifying tree[i]; or a sequence i1, i2, ..., iN, specifying tree[i1][i2]...[iN].
Construct a new tree. This constructor can be called in one of two ways:
- Tree(label, children) constructs a new tree with the specified label and list of children.
- Tree.fromstring(s) constructs a new tree by parsing the string s.
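The tree-position rules can be illustrated without NLTK by following a tuple of child indexes through a nested structure. The encoding used here (a subtree is a (label, children) pair and a leaf is a plain string) and the helper subtree_at are assumptions made for this sketch, not the Tree class itself.

```python
def subtree_at(tree, position):
    """Follow a tree position (a tuple of child indexes) down from the root."""
    node = tree
    for i in position:
        node = node[1][i]      # node is (label, children); descend to child i
    return node

# Hypothetical encoding: a subtree is (label, [children]); a leaf is a plain string.
s = ("S", [("NP", ["I"]), ("VP", [("V", ["saw"]), ("NP", ["him"])])])

root = subtree_at(s, ())                 # the empty position is the tree itself
object_np = subtree_at(s, (1, 1))        # second child of the second child
leaf = subtree_at(s, (1, 1) + (0,))      # p+i: first child of the node at p
```

The last line shows the third rule: extending a position by one index selects a child of the descendant at that position.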
-
chomsky_normal_form
(factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')[source]¶ This method can modify a tree in three ways:
- Convert a tree into its Chomsky Normal Form (CNF) equivalent – Every subtree has either two non-terminals or one terminal as its children. This process requires the creation of more “artificial” non-terminal nodes.
- Markov (horizontal) smoothing of children in new artificial nodes
- Vertical (parent) annotation of nodes
Parameters: - factor (str = [left|right]) – Right or left factoring method (default = “right”)
- horzMarkov (int | None) – Markov order for sibling smoothing in artificial nodes (None (default) = include all siblings)
- vertMarkov (int | None) – Markov order for parent smoothing (0 (default) = no vertical annotation)
- childChar (str) – A string used in construction of the artificial nodes, separating the head of the original subtree from the child nodes that have yet to be expanded (default = “|”)
- parentChar (str) – A string used to separate the node representation from its vertical annotation (default = “^”)
-
collapse_unary
(collapsePOS=False, collapseRoot=False, joinChar='+')[source]¶ Collapse subtrees with a single child (ie. unary productions) into a new non-terminal (Tree node) joined by ‘joinChar’. This is useful when working with algorithms that do not allow unary productions, and completely removing the unary productions would require loss of useful information. The Tree is modified directly (since it is passed by reference) and no value is returned.
Parameters: - collapsePOS (bool) – ‘False’ (default) will not collapse the parent of leaf nodes (ie. Part-of-Speech tags) since they are always unary productions
- collapseRoot (bool) – ‘False’ (default) will not modify the root production if it is unary. For the Penn WSJ treebank corpus, this corresponds to the TOP -> productions.
- joinChar (str) – A string used to connect collapsed node values (default = “+”)
-
classmethod
convert
(tree)[source]¶ Convert a tree between different subtypes of Tree. cls determines which class will be used to encode the new tree.
Parameters: tree (Tree) – The tree that should be converted. Returns: The new Tree.
-
flatten
()[source]¶ Return a flat version of the tree, with all non-root non-terminals removed.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> print(t.flatten())
(S the dog chased the cat)
Returns: a tree consisting of this tree’s root connected directly to its leaves, omitting all intervening non-terminal nodes. Return type: Tree
-
classmethod
fromstring
(s, brackets='()', read_node=None, read_leaf=None, node_pattern=None, leaf_pattern=None, remove_empty_top_bracketing=False)[source]¶ Read a bracketed tree string and return the resulting tree. Trees are represented as nested bracketings, such as:
(S (NP (NNP John)) (VP (V runs)))
Parameters: - s (str) – The string to read
- brackets (str (length=2)) – The bracket characters used to mark the beginning and end of trees and subtrees.
- read_node, read_leaf – If specified, these functions are applied to the substrings of s corresponding to nodes and leaves (respectively) to obtain the values for those nodes and leaves. They should have the following signature: read_node(str) -> value. For example, these functions could be used to process nodes and leaves whose values should be some type other than string (such as FeatStruct). Note that by default, node strings and leaf strings are delimited by whitespace and brackets; to override this default, use the node_pattern and leaf_pattern arguments.
- node_pattern, leaf_pattern – Regular expression patterns used to find node and leaf substrings in s. By default, both patterns are defined to match any sequence of non-whitespace non-bracket characters.
- remove_empty_top_bracketing (bool) – If the resulting tree has an empty node label, and is length one, then return its single child instead. This is useful for treebank trees, which sometimes contain an extra level of bracketing.
Returns: A tree corresponding to the string representation s. If this class method is called using a subclass of Tree, then it will return a tree of that type.
Return type:
-
height
()[source]¶ Return the height of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.height()
5
>>> print(t[0,0])
(D the)
>>> t[0,0].height()
2
Returns: The height of this tree. The height of a tree containing no children is 1; the height of a tree containing only leaves is 2; and the height of any other tree is one plus the maximum of its children’s heights. Return type: int
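The height rule stated above translates directly into a short recursion. The sketch reuses an assumed nested encoding (a subtree is a (label, children) pair, a leaf is a plain string) instead of the Tree class, and the function name height is local to the example.

```python
def height(node):
    """Height per the rule above; a leaf (plain string) counts as one level."""
    if isinstance(node, str):
        return 1
    label, children = node
    # A childless tree has height 1; otherwise add one to the tallest child.
    return 1 + max((height(child) for child in children), default=0)

t = ("S", [("NP", [("D", ["the"]), ("N", ["dog"])]),
           ("VP", [("V", ["chased"]),
                   ("NP", [("D", ["the"]), ("N", ["cat"])])])])
tree_height = height(t)
preterminal_height = height(("D", ["the"]))
```

The values agree with the doctest above: 5 for the whole sentence and 2 for a node that dominates only a leaf.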
-
label
()[source]¶ Return the node label of the tree.
>>> t = Tree.fromstring('(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))')
>>> t.label()
'S'
Returns: the node label (typically a string) Return type: any
-
leaf_treeposition
(index)[source]¶ Returns: The tree position of the index-th leaf in this tree. I.e., if tp=self.leaf_treeposition(i), then self[tp]==self.leaves()[i].
Raises: IndexError – If this tree contains fewer than index+1 leaves, or if index<0.
-
leaves
()[source]¶ Return the leaves of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.leaves()
['the', 'dog', 'chased', 'the', 'cat']
Returns: a list containing this tree’s leaves. The order reflects the order of the leaves in the tree’s hierarchical structure. Return type: list
-
node
¶ Outdated method to access the node value; use the label() method instead.
-
pformat
(margin=70, indent=0, nodesep='', parens='()', quotes=False)[source]¶ Returns: A pretty-printed string representation of this tree.
Return type:
Parameters: - margin (int) – The right margin at which to do line-wrapping.
- indent (int) – The indentation level at which printing begins. This number is used to decide how far to indent subsequent lines.
- nodesep – A string that is used to separate the node from the children. E.g., the value ':' gives trees like (S: (NP: I) (VP: (V: saw) (NP: it))).
-
pformat_latex_qtree
()[source]¶ Returns a representation of the tree compatible with the LaTeX qtree package. This consists of the string \Tree followed by the tree represented in bracketed notation.
For example, the following result was generated from a parse tree of the sentence The announcement astounded us:
\Tree [.I'' [.N'' [.D The ] [.N' [.N announcement ] ] ] [.I' [.V'' [.V' [.V astounded ] [.N'' [.N' [.N us ] ] ] ] ] ] ]
See http://www.ling.upenn.edu/advice/latex.html for the LaTeX style file for the qtree package.
Returns: A latex qtree representation of this tree. Return type: str
-
pos
()[source]¶ Return a sequence of pos-tagged words extracted from the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.pos()
[('the', 'D'), ('dog', 'N'), ('chased', 'V'), ('the', 'D'), ('cat', 'N')]
Returns: a list of tuples containing leaves and pre-terminals (part-of-speech tags). The order reflects the order of the leaves in the tree’s hierarchical structure. Return type: list(tuple)
-
pretty_print
(sentence=None, highlight=(), stream=None, **kwargs)[source]¶ Pretty-print this tree as ASCII or Unicode art. For explanation of the arguments, see the documentation for nltk.treeprettyprinter.TreePrettyPrinter.
-
productions
()[source]¶ Generate the productions that correspond to the non-terminal nodes of the tree. For each subtree of the form (P: C1 C2 ... Cn) this produces a production of the form P -> C1 C2 ... Cn.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.productions()
[S -> NP VP, NP -> D N, D -> 'the', N -> 'dog', VP -> V NP, V -> 'chased', NP -> D N, D -> 'the', N -> 'cat']
Return type: list(Production)
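The rule that each subtree (P: C1 C2 ... Cn) yields P -> C1 C2 ... Cn can be sketched as a preorder walk. The code below is a standalone illustration on an assumed nested encoding (a subtree is a (label, children) pair, a leaf is a plain string); it produces plain strings rather than Production objects, and quotes terminals to tell them apart from non-terminals.

```python
def productions(node):
    """List 'P -> C1 ... Cn' strings for every subtree, in preorder."""
    label, children = node
    # Leaves are shown quoted; subtrees contribute their label.
    rhs = [repr(c) if isinstance(c, str) else c[0] for c in children]
    result = [label + " -> " + " ".join(rhs)]
    for child in children:
        if not isinstance(child, str):
            result.extend(productions(child))
    return result

t = ("S", [("NP", [("D", ["the"]), ("N", ["dog"])]),
           ("VP", [("V", ["chased"]),
                   ("NP", [("D", ["the"]), ("N", ["cat"])])])])
rules = productions(t)
```

The resulting list follows the same order as the doctest above, starting with the root production and repeating NP -> D N for the second noun phrase.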
-
set_label
(label)[source]¶ Set the node label of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.set_label("T")
>>> print(t)
(T (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))
Parameters: label (any) – the node label (typically a string)
-
subtrees
(filter=None)[source]¶ Generate all the subtrees of this tree, optionally restricted to trees matching the filter function.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> for s in t.subtrees(lambda t: t.height() == 2):
...     print(s)
(D the)
(N dog)
(V chased)
(D the)
(N cat)
Parameters: filter (function) – the function to filter all local trees
-
treeposition_spanning_leaves
(start, end)[source]¶ Returns: The tree position of the lowest descendant of this tree that dominates self.leaves()[start:end].
Raises: ValueError – if end <= start
-
treepositions
(order='preorder')[source]¶
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.treepositions()
[(), (0,), (0, 0), (0, 0, 0), (0, 1), (0, 1, 0), (1,), (1, 0), (1, 0, 0), ...]
>>> for pos in t.treepositions('leaves'):
...     t[pos] = t[pos][::-1].upper()
>>> print(t)
(S (NP (D EHT) (N GOD)) (VP (V DESAHC) (NP (D EHT) (N TAC))))
Parameters: order – One of: preorder, postorder, bothorder, leaves.
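A tree position is a tuple of child indices leading from the root to a node. The following standalone sketch enumerates positions in preorder for a tree encoded as nested `(label, children)` tuples. Both the encoding and the name `treepositions_sketch` are assumptions for illustration, not NLTK's implementation.

```python
def treepositions_sketch(tree, path=()):
    """Yield the position of every node (including leaves) in preorder."""
    yield path
    if not isinstance(tree, str):            # leaves are plain strings
        _label, children = tree
        for index, child in enumerate(children):
            yield from treepositions_sketch(child, path + (index,))


# Hypothetical toy tree: (S (NP (D the) (N dog)) (VP (V barked)))
toy_tree = ("S", [("NP", [("D", ["the"]), ("N", ["dog"])]),
                  ("VP", [("V", ["barked"])])])
print(list(treepositions_sketch(toy_tree)))
```

The empty tuple `()` denotes the root, and `(0, 1, 0)` denotes the first child of the second child of the first child of the root.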
-
un_chomsky_normal_form
(expandUnary=True, childChar='|', parentChar='^', unaryChar='+')[source]¶ This method modifies the tree in three ways:
- Transforms a tree in Chomsky Normal Form back to its original structure (branching greater than two)
- Removes any parent annotation (if it exists)
- (optional) expands unary subtrees (if previously collapsed with collapseUnary(...) )
Parameters: - expandUnary (bool) – Flag to expand unary or not (default = True)
- childChar (str) – A string separating the head node from its children in an artificial node (default = “|”)
- parentChar (str) – A string separating the node label from its parent annotation (default = “^”)
- unaryChar (str) – A string joining two non-terminals in a unary production (default = “+”)
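The first transformation listed above (restoring branching greater than two) can be sketched as splicing the children of every artificial node back into its parent. This is a simplified illustration over nested `(label, children)` tuples. The name `unbinarize_sketch` is hypothetical, and the sketch assumes that only artificial nodes contain the child separator in their label. It does not handle parent annotation or unary expansion.

```python
def unbinarize_sketch(node, child_char="|"):
    """Return a copy of the tree with artificial (binarization) nodes removed."""
    if isinstance(node, str):
        return node
    label, children = node
    restored = []
    for child in children:
        child = unbinarize_sketch(child, child_char)
        if not isinstance(child, str) and child_char in child[0]:
            restored.extend(child[1])   # splice the artificial node's children
        else:
            restored.append(child)
    return (label, restored)


binarized = ("A", [("B", ["b"]),
                   ("A|<C-D>", [("C", ["c"]), ("D", ["d"])])])
print(unbinarize_sketch(binarized))
```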
-
unicode_repr
()¶
-
nltk.tree.
sinica_parse
(s)[source]¶ Parse a Sinica Treebank string and return a tree. Trees are represented as nested bracketings, as shown in the following example (X represents a Chinese character):
S(goal:NP(Head:Nep:XX)|theme:NP(Head:Nhaa:X)|quantity:Dab:X|Head:VL2:X)#0(PERIODCATEGORY)
Returns: A tree corresponding to the string representation. Return type: Tree Parameters: s (str) – The string to be converted
-
class
nltk.tree.
ParentedTree
(node, children=None)[source]¶ Bases: nltk.tree.AbstractParentedTree
A Tree that automatically maintains parent pointers for single-parented trees. The following are methods for querying the structure of a parented tree: parent, parent_index, left_sibling, right_sibling, root, treeposition.
Each ParentedTree may have at most one parent. In particular, subtrees may not be shared. Any attempt to reuse a single ParentedTree as a child of more than one parent (or as multiple children of the same parent) will cause a ValueError exception to be raised.
ParentedTrees should never be used in the same tree as Trees or MultiParentedTrees. Mixing tree implementations may result in incorrect parent pointers and in TypeError exceptions.
-
parent_index
()[source]¶ The index of this tree in its parent. I.e., ptree.parent()[ptree.parent_index()] is ptree. Note that ptree.parent_index() is not necessarily equal to ptree.parent().index(ptree), since the index() method returns the first child that is equal to its argument.
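The distinction between equality and identity can be seen with plain Python lists, which compare by value in the same way. The snippet below is only an analogy using built-in lists, not `ParentedTree` objects.

```python
# Two children that are equal in value but are distinct objects.
first_child = ["the"]
second_child = ["the"]
parent = [first_child, second_child]

# list.index() compares with ==, so it reports the first *equal* child.
equality_index = parent.index(second_child)

# An identity-based search reports where this exact object sits.
identity_index = next(i for i, child in enumerate(parent) if child is second_child)

print(equality_index, identity_index)
```

Here the two indices differ, which is why an identity-preserving index is needed when a tree contains equal subtrees.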
-
-
class
nltk.tree.
MultiParentedTree
(node, children=None)[source]¶ Bases: nltk.tree.AbstractParentedTree
A Tree that automatically maintains parent pointers for multi-parented trees. The following are methods for querying the structure of a multi-parented tree: parents(), parent_indices(), left_siblings(), right_siblings(), roots, treepositions.
Each MultiParentedTree may have zero or more parents. In particular, subtrees may be shared. If a single MultiParentedTree is used as multiple children of the same parent, then that parent will appear multiple times in its parents() method.
MultiParentedTrees should never be used in the same tree as Trees or ParentedTrees. Mixing tree implementations may result in incorrect parent pointers and in TypeError exceptions.
-
left_siblings
()[source]¶ A list of all left siblings of this tree, in any of its parent trees. A tree may be its own left sibling if it is used as multiple contiguous children of the same parent. A tree may appear multiple times in this list if it is the left sibling of this tree with respect to multiple parents.
Type: list(MultiParentedTree)
-
parent_indices
(parent)[source]¶ Return a list of the indices where this tree occurs as a child of parent. If this child does not occur as a child of parent, then the empty list is returned. The following is always true:
for parent_index in ptree.parent_indices(parent):
    parent[parent_index] is ptree
-
parents
()[source]¶ The set of parents of this tree. If this tree has no parents, then parents is the empty set. To check if a tree is used as multiple children of the same parent, use the parent_indices() method.
Type: list(MultiParentedTree)
-
right_siblings
()[source]¶ A list of all right siblings of this tree, in any of its parent trees. A tree may be its own right sibling if it is used as multiple contiguous children of the same parent. A tree may appear multiple times in this list if it is the right sibling of this tree with respect to multiple parents.
Type: list(MultiParentedTree)
-
treetransforms
Module¶
A collection of methods for tree (grammar) transformations used in parsing natural language.
Although many of these methods are technically grammar transformations (e.g., Chomsky Normal Form), when working with treebanks it is much more natural to visualize these modifications in a tree structure. Hence, we apply all transformations directly to the tree itself. Transforming the tree directly also allows us to do parent annotation. A grammar can then simply be induced from the modified tree.
The following is a short tutorial on the available transformations.
Chomsky Normal Form (binarization)
It is well known that any grammar has a Chomsky Normal Form (CNF) equivalent grammar, where CNF is defined by every production having either two non-terminals or one terminal on its right hand side. When we have hierarchically structured data (i.e., a treebank), it is natural to view this in terms of productions where the root of every subtree is the head (left hand side) of the production and all of its children are the right hand side constituents. In order to convert a tree into CNF, we simply need to ensure that every subtree has either two subtrees as children (binarization), or one leaf node (terminal). In order to binarize a subtree with more than two children, we must introduce artificial nodes.
There are two popular methods to convert a tree into CNF: left factoring and right factoring. The following example demonstrates the difference between them. Example:
Original       Right-Factored     Left-Factored

     A              A                      A
   / | \          /   \                  /   \
  B  C  D   ==>  B    A|<C-D>   OR   A|<B-C>  D
                       /  \          /  \
                      C    D        B    C

Parent Annotation
In addition to binarizing the tree, there are two standard modifications to node labels we can do in the same traversal: parent annotation and Markov order-N smoothing (or sibling smoothing).
The purpose of parent annotation is to refine the probabilities of productions by adding a small amount of context. With this simple addition, a CYK (inside-outside, dynamic programming chart parse) can improve from 74% to 79% accuracy. A natural generalization from parent annotation is to grandparent annotation and beyond. The tradeoff becomes accuracy gain vs. computational complexity. We must also keep in mind data sparsity issues. Example:
Original       Parent Annotation

     A                A^<?>
   / | \             /   \
  B  C  D   ==>  B^<A>    A|<C-D>^<?>     where ? is the
                             /  \          parent of A
                         C^<A>   D^<A>

Markov order-N smoothing
Markov smoothing combats data sparsity issues and reduces computational requirements by limiting the number of children included in artificial nodes. In practice, most people use an order 2 grammar. Example:
 Original               No Smoothing          Markov order 1   Markov order 2   etc.

  __A__                    A                      A                A
 / /|\ \                 /   \                  /   \            /   \
B C D E F     ==>       B    A|<C-D-E-F>  ==>  B   A|<C>  ==>   B  A|<C-D>
                               /   \               /   \            /   \
                              C    ...            C    ...         C    ...

Annotation decisions can be thought about in the vertical direction (parent, grandparent, etc) and the horizontal direction (number of siblings to keep). Parameters to the following functions specify these values. For more information see:
Dan Klein and Chris Manning (2003) “Accurate Unlexicalized Parsing”, ACL-03. http://www.aclweb.org/anthology/P03-1054
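The horizontal (sibling) limit described above only affects how artificial node labels are built. A small sketch follows, where `artificial_label` is a hypothetical helper written for this illustration and not an NLTK function.

```python
def artificial_label(parent, remaining_siblings, horz_markov=None, child_char="|"):
    """Build the label of an artificial node, keeping at most horz_markov siblings."""
    if horz_markov is None:
        kept = remaining_siblings                 # no smoothing: keep every sibling
    else:
        kept = remaining_siblings[:horz_markov]   # Markov order N: keep the first N
    return "%s%s<%s>" % (parent, child_char, "-".join(kept))


siblings = ["C", "D", "E", "F"]
print(artificial_label("A", siblings))                  # no smoothing
print(artificial_label("A", siblings, horz_markov=1))   # Markov order 1
print(artificial_label("A", siblings, horz_markov=2))   # Markov order 2
```

Truncating the sibling list makes many distinct long labels collapse into the same short label, which is where the reduction in sparsity comes from.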
Unary Collapsing
Collapse unary productions (i.e., subtrees with a single child) into a new non-terminal (Tree node). This is useful when working with algorithms that do not allow unary productions but you do not want to lose the parent information. Example:
   A
   |
   B   ==>   A+B
  / \        / \
 C   D      C   D
-
nltk.treetransforms.
chomsky_normal_form
(tree, factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')[source]¶
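As a rough illustration of right factoring (without Markov smoothing or parent annotation), the sketch below binarizes a tree encoded as nested `(label, children)` tuples. The encoding and the names `right_factor_sketch` and `_label_of` are assumptions for this example. Unlike the NLTK function, the sketch returns a new tree instead of modifying its argument.

```python
def _label_of(node):
    """Label of a subtree, or the token itself for a leaf."""
    return node if isinstance(node, str) else node[0]


def right_factor_sketch(node, child_char="|"):
    """Return a right-factored (binarized) copy of the tree."""
    if isinstance(node, str):
        return node
    label, children = node
    children = [right_factor_sketch(child, child_char) for child in children]
    return _factor(label, label, children, child_char)


def _factor(label, original_label, children, child_char):
    if len(children) <= 2:
        return (label, children)
    rest = children[1:]
    # Artificial node: original parent label plus the siblings it covers.
    artificial = "%s%s<%s>" % (original_label, child_char,
                               "-".join(_label_of(child) for child in rest))
    return (label, [children[0], _factor(artificial, original_label, rest, child_char)])


tree = ("A", [("B", ["b"]), ("C", ["c"]), ("D", ["d"])])
print(right_factor_sketch(tree))
```

Left factoring is symmetric: the artificial node would cover all children except the last, and would be placed as the left child.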
-
nltk.treetransforms.
un_chomsky_normal_form
(tree, expandUnary=True, childChar='|', parentChar='^', unaryChar='+')[source]¶
-
nltk.treetransforms.
collapse_unary
(tree, collapsePOS=False, collapseRoot=False, joinChar='+')[source]¶ Collapse subtrees with a single child (i.e., unary productions) into a new non-terminal (Tree node) joined by ‘joinChar’. This is useful when working with algorithms that do not allow unary productions, and completely removing the unary productions would lose useful information. The Tree is modified directly (since it is passed by reference) and no value is returned.
Parameters: - tree (Tree) – The Tree to be collapsed
- collapsePOS (bool) – ‘False’ (default) will not collapse the parent of leaf nodes (ie. Part-of-Speech tags) since they are always unary productions
- collapseRoot (bool) – ‘False’ (default) will not modify the root production if it is unary. For the Penn WSJ treebank corpus, this corresponds to the TOP -> productions.
- joinChar (str) – A string used to connect collapsed node values (default = “+”)
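The effect of the three options can be sketched on nested `(label, children)` tuples. This is an illustrative re-implementation under stated assumptions: the tuple encoding and the name `collapse_unary_sketch` are hypothetical, a part-of-speech node is taken to be a node whose children are all leaves, and the sketch returns a new tree rather than modifying its argument in place.

```python
def collapse_unary_sketch(node, collapse_pos=False, collapse_root=False,
                          join_char="+", _is_root=True):
    """Return a copy of the tree with unary chains merged into joined labels."""
    if isinstance(node, str):               # leaves are returned unchanged
        return node
    label, children = node
    if collapse_root or not _is_root:
        while len(children) == 1 and not isinstance(children[0], str):
            child_label, grandchildren = children[0]
            is_pos = all(isinstance(g, str) for g in grandchildren)
            if is_pos and not collapse_pos:
                break                       # keep part-of-speech nodes separate
            label = label + join_char + child_label
            children = grandchildren
    return (label, [collapse_unary_sketch(child, collapse_pos, collapse_root,
                                          join_char, False)
                    for child in children])


tree = ("TOP", [("A", [("B", [("C", ["c"]), ("D", ["d"])])])])
print(collapse_unary_sketch(tree))                      # root left untouched
print(collapse_unary_sketch(tree, collapse_root=True))  # root merged as well
```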
util
Module¶
-
nltk.util.
bigrams
(sequence, **kwargs)[source]¶ Return the bigrams generated from a sequence of items, as an iterator. For example:
>>> from nltk.util import bigrams
>>> list(bigrams([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]
Wrap with list for a list version of this function.
Parameters: sequence (sequence or iter) – the source data to be converted into bigrams Return type: iter(tuple)
-
nltk.util.
binary_search_file
(file, key, cache={}, cacheDepth=-1)[source]¶ Return the line from the file with first word key. Searches through a sorted file using the binary search algorithm.
Parameters: - file (file) – the file to be searched through.
- key (str) – the identifier we are searching for.
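The lookup idea can be sketched on an in-memory list of sorted lines instead of a file. The function name `binary_search_lines` is hypothetical, and the sketch omits the seeking and caching that a file-based search needs. It assumes every line is non-empty and that lines are sorted by their first word.

```python
def binary_search_lines(lines, key):
    """Return the line whose first word equals key, or None if there is none."""
    low, high = 0, len(lines) - 1
    while low <= high:
        middle = (low + high) // 2
        first_word = lines[middle].split(None, 1)[0]
        if first_word == key:
            return lines[middle]
        if first_word < key:
            low = middle + 1     # key can only be in the upper half
        else:
            high = middle - 1    # key can only be in the lower half
    return None


sorted_lines = ["apple 1", "banana 2", "cherry 3", "date 4"]
print(binary_search_lines(sorted_lines, "cherry"))
```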
-
nltk.util.
breadth_first
(tree, children=<built-in function iter>, maxdepth=-1)[source]¶ Traverse the nodes of a tree in breadth-first order. (No need to check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
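A queue-based traversal along these lines might look as follows. This is a standalone sketch: `breadth_first_sketch` and the dictionary-backed `children_of` function are hypothetical names for this example. Here the `children` argument is required and must return an iterable for every node.

```python
from collections import deque


def breadth_first_sketch(root, children, maxdepth=-1):
    """Yield nodes level by level; stop expanding once maxdepth is reached."""
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        yield node
        if depth != maxdepth:
            queue.extend((child, depth + 1) for child in children(node))


# Hypothetical toy tree stored as a mapping from node to its children.
edges = {"S": ["NP", "VP"], "NP": ["D", "N"], "VP": ["V"]}


def children_of(node):
    return iter(edges.get(node, ()))


print(list(breadth_first_sketch("S", children_of)))
print(list(breadth_first_sketch("S", children_of, maxdepth=1)))
```

With the default `maxdepth=-1` the depth check never matches, so the whole tree is traversed.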
-
nltk.util.
choose
(n, k)[source]¶ This function is a fast way to calculate binomial coefficients, commonly known as nCk, i.e. the number of combinations of n things taken k at a time. (https://en.wikipedia.org/wiki/Binomial_coefficient).
This computes the same value as scipy.special.comb() with long integer computation, but is faster; see https://github.com/nltk/nltk/issues/1181
>>> choose(4, 2)
6
>>> choose(6, 2)
15
Parameters: - n (int) – The number of things.
- k (int) – The number of times a thing is taken.
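One way to compute binomial coefficients exactly with integer arithmetic is the multiplicative formula, sketched below. The name `choose_sketch` is hypothetical, and this is not necessarily how NLTK implements `choose`.

```python
def choose_sketch(n, k):
    """Exact binomial coefficient using only integer multiplication and division."""
    if not 0 <= k <= n:
        return 0
    k = min(k, n - k)            # use the symmetry C(n, k) == C(n, n - k)
    result = 1
    for t in range(1, k + 1):
        # After this step, result == C(n - k + t, t), so the division is exact.
        result = result * (n - k + t) // t
    return result


print(choose_sketch(4, 2), choose_sketch(6, 2))
```

Because every intermediate value is itself a binomial coefficient, no floating-point rounding is involved, even for large arguments.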
-
nltk.util.
elementtree_indent
(elem, level=0)[source]¶ Recursive function to indent an ElementTree._ElementInterface used for pretty printing. Run indent on elem and then output in the normal way.
Parameters: - elem (ElementTree._ElementInterface) – element to be indented. will be modified.
- level (nonnegative integer) – level of indentation for this element
Return type: ElementTree._ElementInterface
Returns: Contents of elem indented to reflect its structure
-
nltk.util.
everygrams
(sequence, min_len=1, max_len=-1, **kwargs)[source]¶ Returns all possible ngrams generated from a sequence of items, as an iterator.
>>> sent = 'a b c'.split()
>>> list(everygrams(sent))
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
>>> list(everygrams(sent, max_len=2))
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')]
Parameters: - sequence (sequence or iter) – the source data to be converted into ngrams
- min_len (int) – minimum length of the ngrams, aka. n-gram order/degree of ngram
- max_len (int) – maximum length of the ngrams (set to length of sequence by default)
Return type: iter(tuple)
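The ordering shown in the example (all 1-grams, then all 2-grams, and so on) can be reproduced with two nested loops. This is a standalone sketch, `everygrams_sketch` is a hypothetical name, and it materializes the input as a list, which the library function may not do.

```python
def everygrams_sketch(sequence, min_len=1, max_len=-1):
    """Yield every ngram with min_len <= n <= max_len, shortest first."""
    tokens = list(sequence)
    if max_len == -1:
        max_len = len(tokens)        # default: up to the full sequence length
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


sent = "a b c".split()
print(list(everygrams_sketch(sent)))
print(list(everygrams_sketch(sent, max_len=2)))
```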
-
nltk.util.
flatten
(*args)[source]¶ Flatten a list.
>>> from nltk.util import flatten
>>> flatten(1, 2, ['b', 'a' , ['c', 'd']], 3)
[1, 2, 'b', 'a', 'c', 'd', 3]
Parameters: args – items and lists to be combined into a single list Return type: list
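A recursive flattening along these lines could be written as below. This is a sketch with the hypothetical name `flatten_sketch`. It assumes that only lists and tuples are treated as nested containers, so strings are kept whole.

```python
def flatten_sketch(*args):
    """Combine items and arbitrarily nested lists/tuples into one flat list."""
    flat = []
    for item in args:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_sketch(*item))   # recurse into nested containers
        else:
            flat.append(item)
    return flat


print(flatten_sketch(1, 2, ["b", "a", ["c", "d"]], 3))
```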
-
nltk.util.
guess_encoding
(data)[source]¶ Given a byte string, attempt to decode it. Tries the standard ‘UTF8’ and ‘latin-1’ encodings, plus several gathered from locale information.
The calling program must first call:
locale.setlocale(locale.LC_ALL, '')
If successful, it returns (decoded_unicode, successful_encoding). If unsuccessful, it raises a UnicodeError.
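The try-each-encoding pattern can be sketched as follows. This simplified version is an assumption-laden illustration: `guess_encoding_sketch` is a hypothetical name, the candidate list is fixed to two encodings, and no locale information is consulted.

```python
def guess_encoding_sketch(data, candidates=("utf-8", "latin-1")):
    """Return (decoded_text, encoding) for the first candidate that decodes data."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeError:
            continue                 # try the next candidate
    raise UnicodeError("none of the candidate encodings could decode the data")


print(guess_encoding_sketch("caf\u00e9".encode("utf-8")))
print(guess_encoding_sketch(b"caf\xe9"))
```

Since latin-1 can decode any byte string, placing it last makes it a catch-all fallback, so the order of candidates matters.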
-
nltk.util.
in_idle
()[source]¶ Return True if this function is run within idle. Tkinter programs that are run in idle should never call Tk.mainloop; so this function should be used to gate all calls to Tk.mainloop.
Warning: This function works by checking sys.stdin. If the user has modified sys.stdin, then it may return incorrect results.
Return type: bool
-
nltk.util.
invert_graph
(graph)[source]¶ Inverts a directed graph.
Parameters: graph (dict(set)) – the graph, represented as a dictionary of sets Returns: the inverted graph Return type: dict(set)
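Inverting a graph stored as a dictionary of sets amounts to reversing every edge. In the sketch below, `invert_graph_sketch` is a hypothetical name. Whether nodes without incoming edges should appear as keys in the result is a design choice, and this sketch leaves them out.

```python
def invert_graph_sketch(graph):
    """Return a new dict(set) in which every edge a -> b becomes b -> a."""
    inverted = {}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, set()).add(source)
    return inverted


graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
print(invert_graph_sketch(graph))
```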
-
nltk.util.
ngrams
(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]¶ Return the ngrams generated from a sequence of items, as an iterator. For example:
>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
Wrap with list for a list version of this function. Set pad_left or pad_right to true in order to get additional ngrams:
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
Parameters: - sequence (sequence or iter) – the source data to be converted into ngrams
- n (int) – the degree of the ngrams
- pad_left (bool) – whether the ngrams should be left-padded
- pad_right (bool) – whether the ngrams should be right-padded
- left_pad_symbol (any) – the symbol to use for left padding (default is None)
- right_pad_symbol (any) – the symbol to use for right padding (default is None)
Return type: sequence or iter
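A sliding-window generator that works on any iterable, with optional padding, can be sketched with the standard library alone. The name `ngrams_sketch` is hypothetical and this is an illustration, not the library's source. Padding adds n - 1 symbols on the chosen side.

```python
from collections import deque
from itertools import chain, islice


def ngrams_sketch(sequence, n, pad_left=False, pad_right=False,
                  left_pad_symbol=None, right_pad_symbol=None):
    """Yield successive n-tuples from sequence using a fixed-size window."""
    items = iter(sequence)
    if pad_left:
        items = chain([left_pad_symbol] * (n - 1), items)
    if pad_right:
        items = chain(items, [right_pad_symbol] * (n - 1))
    window = deque(islice(items, n - 1), maxlen=n)   # pre-fill with n - 1 items
    for item in items:
        window.append(item)                          # oldest item drops out
        yield tuple(window)


print(list(ngrams_sketch([1, 2, 3, 4, 5], 3)))
print(list(ngrams_sketch([1, 2, 3, 4, 5], 2, pad_right=True, right_pad_symbol="</s>")))
```

Using a bounded `deque` keeps memory use proportional to n rather than to the length of the input.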
-
nltk.util.
pad_sequence
(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]¶ Returns a padded sequence of items before ngram extraction.
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
['<s>', 1, 2, 3, 4, 5, '</s>']
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
['<s>', 1, 2, 3, 4, 5]
>>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[1, 2, 3, 4, 5, '</s>']
Parameters: - sequence (sequence or iter) – the source data to be padded
- n (int) – the degree of the ngrams
- pad_left (bool) – whether the ngrams should be left-padded
- pad_right (bool) – whether the ngrams should be right-padded
- left_pad_symbol (any) – the symbol to use for left padding (default is None)
- right_pad_symbol (any) – the symbol to use for right padding (default is None)
Return type: sequence or iter
-
nltk.util.
pr
(data, start=0, end=None)[source]¶ Pretty print a sequence of data items
Parameters: - data (sequence or iter) – the data stream to print
- start (int) – the start position
- end (int) – the end position
-
nltk.util.
print_string
(s, width=70)[source]¶ Pretty print a string, breaking lines on whitespace
Parameters: - s (str) – the string to print, consisting of words and spaces
- width (int) – the display width
-
nltk.util.
re_show
(regexp, string, left='{', right='}')[source]¶ Return a string with markers surrounding the matched substrings. Search string for substrings matching regexp and wrap the matches with braces. This is convenient for learning about regular expressions.
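Wrapping every match in markers is a one-line substitution with the standard `re` module. The helper below is a sketch with the hypothetical name `mark_matches`. It returns the marked string and assumes the markers contain no backslashes.

```python
import re


def mark_matches(regexp, string, left="{", right="}"):
    """Return string with every match of regexp wrapped in left/right markers."""
    # \g<0> refers to the whole match in a replacement template.
    return re.compile(regexp, re.M).sub(left + r"\g<0>" + right, string)


print(mark_matches("[aeiou]", "hello world"))
print(mark_matches(r"\w+ing", "singing and dancing", left="<<", right=">>"))
```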
-
nltk.util.
set_proxy
(proxy, user=None, password='')[source]¶ Set the HTTP proxy for Python to download through.
If proxy is None, tries to set the proxy from environment or system settings.
Parameters: - proxy – The HTTP proxy server to use. For example: ‘http://proxy.example.com:3128/’
- user – The username to authenticate with. Use None to disable authentication.
- password – The password to authenticate with.
-
nltk.util.
skipgrams
(sequence, n, k, **kwargs)[source]¶ Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Parameters: - sequence (sequence or iter) – the source data to be converted into skipgrams
- n (int) – the degree of the ngrams
- k (int) – the skip distance
Return type: iter(tuple)
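One way to read the definition: for each starting token, look at the next n + k - 1 tokens and choose any n - 1 of them, keeping their order. The sketch below follows that reading. The name `skipgrams_sketch` is hypothetical, and the sketch lists the input in memory rather than streaming it.

```python
from itertools import combinations


def skipgrams_sketch(sequence, n, k):
    """Yield n-token skipgrams that skip at most k tokens in total."""
    tokens = list(sequence)
    for start in range(len(tokens)):
        # Tokens that may follow the head: the next n + k - 1 positions.
        window = tokens[start + 1:start + n + k]
        for tail in combinations(window, n - 1):
            yield (tokens[start],) + tail


sent = "Insurgents killed in ongoing fighting".split()
print(list(skipgrams_sketch(sent, 2, 2)))
```

With k = 0 the window holds exactly n - 1 tokens, so the function reduces to ordinary ngrams.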
-
nltk.util.
tokenwrap
(tokens, separator=' ', width=70)[source]¶ Pretty print a list of text tokens, breaking lines on whitespace
-
nltk.util.
transitive_closure
(graph, reflexive=False)[source]¶ Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.
The algorithm is a slight modification of the “Marking Algorithm” of Ioannidis & Ramakrishnan (1998) “Efficient Transitive Closure Algorithms”.
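For small graphs, the transitive closure can also be computed with a plain depth-first search from every node. The sketch below uses that simpler technique rather than the marking algorithm cited above. The name `transitive_closure_sketch` is hypothetical and the graph is assumed to be a dictionary of sets.

```python
def transitive_closure_sketch(graph, reflexive=False):
    """Map every node to the set of nodes reachable from it."""
    closure = {}
    for start in graph:
        reached = set()
        stack = list(graph[start])
        while stack:
            node = stack.pop()
            if node not in reached:
                reached.add(node)
                stack.extend(graph.get(node, ()))   # follow outgoing edges
        if reflexive:
            reached.add(start)                      # every node reaches itself
        closure[start] = reached
    return closure


graph = {1: {2}, 2: {3}, 3: set()}
print(transitive_closure_sketch(graph))
print(transitive_closure_sketch(graph, reflexive=True))
```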
-
nltk.util.
trigrams
(sequence, **kwargs)[source]¶ Return the trigrams generated from a sequence of items, as an iterator. For example:
>>> from nltk.util import trigrams
>>> list(trigrams([1,2,3,4,5]))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
Wrap with list for a list version of this function.
Parameters: sequence (sequence or iter) – the source data to be converted into trigrams Return type: iter(tuple)
wsd
Module¶
-
nltk.wsd.
lesk
(context_sentence, ambiguous_word, pos=None, synsets=None)[source]¶ Return a synset for an ambiguous word in a context.
Parameters: - context_sentence (iter) – The context sentence where the ambiguous word occurs, passed as an iterable of words.
- ambiguous_word (str) – The ambiguous word that requires WSD.
- pos (str) – A specified Part-of-Speech (POS).
- synsets (iter) – Possible synsets of the ambiguous word.
Returns: lesk_sense, the Synset() object with the highest signature overlap.
This function is an implementation of the original Lesk algorithm (1986) [1].
Usage example:
>>> lesk(['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.'], 'bank', 'n')
Synset('savings_bank.n.02')
[1] Lesk, Michael. “Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone.” Proceedings of the 5th Annual International Conference on Systems Documentation. ACM, 1986. http://dl.acm.org/citation.cfm?id=318728
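The core idea of the Lesk algorithm, choosing the sense whose definition shares the most words with the context, can be sketched without WordNet. Everything in the example below is a toy assumption: the function name `simplified_lesk`, the sense names, and the glosses are made up for illustration and do not come from WordNet.

```python
def simplified_lesk(context_words, sense_glosses):
    """Return the sense whose gloss overlaps most with the context words."""
    context = set(context_words)

    def overlap(sense):
        return len(context & set(sense_glosses[sense].split()))

    return max(sense_glosses, key=overlap)


# Hypothetical glosses for two senses of "bank".
toy_glosses = {
    "bank.financial": "an institution that accepts money deposits",
    "bank.river": "sloping land beside a body of water",
}
context = "I went to the bank to deposit money".split()
print(simplified_lesk(context, toy_glosses))
```

Because the comparison is on exact word forms, "deposit" and "deposits" do not match here. Stemming or lemmatizing both sides is a common refinement.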
Subpackages¶
- nltk.app package
- nltk.ccg package
- nltk.chat package
- nltk.chunk package
- nltk.classify package
- Submodules
- nltk.classify.api module
- nltk.classify.decisiontree module
- nltk.classify.maxent module
- nltk.classify.megam module
- nltk.classify.naivebayes module
- nltk.classify.positivenaivebayes module
- nltk.classify.rte_classify module
- nltk.classify.scikitlearn module
- nltk.classify.senna module
- nltk.classify.svm module
- nltk.classify.tadm module
- nltk.classify.textcat module
- nltk.classify.util module
- nltk.classify.weka module
- Module contents
- nltk.cluster package
- nltk.corpus package
- Subpackages
- nltk.corpus.reader package
- Submodules
- nltk.corpus.reader.aligned module
- nltk.corpus.reader.api module
- nltk.corpus.reader.bnc module
- nltk.corpus.reader.bracket_parse module
- nltk.corpus.reader.categorized_sents module
- nltk.corpus.reader.chasen module
- nltk.corpus.reader.childes module
- nltk.corpus.reader.chunked module
- nltk.corpus.reader.cmudict module
- nltk.corpus.reader.comparative_sents module
- nltk.corpus.reader.conll module
- nltk.corpus.reader.crubadan module
- nltk.corpus.reader.dependency module
- nltk.corpus.reader.framenet module
- nltk.corpus.reader.ieer module
- nltk.corpus.reader.indian module
- nltk.corpus.reader.ipipan module
- nltk.corpus.reader.knbc module
- nltk.corpus.reader.lin module
- nltk.corpus.reader.mte module
- nltk.corpus.reader.nkjp module
- nltk.corpus.reader.nombank module
- nltk.corpus.reader.nps_chat module
- nltk.corpus.reader.opinion_lexicon module
- nltk.corpus.reader.panlex_lite module
- nltk.corpus.reader.pl196x module
- nltk.corpus.reader.plaintext module
- nltk.corpus.reader.ppattach module
- nltk.corpus.reader.propbank module
- nltk.corpus.reader.pros_cons module
- nltk.corpus.reader.reviews module
- nltk.corpus.reader.rte module
- nltk.corpus.reader.semcor module
- nltk.corpus.reader.senseval module
- nltk.corpus.reader.sentiwordnet module
- nltk.corpus.reader.sinica_treebank module
- nltk.corpus.reader.string_category module
- nltk.corpus.reader.switchboard module
- nltk.corpus.reader.tagged module
- nltk.corpus.reader.timit module
- nltk.corpus.reader.toolbox module
- nltk.corpus.reader.twitter module
- nltk.corpus.reader.udhr module
- nltk.corpus.reader.util module
- nltk.corpus.reader.verbnet module
- nltk.corpus.reader.wordlist module
- nltk.corpus.reader.wordnet module
- nltk.corpus.reader.xmldocs module
- nltk.corpus.reader.ycoe module
- Module contents
- Submodules
- nltk.corpus.europarl_raw module
- nltk.corpus.util module
- Module contents
- nltk.draw package
- nltk.inference package
- nltk.metrics package
- nltk.misc package
- nltk.parse package
- Submodules
- nltk.parse.api module
- nltk.parse.bllip module
- nltk.parse.chart module
- nltk.parse.corenlp module
- nltk.parse.dependencygraph module
- nltk.parse.earleychart module
- nltk.parse.evaluate module
- nltk.parse.featurechart module
- nltk.parse.generate module
- nltk.parse.malt module
- nltk.parse.nonprojectivedependencyparser module
- nltk.parse.pchart module
- nltk.parse.projectivedependencyparser module
- nltk.parse.recursivedescent module
- nltk.parse.shiftreduce module
- nltk.parse.stanford module
- nltk.parse.transitionparser module
- nltk.parse.util module
- nltk.parse.viterbi module
- Module contents
- nltk.sem package
- Submodules
- nltk.sem.boxer module
- nltk.sem.chat80 module
- nltk.sem.cooper_storage module
- nltk.sem.drt module
- nltk.sem.drt_glue_demo module
- nltk.sem.evaluate module
- nltk.sem.glue module
- nltk.sem.hole module
- nltk.sem.lfg module
- nltk.sem.linearlogic module
- nltk.sem.logic module
- nltk.sem.relextract module
- nltk.sem.skolemize module
- nltk.sem.util module
- Module contents
- nltk.stem package
- nltk.tag package
- Submodules
- nltk.tag.api module
- nltk.tag.brill module
- nltk.tag.brill_trainer module
- nltk.tag.crf module
- nltk.tag.hmm module
- nltk.tag.hunpos module
- nltk.tag.mapping module
- nltk.tag.perceptron module
- nltk.tag.senna module
- nltk.tag.sequential module
- nltk.tag.stanford module
- nltk.tag.tnt module
- nltk.tag.util module
- Module contents
- nltk.test package
- Subpackages
- nltk.test.unit package
- Subpackages
- nltk.test.unit.translate package
- Submodules
- nltk.test.unit.translate.test_bleu module
- nltk.test.unit.translate.test_ibm1 module
- nltk.test.unit.translate.test_ibm2 module
- nltk.test.unit.translate.test_ibm3 module
- nltk.test.unit.translate.test_ibm4 module
- nltk.test.unit.translate.test_ibm5 module
- nltk.test.unit.translate.test_ibm_model module
- nltk.test.unit.translate.test_stack_decoder module
- Module contents
- Submodules
- nltk.test.unit.test_2x_compat module
- nltk.test.unit.test_aline module
- nltk.test.unit.test_chunk module
- nltk.test.unit.test_classify module
- nltk.test.unit.test_collocations module
- nltk.test.unit.test_corpora module
- nltk.test.unit.test_corpus_views module
- nltk.test.unit.test_hmm module
- nltk.test.unit.test_json2csv_corpus module
- nltk.test.unit.test_naivebayes module
- nltk.test.unit.test_seekable_unicode_stream_reader module
- nltk.test.unit.test_senna module
- nltk.test.unit.test_stem module
- nltk.test.unit.test_tag module
- nltk.test.unit.test_tgrep module
- nltk.test.unit.test_tokenize module
- nltk.test.unit.test_twitter_auth module
- nltk.test.unit.utils module
- Module contents
- Submodules
- nltk.test.all module
- nltk.test.childes_fixt module
- nltk.test.classify_fixt module
- nltk.test.compat_fixt module
- nltk.test.corpus_fixt module
- nltk.test.discourse_fixt module
- nltk.test.doctest_nose_plugin module
- nltk.test.gensim_fixt module
- nltk.test.gluesemantics_malt_fixt module
- nltk.test.inference_fixt module
- nltk.test.nonmonotonic_fixt module
- nltk.test.portuguese_en_fixt module
- nltk.test.probability_fixt module
- nltk.test.runtests module
- nltk.test.segmentation_fixt module
- nltk.test.semantics_fixt module
- nltk.test.translate_fixt module
- nltk.test.wordnet_fixt module
- Module contents
- nltk.tokenize package
- Submodules
- nltk.tokenize.api module
- nltk.tokenize.casual module
- nltk.tokenize.moses module
- nltk.tokenize.mwe module
- nltk.tokenize.punkt module
- nltk.tokenize.regexp module
- nltk.tokenize.repp module
- nltk.tokenize.sexpr module
- nltk.tokenize.simple module
- nltk.tokenize.stanford module
- nltk.tokenize.stanford_segmenter module
- nltk.tokenize.texttiling module
- nltk.tokenize.toktok module
- nltk.tokenize.treebank module
- nltk.tokenize.util module
- Module contents