nltk.corpus.reader package

Submodules

nltk.corpus.reader.aligned module

class nltk.corpus.reader.aligned.AlignedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.

aligned_sents(fileids=None)[source]
Returns:the given file(s) as a list of AlignedSent objects.
Return type:list(AlignedSent)
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
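
A minimal usage sketch: the comtrans sample corpus distributed with NLTK data is loaded with this reader (assumes nltk.download('comtrans')):

from nltk.corpus import comtrans

asent = comtrans.aligned_sents()[0]   # an AlignedSent object
print(asent.words)                    # source-language tokens
print(asent.mots)                     # target-language tokens
print(asent.alignment)                # word alignment between the two sentences
print(comtrans.words()[:8])           # plain word view over the same files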
class nltk.corpus.reader.aligned.AlignedSentCorpusView(corpus_file, encoding, aligned, group_by_sent, word_tokenizer, sent_tokenizer, alignedsent_block_reader)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A specialized corpus view for aligned sentences. AlignedSentCorpusView objects are typically created by AlignedCorpusReader (not directly by nltk users).

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream

nltk.corpus.reader.api module

API for corpus readers.

class nltk.corpus.reader.api.CategorizedCorpusReader(kwargs)[source]

Bases: object

A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.

Subclasses are expected to:

  • Call __init__() to set up the mapping.
  • Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.
categories(fileids=None)[source]

Return a list of the categories that are defined for this corpus, or for the file(s) if it is given.

fileids(categories=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that make up the given category(s) if specified.
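
For example, the Brown corpus reader mixes in CategorizedCorpusReader, so its views accept either fileids or categories. A short sketch (assumes nltk.download('brown')):

from nltk.corpus import brown

print(brown.categories())                      # all categories defined for the corpus
print(brown.fileids(categories='news')[:3])    # fileids restricted to one category
print(brown.words(categories=['news', 'editorial'])[:8])
print(brown.categories('ca01'))                # categories for a specific fileid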

class nltk.corpus.reader.api.CorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: object

A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its file identifier, which is the relative path to the file from the root directory.

A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as words() (for a list of words) and parsed_sents() (for a list of parsed sentences). Called with no arguments, these methods will return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such as fileids or categories, which can be used to select which portion of the corpus should be returned.
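
A short sketch of this pattern using the Gutenberg sample corpus (assumes nltk.download('gutenberg')):

from nltk.corpus import gutenberg

print(gutenberg.fileids()[:3])                   # file identifiers (paths relative to the root)
print(gutenberg.words('austen-emma.txt')[:8])    # view restricted to a single file
print(len(gutenberg.words()))                    # same view over the entire corpus
print(gutenberg.root)                            # the corpus root (a PathPointer)
print(gutenberg.abspath('austen-emma.txt'))      # absolute path for one fileid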

abspath(fileid)[source]

Return the absolute path for the given file.

Parameters:fileid (str) – The file identifier for the file whose path should be returned.
Return type:PathPointer
abspaths(fileids=None, include_encoding=False, include_fileid=False)[source]

Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.

Parameters:
  • fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.
  • include_encoding – If true, then return a list of (path_pointer, encoding) tuples.
Return type:

list(PathPointer)

citation()[source]

Return the contents of the corpus citation.bib file, if it exists.

encoding(file)[source]

Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.

ensure_loaded()[source]

Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).

fileids()[source]

Return a list of file identifiers for the fileids that make up this corpus.

license()[source]

Return the contents of the corpus LICENSE file, if it exists.

open(file)[source]

Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.

Parameters:file – The file identifier of the file to read.
readme()[source]

Return the contents of the corpus README file, if it exists.

root

The directory where this corpus is stored.

Type:PathPointer
unicode_repr()

Return repr(self).

class nltk.corpus.reader.api.SyntaxCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:

  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of list of words.
  • _tag, which takes a block and returns a list of list of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
parsed_sents(fileids=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]

nltk.corpus.reader.bnc module

Corpus reader for the XML version of the British National Corpus.

class nltk.corpus.reader.bnc.BNCCorpusReader(root, fileids, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
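
Once instantiated, the reader offers the usual views. A hedged sketch (the BNC data itself must be obtained separately; paths are illustrative):

from nltk.corpus.reader.bnc import BNCCorpusReader

bnc = BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
print(bnc.words()[:10])                 # plain word list
print(bnc.tagged_words(c5=True)[:5])    # detailed C5 tags instead of the simplified tagset
print(bnc.sents(stem=True)[0])          # stems instead of surface forms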
sents(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_words(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
words(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
class nltk.corpus.reader.bnc.BNCSentence(num, items)[source]

Bases: list

A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).

class nltk.corpus.reader.bnc.BNCWordView(fileid, sent, tag, strip_space, stem)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream backed corpus view specialized for use with the BNC corpus.

author = None

Author of the document.

editor = None

Editor

handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
handle_header(elt, context)[source]
handle_sent(elt)[source]
handle_word(elt)[source]
resps = None

Statement of responsibility

tags_to_ignore = {'gap', 'pb', 'unclear', 'align', 'event', 'vocal', 'shift', 'pause'}

These tags are ignored. For their description refer to the technical documentation, for example, http://www.natcorp.ox.ac.uk/docs/URG/ref-vocal.html

title = None

Title of the document.

nltk.corpus.reader.bracket_parse module

Corpus reader for corpora that consist of parenthesis-delineated parse trees.

class nltk.corpus.reader.bracket_parse.AlpinoCorpusReader(root, encoding='ISO-8859-1', tagset=None)[source]

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Reader for the Alpino Dutch Treebank. This corpus has an embedded lexical breakdown structure, as read by _parse. Unfortunately this puts punctuation and some other words out of sentence order in the XML element tree, which is a problem for the tagged and word views; _tag and _word are therefore overridden to pass a non-default ‘ordered’ parameter to the overridden _normalize function. The _parse function can then remain untouched.

class nltk.corpus.reader.bracket_parse.BracketParseCorpusReader(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.
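
The Penn Treebank sample distributed with NLTK is read by this class, so it can serve as a quick usage sketch (assumes nltk.download('treebank')):

from nltk.corpus import treebank

print(treebank.words('wsj_0001.mrg')[:6])
print(treebank.tagged_sents()[0][:4])        # (word, tag) tuples
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t.label())                             # root label of the parse tree
print(t.leaves()[:5])                        # terminal word strings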

class nltk.corpus.reader.bracket_parse.CategorizedBracketParseCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>

paras(fileids=None, categories=None)[source]
parsed_paras(fileids=None, categories=None)[source]
parsed_sents(fileids=None, categories=None)[source]
parsed_words(fileids=None, categories=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None)[source]
tagged_paras(fileids=None, categories=None, tagset=None)[source]
tagged_sents(fileids=None, categories=None, tagset=None)[source]
tagged_words(fileids=None, categories=None, tagset=None)[source]
words(fileids=None, categories=None)[source]

nltk.corpus.reader.categorized_sents module

A CorpusReader for corpora that contain one instance per row. This CorpusReader is used specifically for the Subjectivity Dataset and the Sentence Polarity Dataset.

  • Subjectivity Dataset information -

Authors: Bo Pang and Lillian Lee. Url: http://www.cs.cornell.edu/people/pabo/movie-review-data

Distributed with permission.

Related papers:

  • Bo Pang and Lillian Lee. “A Sentimental Education: Sentiment Analysis Using
    Subjectivity Summarization Based on Minimum Cuts”. Proceedings of the ACL, 2004.
  • Sentence Polarity Dataset information -

Authors: Bo Pang and Lillian Lee. Url: http://www.cs.cornell.edu/people/pabo/movie-review-data

Related papers:

  • Bo Pang and Lillian Lee. “Seeing stars: Exploiting class relationships for
    sentiment categorization with respect to rating scales”. Proceedings of the ACL, 2005.
class nltk.corpus.reader.categorized_sents.CategorizedSentencesCorpusReader(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=None, encoding='utf8', **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.api.CorpusReader

A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows.

Examples using the Subjectivity Dataset:

>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23]
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits',
'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]

Examples using the Sentence Polarity Dataset:

>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents()
[['simplistic', ',', 'silly', 'and', 'tedious', '.'], ["it's", 'so', 'laddish',
'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find',
'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

raw(fileids=None, categories=None)[source]
Parameters:
  • fileids – a list or regexp specifying the fileids that have to be returned as a raw string.
  • categories – a list specifying the categories whose files have to be returned as a raw string.
Returns:

the given file(s) as a single string.

Return type:

str

readme()[source]

Return the contents of the corpus Readme.txt file.

sents(fileids=None, categories=None)[source]

Return all sentences in the corpus or in the specified file(s).

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
  • categories – a list specifying the categories whose sentences have to be returned.
Returns:

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type:

list(list(str))

words(fileids=None, categories=None)[source]

Return all words and punctuation symbols in the corpus or in the specified file(s).

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose words have to be returned.
  • categories – a list specifying the categories whose words have to be returned.
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

nltk.corpus.reader.chasen module

class nltk.corpus.reader.chasen.ChasenCorpusReader(root, fileids, encoding='utf8', sent_splitter=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

paras(fileids=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_paras(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
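
A brief sketch: the JEITA morphologically tagged Japanese corpus bundled with NLTK data is read with this class (assumes nltk.download('jeita')):

from nltk.corpus import jeita

print(jeita.words()[:10])          # Japanese word tokens
print(jeita.tagged_words()[:3])    # (word, morphological-tag) tuples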
class nltk.corpus.reader.chasen.ChasenCorpusView(corpus_file, encoding, tagged, group_by_sent, group_by_para, sent_splitter=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A specialized corpus view for ChasenReader. Similar to TaggedCorpusView, but it uses a fixed word tokenizer and sentence tokenizer.

read_block(stream)[source]

Reads one paragraph at a time.

nltk.corpus.reader.chasen.demo()[source]
nltk.corpus.reader.chasen.test()[source]

nltk.corpus.reader.childes module

Corpus reader for the XML version of the CHILDES corpus.

class nltk.corpus.reader.childes.CHILDESCorpusReader(root, fileids, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at http://childes.psy.cmu.edu/. The XML version of CHILDES is located at http://childes.psy.cmu.edu/data-xml/. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/).

For access to the file text use the usual nltk functions, words(), sents(), tagged_words() and tagged_sents().
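
A hedged usage sketch, assuming (for illustration) that the Eng-USA/Valian part of the XML corpus was copied under nltk_data/corpora/childes/data-xml/:

import nltk
from nltk.corpus.reader import CHILDESCorpusReader

corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA/')
valian = CHILDESCorpusReader(corpus_root, 'Valian/.*.xml')

fileid = valian.fileids()[0]
print(valian.words(fileid, speaker='CHI')[:10])   # the child's utterances only
print(valian.age(fileid, month=True))             # age of the target child, in months
print(valian.MLU(fileid))                         # mean length of utterance
print(valian.participants(fileid)[0])             # participant metadata for the file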

MLU(fileids=None, speaker='CHI')[source]
Returns:the mean length of utterance (MLU) of the given speaker for the given file(s)
Return type:list(float)
age(fileids=None, speaker='CHI', month=False)[source]
Returns:the given speaker's age in the given file(s), as a string or int
Return type:list or int
Parameters:month – If true, return the age in months instead of year-month-day format
childes_url_base = 'http://childes.psy.cmu.edu/browser/index.php?url='
convert_age(age_year)[source]

Calculate age in months from a string in CHILDES format

corpus(fileids=None)[source]
Returns:the given file(s) as a dict of (corpus_property_key, value)
Return type:list(dict)
participants(fileids=None)[source]
Returns:the given file(s) as a dict of (participant_property_key, value)
Return type:list(dict)
sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
webview_file(fileid, urlbase=None)[source]

Map a corpus file to its web version on the CHILDES website, and open it in a web browser.

The complete URL to be used is:
childes.childes_url_base + urlbase + fileid.replace('.xml', '.cha')

If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???

The function first looks (as a special case) if “Eng-USA” is on the path consisting of <corpus root>+fileid; then if “childes”, possibly followed by “data-xml”, appears. If neither one is found, we use the unmodified fileid and hope for the best. If this is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase=’Eng-USA/Cornell’.

words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of words

Return type:

list(str)

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
nltk.corpus.reader.childes.demo(corpus_root=None)[source]

The CHILDES corpus should be manually downloaded and saved to [NLTK_Data_Dir]/corpora/childes/

nltk.corpus.reader.chunked module

A reader for corpora that contain chunked (and optionally tagged) documents.

class nltk.corpus.reader.chunked.ChunkedCorpusReader(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.
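
A quick sketch using the chunked Treebank sample, which NLTK loads with this reader (assumes nltk.download('treebank')):

from nltk.corpus import treebank_chunk

print(treebank_chunk.tagged_sents()[0][:4])   # (word, tag) tuples
tree = treebank_chunk.chunked_sents()[0]      # a shallow chunk Tree
print(tree.label())                           # root label of the shallow tree
print(tree[0])                                # first chunk subtree or (word, tag) leaf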

chunked_paras(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(list(Tree))
chunked_sents(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(Tree)
chunked_words(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.
Return type:list(tuple(str,str) and Tree)
paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.chunked.ChunkedCorpusView(fileid, encoding, tagged, group_by_sent, group_by_para, chunked, str2chunktree, sent_tokenizer, para_block_reader, source_tagset=None, target_tagset=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream

nltk.corpus.reader.cmudict module

The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6] ftp://ftp.cs.cmu.edu/project/speech/dict/ Copyright 1998 Carnegie Mellon University

File Format: Each line consists of an uppercased word, a counter (for alternative pronunciations), and a transcription. Vowels are marked for stress (1=primary, 2=secondary, 0=no stress). E.g.: NATURAL 1 N AE1 CH ER0 AH0 L

The dictionary contains 127069 entries. Of these, 119400 words are assigned a unique pronunciation, 6830 words have two pronunciations, and 839 words have three or more pronunciations. Many of these are fast-speech variants.

Phonemes: There are 39 phonemes, as shown below:

Phoneme  Example  Translation      Phoneme  Example  Translation
-------  -------  -----------      -------  -------  -----------
AA       odd      AA D             AE       at       AE T
AH       hut      HH AH T          AO       ought    AO T
AW       cow      K AW             AY       hide     HH AY D
B        be       B IY             CH       cheese   CH IY Z
D        dee      D IY             DH       thee     DH IY
EH       Ed       EH D             ER       hurt     HH ER T
EY       ate      EY T             F        fee      F IY
G        green    G R IY N         HH       he       HH IY
IH       it       IH T             IY       eat      IY T
JH       gee      JH IY            K        key      K IY
L        lee      L IY             M        me       M IY
N        knee     N IY             NG       ping     P IH NG
OW       oat      OW T             OY       toy      T OY
P        pee      P IY             R        read     R IY D
S        sea      S IY             SH       she      SH IY
T        tea      T IY             TH       theta    TH EY T AH
UH       hood     HH UH D          UW       two      T UW
V        vee      V IY             W        we       W IY
Y        yield    Y IY L D         Z        zee      Z IY
ZH       seizure  S IY ZH ER

class nltk.corpus.reader.cmudict.CMUDictCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

dict()[source]
Returns:the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.
entries()[source]
Returns:the cmudict lexicon as a list of entries containing (word, transcriptions) tuples.
raw()[source]
Returns:the cmudict lexicon as a raw string.
words()[source]
Returns:a list of all words defined in the cmudict lexicon.
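
A short usage sketch (assumes the dictionary has been installed via nltk.download('cmudict')):

from nltk.corpus import cmudict

prondict = cmudict.dict()
print(prondict['natural'])        # pronunciations as lists of ARPAbet phonemes
word, trans = cmudict.entries()[0]
print(word, trans)                # first (word, transcription) pair in the lexicon
print(len(cmudict.words()))       # all words defined in the lexicon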
nltk.corpus.reader.cmudict.read_cmudict_block(stream)[source]

nltk.corpus.reader.comparative_sents module

CorpusReader for the Comparative Sentence Dataset.

  • Comparative Sentence Dataset information -
Annotated by: Nitin Jindal and Bing Liu, 2006.
Department of Computer Science, University of Illinois at Chicago
Contact: Nitin Jindal, njindal@cs.uic.edu
Bing Liu, liub@cs.uic.edu (http://www.cs.uic.edu/~liub)

Distributed with permission.

Related papers:

  • Nitin Jindal and Bing Liu. “Identifying Comparative Sentences in Text Documents”.
    Proceedings of the ACM SIGIR International Conference on Information Retrieval (SIGIR-06), 2006.
  • Nitin Jindal and Bing Liu. “Mining Comparative Sentences and Relations”.
    Proceedings of Twenty First National Conference on Artificial Intelligence (AAAI-2006), 2006.
  • Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”.
    Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.
class nltk.corpus.reader.comparative_sents.ComparativeSentencesCorpusReader(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).

>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly',
'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve",
'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

comparisons(fileids=None)[source]

Return all comparisons in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned.
Returns:the given file(s) as a list of Comparison objects.
Return type:list(Comparison)
keywords(fileids=None)[source]

Return a set of all keywords used in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose keywords have to be returned.
Returns:the set of keywords and comparative phrases used in the corpus.
Return type:set(str)
keywords_readme()[source]

Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).

raw(fileids=None)[source]
Parameters:fileids – a list or regexp specifying the fileids that have to be returned as a raw string.
Returns:the given file(s) as a single string.
Return type:str
readme()[source]

Return the contents of the corpus readme file.

sents(fileids=None)[source]

Return all sentences in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
Returns:all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified).
Return type:list(list(str)) or list(str)
words(fileids=None)[source]

Return all words and punctuation symbols in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose words have to be returned.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.comparative_sents.Comparison(text=None, comp_type=None, entity_1=None, entity_2=None, feature=None, keyword=None)[source]

Bases: object

A Comparison represents a comparative sentence and its constituents.

nltk.corpus.reader.conll module

Read CoNLL-style chunk fileids.

class nltk.corpus.reader.conll.ConllChunkCorpusReader(root, fileids, chunk_types, encoding='utf8', tagset=None, separator=None)[source]

Bases: nltk.corpus.reader.conll.ConllCorpusReader

A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.

class nltk.corpus.reader.conll.ConllCorpusReader(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.Tree'>, tagset=None, separator=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus. By default, columns are split on consecutive whitespace; with the separator argument you can specify a string to split on instead (e.g. ' ').

@todo: Add support for reading from corpora where different parallel files contain different columns.
@todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (e.g. words() and parsed_sents() could both share the same grid corpus view object).
@todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (e.g. parsed_documents()).
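
A usage sketch via the CoNLL-2000 chunking corpus, which NLTK reads with the ConllChunkCorpusReader subclass (described above) using the columns ('words', 'pos', 'chunk'); assumes nltk.download('conll2000'):

from nltk.corpus import conll2000
from nltk.corpus.reader import ConllChunkCorpusReader

print(conll2000.tagged_sents('train.txt')[0][:3])                  # (word, pos) tuples
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[0]) # NP chunks only
print(conll2000.iob_words('train.txt')[:3])                        # (word, pos, IOB-tag) triples

# Instantiating a reader directly for local CoNLL-style files (illustrative paths):
reader = ConllChunkCorpusReader('/path/to/data', r'.*\.conll',
                                chunk_types=('NP', 'VP', 'PP'))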
CHUNK = 'chunk'

column type for chunk structures

COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')

A list of all column types supported by the conll corpus reader.

IGNORE = 'ignore'

column type for column that should be ignored

NE = 'ne'

column type for named entities

POS = 'pos'

column type for part-of-speech tags

SRL = 'srl'

column type for semantic role labels

TREE = 'tree'

column type for parse trees

WORDS = 'words'

column type for words

chunked_sents(fileids=None, chunk_types=None, tagset=None)[source]
chunked_words(fileids=None, chunk_types=None, tagset=None)[source]
iob_sents(fileids=None, tagset=None)[source]
Returns:a list of lists of word/tag/IOB tuples
Return type:list(list)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
iob_words(fileids=None, tagset=None)[source]
Returns:a list of word/tag/IOB tuples
Return type:list(tuple)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
parsed_sents(fileids=None, pos_in_tree=None, tagset=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
srl_instances(fileids=None, pos_in_tree=None, flatten=True)[source]
srl_spans(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.conll.ConllSRLInstance(tree, verb_head, verb_stem, roleset, tagged_spans)[source]

Bases: object

An SRL instance from a CoNLL corpus, which identifies and provides labels for the arguments of a single verb.

arguments = None

A list of (argspan, argid) tuples, specifying the location and type for each of the arguments identified by this instance. argspan is a tuple (start, end), indicating that the argument consists of words[start:end].

pprint()[source]
tagged_spans = None

A list of (span, id) tuples, specifying the location and type for each of the arguments, as well as the verb pieces, that make up this instance.

tree = None

The parse tree for the sentence containing this instance.

unicode_repr()

Return repr(self).

verb = None

A list of the word indices of the words that compose the verb whose arguments are identified by this instance. This will contain multiple word indices when multi-word verbs are used (e.g. ‘turn on’).

verb_head = None

The word index of the head word of the verb whose arguments are identified by this instance. E.g., for a sentence that uses the verb ‘turn on,’ verb_head will be the word index of the word ‘turn’.

words = None

A list of the words in the sentence containing this instance.

class nltk.corpus.reader.conll.ConllSRLInstanceList(tree, instances=())[source]

Bases: list

Set of instances for a single sentence

pprint(include_tree=False)[source]
unicode_repr

Return repr(self).

nltk.corpus.reader.crubadan module

An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan.

There are multiple potential applications for the data but this reader was created with the goal of using it in the context of language identification.

For details about An Crubadan, this data, and its potential uses, see: http://borel.slu.edu/crubadan/index.html

class nltk.corpus.reader.crubadan.CrubadanCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader used to access the An Crubadan language n-gram files.

crubadan_to_iso(lang)[source]

Return ISO 639-3 code given internal Crubadan code

iso_to_crubadan(lang)[source]

Return internal Crubadan code based on ISO 639-3 code

lang_freq(lang)[source]

Return n-gram FreqDist for a specific language given ISO 639-3 language code

langs()[source]

Return a list of supported languages as ISO 639-3 codes
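
A short sketch (assumes nltk.download('crubadan'); the reader is exposed as nltk.corpus.crubadan):

from nltk.corpus import crubadan

print(crubadan.langs()[:5])               # supported languages, as ISO 639-3 codes
fd = crubadan.lang_freq('eng')            # n-gram FreqDist for English
print(fd.most_common(3))
print(crubadan.iso_to_crubadan('eng'))    # internal Crubadan code for English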

nltk.corpus.reader.dependency module

class nltk.corpus.reader.dependency.DependencyCorpusReader(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), para_block_reader=<function read_blankline_block>)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

parsed_sents(fileids=None)[source]
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
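
A usage sketch via the dependency-converted Treebank sample, which NLTK loads with this reader (assumes nltk.download('dependency_treebank')):

from nltk.corpus import dependency_treebank

print(dependency_treebank.sents()[0][:6])
print(dependency_treebank.tagged_sents()[0][:3])
g = dependency_treebank.parsed_sents()[0]   # a DependencyGraph
print(g.root['word'])                       # word at the root of the dependency tree
print(g.tree())                             # convert to an nltk.Tree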
class nltk.corpus.reader.dependency.DependencyCorpusView(corpus_file, tagged, group_by_sent, dependencies, chunk_types=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream

nltk.corpus.reader.framenet module

Corpus reader for the FrameNet 1.7 lexicon and corpus.

class nltk.corpus.reader.framenet.AttrDict(*args, **kwargs)[source]

Bases: dict

A class that wraps a dict and allows accessing the keys of the dict as if they were attributes. Taken from here:

>>> foo = {'a':1, 'b':2, 'c':3}
>>> bar = AttrDict(foo)
>>> pprint(dict(bar))
{'a': 1, 'b': 2, 'c': 3}
>>> bar.b
2
>>> bar.d = 4
>>> pprint(dict(bar))
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
unicode_repr()

Return repr(self).

class nltk.corpus.reader.framenet.FramenetCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

A corpus reader for the Framenet Corpus.

>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
annotations(luNamePattern=None, exemplars=True, full_text=True)[source]

Frame annotation sets matching the specified criteria.

buildindexes()[source]

Build the internal indexes to make look-ups faster.

doc(fn_docid)[source]

Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the docs_metadata() function.

The dict that is returned from this function will contain the following keys:

  • ‘_type’ : ‘fulltextannotation’

  • ‘sentence’ : a list of sentences in the document
    • Each item in the list is a dict containing the following keys:
      • ‘ID’ : the ID number of the sentence

      • ‘_type’ : ‘sentence’

      • ‘text’ : the text of the sentence

      • ‘paragNo’ : the paragraph number

      • ‘sentNo’ : the sentence number

      • ‘docID’ : the document ID number

      • ‘corpID’ : the corpus ID number

      • ‘aPos’ : the annotation position

      • ‘annotationSet’ : a list of annotation layers for the sentence
        • Each item in the list is a dict containing the following keys:
          • ‘ID’ : the ID number of the annotation set

          • ‘_type’ : ‘annotationset’

          • ‘status’ : either ‘MANUAL’ or ‘UNANN’

          • ‘luName’ : (only if status is ‘MANUAL’)

          • ‘luID’ : (only if status is ‘MANUAL’)

          • ‘frameID’ : (only if status is ‘MANUAL’)

          • ‘frameName’: (only if status is ‘MANUAL’)

          • ‘layer’ : a list of labels for the layer
            • Each item in the layer is a dict containing the following keys:

              • ‘_type’: ‘layer’
              • ‘rank’
              • ‘name’
              • ‘label’ : a list of labels in the layer
                • Each item is a dict containing the following keys:
                  • ‘start’
                  • ‘end’
                  • ‘name’
                  • ‘feID’ (optional)
Parameters:fn_docid (int) – The Framenet id number of the document
Returns:Information about the annotated document
Return type:dict
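
A hedged sketch of retrieving one annotated document (assumes the FrameNet data is installed, e.g. via nltk.download('framenet_v17')):

from nltk.corpus import framenet as fn

meta = fn.docs_metadata()[0]      # an index entry with an 'ID' field
d = fn.doc(meta.ID)               # full annotation structure for that document
print(d._type)                    # 'fulltextannotation'
print(len(d.sentence))            # number of sentences in the document
print(d.sentence[0].text[:60])    # text of the first sentence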
docs(name=None)[source]

Return a list of the annotated full-text documents in FrameNet, optionally filtered by a regex to be matched against the document name.

docs_metadata(name=None)[source]

Return an index of the annotated documents in Framenet.

Details for a specific annotated document can be obtained by using this class’s doc() function and passing it the value of the ‘ID’ field.

>>> from nltk.corpus import framenet as fn
>>> len(fn.docs()) in (78, 107) # FN 1.5 and 1.7, resp.
True
>>> set([x.corpname for x in fn.docs_metadata()]) >= set(['ANC', 'KBEval', 'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank'])
True
Parameters:name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”.
Returns:A list of selected (or all) annotated documents
Return type:list of dicts, where each dict object contains the following keys:
  • ’name’
  • ’ID’
  • ’corpid’
  • ’corpname’
  • ’description’
  • ’filename’
exemplars(luNamePattern=None, frame=None, fe=None, fe2=None)[source]

Lexicographic exemplar sentences, optionally filtered by LU name and/or 1-2 FEs that are realized overtly. ‘frame’ may be a name pattern, frame ID, or frame instance. ‘fe’ may be a name pattern or FE instance; if specified, ‘fe2’ may also be specified to retrieve sentences with both overt FEs (in either order).
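
A minimal sketch (assumes the FrameNet data is installed; the text attribute follows the exemplar display shown later under lu()):

from nltk.corpus import framenet as fn

ex = fn.exemplars('glint.v')                       # exemplars for LUs matching this name
print(len(ex))
print(ex[0].text)                                  # the raw sentence text
ex2 = fn.exemplars(frame='Apply_heat', fe='Food')  # sentences with an overt Food FE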

fe_relations()[source]

Obtain a list of frame element relations.

>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels) in (10020, 12393)   # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(ferels[0], breakLines=True)
{'ID': 14642,
'_type': 'ferelation',
'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>,
'subFE': <fe ID=11370 name=Degree>,
'subFEName': 'Degree',
'subFrame': <frame ID=1904 name=Lively_place>,
'subID': 11370,
'supID': 2271,
'superFE': <fe ID=2271 name=Degree>,
'superFEName': 'Degree',
'superFrame': <frame ID=262 name=Abounding_with>,
'type': <framerelationtype ID=1 name=Inheritance>}
Returns:A list of all of the frame element relations in framenet
Return type:list(dict)
fes(name=None, frame=None)[source]

Lists frame element objects. If ‘name’ is provided, this is treated as a case-insensitive regular expression to filter by frame name. (Case-insensitivity is because casing of frame element names is not always consistent across frames.) Specify ‘frame’ to filter by a frame name pattern, ID, or object.

>>> from nltk.corpus import framenet as fn
>>> fn.fes('Noise_maker')
[<fe ID=6043 name=Noise_maker>]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound')])
[('Cause_to_make_noise', 'Sound_maker'), ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source'), ('Sound_movement', 'Location_of_sound_source'),
 ('Sound_movement', 'Sound'), ('Sound_movement', 'Sound_source'),
 ('Sounds', 'Component_sound'), ('Sounds', 'Location_of_sound_source'),
 ('Sounds', 'Sound_source'), ('Vocalizations', 'Location_of_sound_source'),
 ('Vocalizations', 'Sound_source')]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound',r'(?i)make_noise')])
[('Cause_to_make_noise', 'Sound_maker'),
 ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source')]
>>> sorted(set(fe.name for fe in fn.fes('^sound')))
['Sound', 'Sound_maker', 'Sound_source']
>>> len(fn.fes('^sound$'))
2
Parameters:name (str) – A regular expression pattern used to match against frame element names. If ‘name’ is None, then a list of all frame elements will be returned.
Returns:A list of matching frame elements
Return type:list(AttrDict)
frame(fn_fid_or_fname, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s name or id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation')
frame (1494): Imposing_obligation...

The dict that is returned from this function will contain the following information about the Frame:

  • ‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)

  • ‘definition’ : textual definition of the Frame

  • ‘ID’ : the internal ID number of the Frame

  • ‘semTypes’ : a list of semantic types for this frame
    • Each item in the list is a dict containing the following keys:
      • ‘name’ : can be used with the semtype() function
      • ‘ID’ : can be used with the semtype() function
  • ‘lexUnit’ : a dict containing all of the LUs for this frame.

    The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)

  • ‘FE’ : a dict containing the Frame Elements that are part of this frame

    The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys

    • ‘definition’ : The definition of the FE
    • ‘name’ : The name of the FE e.g. ‘Body_system’
    • ‘ID’ : The id number
    • ‘_type’ : ‘fe’
    • ‘abbrev’ : Abbreviation e.g. ‘bod’
    • ‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”
    • ‘semType’ : if not None, a dict with the following two keys:
      • ‘name’ : name of the semantic type. can be used with
        the semtype() function
      • ‘ID’ : id number of the semantic type. can be used with
        the semtype() function
    • ‘requiresFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
    • ‘excludesFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
  • ‘frameRelation’ : a list of objects describing frame relations

  • ‘FEcoreSets’ : a list of Frame Element core sets for this frame
    • Each item in the list is a list of FE objects
Parameters:
  • fn_fid_or_fname (int or str) – The Framenet name or id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

frame_by_id(fn_fid, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fid (int) – The Framenet id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_by_name(fn_fname, ignorekeys=[], check_cache=True)[source]

Get the details for the specified Frame using the frame’s name.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fname (str) – The name of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_ids_and_names(name=None)[source]

Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.

frame_relation_types()[source]

Obtain a list of frame relation types.

>>> from nltk.corpus import framenet as fn
>>> frts = sorted(fn.frame_relation_types(), key=itemgetter('ID'))
>>> isinstance(frts, list)
True
>>> len(frts) in (9, 10)    # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1,
 '_type': 'framerelationtype',
 'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...],
 'name': 'Inheritance',
 'subFrameName': 'Child',
 'superFrameName': 'Parent'}
Returns:A list of all of the frame relation types in framenet
Return type:list(dict)
frame_relations(frame=None, frame2=None, type=None)[source]
Parameters:
  • frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned
  • frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction
  • type – (optional) frame relation type (name or object); show only relations of this type
Returns:

A list of all of the frame relations in framenet

Return type:

list(dict)

>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels) in (1676, 2070)  # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(274), breakLines=True)
[<Parent=Avoiding -- Inheritance -> Child=Dodging>,
 <Parent=Avoiding -- Inheritance -> Child=Evading>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True)
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>,
<MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
frames(name=None)[source]

Obtain details for a specific frame.

>>> from nltk.corpus import framenet as fn
>>> len(fn.frames()) in (1019, 1221)    # FN 1.5 and 1.7, resp.
True
>>> x = PrettyList(fn.frames(r'(?i)crim'), maxReprSize=0, breakLines=True)
>>> x.sort(key=itemgetter('ID'))
>>> x
[<frame ID=200 name=Criminal_process>,
 <frame ID=500 name=Criminal_investigation>,
 <frame ID=692 name=Crime_scenario>,
 <frame ID=700 name=Committing_crime>]

A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.

We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).

FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:

  • Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
  • Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
  • Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.
  • Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.
Parameters:name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned.
Returns:A list of matching Frames (or all Frames).
Return type:list(AttrDict)
frames_by_lemma(pat)[source]

Returns a list of all frames that contain LUs in which the name attribute of the LU matches the given regular expression pat. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).

Note: if you are going to be doing a lot of this type of searching, you’d want to build an index that maps from lemmas to frames because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db.

>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyList
>>> PrettyList(sorted(fn.frames_by_lemma(r'(?i)a little'), key=itemgetter('ID'))) 
[<frame ID=189 name=Quanti...>, <frame ID=2001 name=Degree>]
Returns:A list of frame objects.
Return type:list(AttrDict)
ft_sents(docNamePattern=None)[source]

Full-text annotation sentences, optionally filtered by document name.

help(attrname=None)[source]

Display help information summarizing the main methods.

lu(fn_luid, ignorekeys=[], luName=None, frameID=None, frameName=None)[source]

Access a lexical unit by its ID. luName, frameID, and frameName are used only in the event that the LU does not have a file in the database (which is the case for LUs with “Problem” status); in this case, a placeholder LU is created which just contains its name, ID, and frame.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> pprint(list(map(PrettyDict, fn.lu(256).lexemes)))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]
>>> fn.lu(227).exemplars[23]
exemplar sentence (352962):
[sentNo] 0
[aPos] 59699508

[LU] (227) guess.v in Coming_to_believe

[frame] (23) Coming_to_believe

[annotationSet] 2 annotation sets

[POS] 18 tags

[POS_tagset] BNC

[GF] 3 relations

[PT] 3 phrases

[Other] 1 entry

[text] + [Target] + [FE]

When he was inside the house , Culley noticed the characteristic
                                              ------------------
                                              Content

he would n't have guessed at .
--                ******* --
Co                        C1 [Evidence:INI]
 (Co=Cognizer, C1=Content)

The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:

  • ‘name’ : the name of the LU (e.g. ‘merger.n’)

  • ‘definition’ : textual definition of the LU

  • ‘ID’ : the internal ID number of the LU

  • ‘_type’ : ‘lu’

  • ‘status’ : e.g. ‘Created’

  • ‘frame’ : Frame that this LU belongs to

  • ‘POS’ : the part of speech of this LU (e.g. ‘N’)

  • ‘totalAnnotated’ : total number of examples annotated with this LU

  • ‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)

  • ‘sentenceCount’ : a dict with the following two keys:
    • ‘annotated’: number of sentences annotated with this LU
    • ‘total’ : total number of sentences with this LU
  • ‘lexemes’ : a list of dicts describing the lemma of this LU.

    Each dict in the list contains these keys:

    • ‘POS’ : part of speech e.g. ‘N’

    • ‘name’ : either single-lexeme e.g. ‘merger’ or multi-lexeme e.g. ‘a little’

    • ‘order’: the order of the lexeme in the lemma (starting from 1)

    • ‘headword’: a boolean (‘true’ or ‘false’)

    • ‘breakBefore’: Can this lexeme be separated from the previous lexeme?
      Consider: “take over.v” as in:

      Germany took over the Netherlands in 2 days. Germany took the Netherlands over in 2 days.

      In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:

      Mary takes after her grandmother.

      *Mary takes her grandmother after.

      In this case, ‘breakBefore’ would be “false” for the lexeme “after”.

  • ‘lemmaID’ : Can be used to connect lemmas in different LUs

  • ‘semTypes’ : a list of semantic type objects for this LU

  • ‘subCorpus’ : a list of subcorpora
    • Each item in the list is a dict containing the following keys:
      • ‘name’ :
      • ‘sentence’ : a list of sentences in the subcorpus
        • each item in the list is a dict with the following keys:
          • ‘ID’:
          • ‘sentNo’:
          • ‘text’: the text of the sentence
          • ‘aPos’:
          • ‘annotationSet’: a list of annotation sets
            • each item in the list is a dict with the following keys:
              • ‘ID’:
              • ‘status’:
              • ‘layer’: a list of layers
                • each layer is a dict containing the following keys:
                  • ‘name’: layer name (e.g. ‘BNC’)
                  • ‘rank’:
                  • ‘label’: a list of labels for the layer
                    • each label is a dict containing the following keys:
                      • ‘start’: start pos of label in sentence ‘text’ (0-based)
                      • ‘end’: end pos of label in sentence ‘text’ (0-based)
                      • ‘name’: name of label (e.g. ‘NN1’)

Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.

Parameters:
  • fn_luid (int) – The id number of the lexical unit
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

All information about the lexical unit

Return type:

dict

lu_basic(fn_luid)[source]

Returns basic information about the LU whose id is fn_luid. This is basically just a wrapper around the lu() function with “subCorpus” info excluded.

>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyDict
>>> lu = PrettyDict(fn.lu_basic(256), breakLines=True)
>>> # ellipses account for differences between FN 1.5 and 1.7
>>> lu 
{'ID': 256,
 'POS': 'V',
 'URL': u'https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu256.xml',
 '_type': 'lu',
 'cBy': ...,
 'cDate': '02/08/2001 01:27:50 PST Thu',
 'definition': 'COD: be aware of beforehand; predict.',
 'definitionMarkup': 'COD: be aware of beforehand; predict.',
 'frame': <frame ID=26 name=Expectation>,
 'lemmaID': 15082,
 'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}],
 'name': 'foresee.v',
 'semTypes': [],
 'sentenceCount': {'annotated': ..., 'total': ...},
 'status': 'FN1_Sent'}
Parameters:fn_luid (int) – The id number of the desired LU
Returns:Basic information about the lexical unit
Return type:dict
lu_ids_and_names(name=None)[source]

Uses the LU index, which is much faster than looking up each LU definition when only the names and IDs are needed.
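
A short example, on the assumption that the method returns a dict mapping LU IDs to LU names:

>>> from nltk.corpus import framenet as fn
>>> ids_and_names = fn.lu_ids_and_names(r'(?i)^foresee')
>>> sorted(ids_and_names.items())    # output omitted; pairs such as (256, 'foresee.v') are expected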

lus(name=None, frame=None)[source]

Obtain details for lexical units. Optionally restrict by lexical unit name pattern, and/or to a certain frame or frames whose name matches a pattern.

>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyList
>>> from operator import itemgetter
>>> len(fn.lus()) in (11829, 13572) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(sorted(fn.lus(r'(?i)a little'), key=itemgetter('ID')), maxReprSize=0, breakLines=True)
[<lu ID=14733 name=a little.n>,
 <lu ID=14743 name=a little.adv>,
 <lu ID=14744 name=a little bit.adv>]
>>> PrettyList(sorted(fn.lus(r'interest', r'(?i)stimulus'), key=itemgetter('ID')))
[<lu ID=14894 name=interested.a>, <lu ID=14920 name=interesting.a>]

A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.

We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:

  • Apply_heat: “Michelle baked the potatoes for 45 minutes.”
  • Cooking_creation: “Michelle baked her mother a cake for her birthday.”
  • Absorb_heat: “The potatoes have to bake for more than 30 minutes.”

These constitute three different LUs, with different definitions.

Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.

Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.

Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.

In the simplest case, frame-evoking words are verbs such as “fried” in:

“Matilde fried the catfish in a heavy iron skillet.”

Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:

“…the reduction of debt levels to $665 million from $2.6 billion.”

Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:

“They were asleep for hours.”

Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.

Parameters:name (str) –

A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows the dot. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned.

The valid POSes are:

v - verb
n - noun
a - adjective
adv - adverb
prep - preposition
num - numbers
intj - interjection
art - article
c - conjunction
scon - subordinating conjunction
Returns:A list of selected (or all) lexical units
Return type:list of LU objects (dicts). See the lu() function for info about the specifics of LU objects.
propagate_semtypes()[source]

Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)

>>> from nltk.corpus import framenet as fn
>>> x = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> fn.propagate_semtypes()
>>> y = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> y-x > 1000
True
readme()[source]

Return the contents of the corpus README.txt (or README) file.

semtype(key)[source]
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
Parameters:key (string or int) – The name, abbreviation, or id number of the semantic type
Returns:Information about a semantic type
Return type:dict
semtype_inherits(st, superST)[source]
semtypes()[source]

Obtain a list of semantic types.

>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes) in (73, 109) # FN 1.5 and 1.7, resp.
True
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'definitionMarkup', 'name', 'rootType', 'subTypes', 'superType']
Returns:A list of all of the semantic types in framenet
Return type:list(dict)
sents(exemplars=True, full_text=True)[source]

Annotated sentences matching the specified criteria.

warnings(v)[source]

Enable or disable warnings of data integrity issues as they are encountered. If v is truthy, warnings will be enabled.

(This is a function rather than just an attribute/property to ensure that if enabling warnings is the first action taken, the corpus reader is instantiated first.)

exception nltk.corpus.reader.framenet.FramenetError[source]

Bases: Exception

An exception class for framenet-related errors.

class nltk.corpus.reader.framenet.Future(loader, *args, **kwargs)[source]

Bases: object

Wraps and acts as a proxy for a value to be loaded lazily (on demand). Adapted from https://gist.github.com/sergey-miryanov/2935416

class nltk.corpus.reader.framenet.PrettyDict(*args, **kwargs)[source]

Bases: nltk.corpus.reader.framenet.AttrDict

Displays an abbreviated repr of values where possible. Inherits from AttrDict, so a callable value will be lazily converted to an actual value.

unicode_repr()

Return repr(self).

class nltk.corpus.reader.framenet.PrettyLazyConcatenation(list_of_lists)[source]

Bases: nltk.collections.LazyConcatenation

Displays an abbreviated repr of only the first several elements, not the whole list.

unicode_repr()

Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.

class nltk.corpus.reader.framenet.PrettyLazyIteratorList(it, known_len=None)[source]

Bases: nltk.collections.LazyIteratorList

Displays an abbreviated repr of only the first several elements, not the whole list.

unicode_repr()

Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.

class nltk.corpus.reader.framenet.PrettyLazyMap(function, *lists, **config)[source]

Bases: nltk.collections.LazyMap

Displays an abbreviated repr of only the first several elements, not the whole list.

unicode_repr()

Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.

class nltk.corpus.reader.framenet.PrettyList(*args, **kwargs)[source]

Bases: list

Displays an abbreviated repr of only the first several elements, not the whole list.

unicode_repr()

Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.

class nltk.corpus.reader.framenet.SpecialList(typ, *args, **kwargs)[source]

Bases: list

A list subclass which adds a ‘_type’ attribute for special printing (similar to an AttrDict, though this is NOT an AttrDict subclass).

unicode_repr()

Return repr(self).

nltk.corpus.reader.framenet.demo()[source]
nltk.corpus.reader.framenet.mimic_wrap(lines, wrap_at=65, **kwargs)[source]

Wrap the first of ‘lines’ with textwrap and the remaining lines at exactly the same positions as the first.

nltk.corpus.reader.ieer module

Corpus reader for the Information Extraction and Entity Recognition Corpus.

NIST 1999 Information Extraction: Entity Recognition Evaluation http://www.itl.nist.gov/iad/894.01/tests/ie-er/er_99/er_99.htm

This corpus contains the NEWSWIRE development test data for the NIST 1999 IE-ER Evaluation. The files were taken from the subdirectory: /ie_er_99/english/devtest/newswire/*.ref.nwt and filenames were shortened.

The corpus contains the following files: APW_19980314, APW_19980424, APW_19980429, NYT_19980315, NYT_19980403, and NYT_19980407.

class nltk.corpus.reader.ieer.IEERCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

docs(fileids=None)[source]
parsed_docs(fileids=None)[source]
raw(fileids=None)[source]
class nltk.corpus.reader.ieer.IEERDocument(text, docno=None, doctype=None, date_time=None, headline='')[source]

Bases: object

unicode_repr()

Return repr(self).

nltk.corpus.reader.ieer.documents = ['APW_19980314', 'APW_19980424', 'APW_19980429', 'NYT_19980315', 'NYT_19980403', 'NYT_19980407']

A list of all documents in this corpus.

nltk.corpus.reader.ieer.titles = {'APW_19980314': 'Associated Press Weekly, 14 March 1998', 'APW_19980424': 'Associated Press Weekly, 24 April 1998', 'APW_19980429': 'Associated Press Weekly, 29 April 1998', 'NYT_19980315': 'New York Times, 15 March 1998', 'NYT_19980403': 'New York Times, 3 April 1998', 'NYT_19980407': 'New York Times, 7 April 1998'}

A dictionary whose keys are the names of documents in this corpus; and whose values are descriptions of those documents’ contents.
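
A brief usage sketch (it assumes the ieer corpus data has been downloaded); the attribute names follow the IEERDocument constructor above:

>>> from nltk.corpus import ieer
>>> sorted(ieer.fileids())           # the six documents listed above
>>> doc = ieer.parsed_docs('NYT_19980315')[0]
>>> doc.docno, doc.headline          # output omitted; doc.text holds the named-entity-chunked text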

nltk.corpus.reader.indian module

Indian Language POS-Tagged Corpus
Collected by A Kumaran, Microsoft Research, India
Distributed with permission

Contents:
  • Bangla: IIT Kharagpur
  • Hindi: Microsoft Research India
  • Marathi: IIT Bombay
  • Telugu: IIIT Hyderabad
class nltk.corpus.reader.indian.IndianCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.

raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
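
A short example (it assumes the corpus data is installed and that ‘hindi.pos’ is one of its fileids):

>>> from nltk.corpus import indian
>>> indian.fileids()                          # output omitted; e.g. 'bangla.pos', 'hindi.pos', 'marathi.pos', 'telugu.pos'
>>> indian.tagged_words('hindi.pos')[:3]      # (word, tag) pairs
>>> indian.sents('hindi.pos')[0]              # first sentence as a list of word strings
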
class nltk.corpus.reader.indian.IndianCorpusView(corpus_file, encoding, tagged, group_by_sent, tag_mapping_function=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream

nltk.corpus.reader.ipipan module

class nltk.corpus.reader.ipipan.IPIPANCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader designed to work with corpus created by IPI PAN. See http://korpus.pl/en/ for more details about IPI PAN corpus.

The corpus includes information about text domain, channel and categories. You can access possible values using domains(), channels() and categories(). You can use also this metadata to filter files, e.g.: fileids(channel='prasa'), fileids(categories='publicystyczny').

The reader supports the methods words, sents, paras and their tagged versions. You can get the part of speech instead of the full tag by passing the parameter “simplify_tags=True”, e.g. tagged_sents(simplify_tags=True).

You can also get all disambiguated tags by specifying the parameter “one_tag=False”, e.g. tagged_paras(one_tag=False).

You can get all tags that were assigned by the morphological analyzer by specifying the parameter “disamb_only=False”, e.g. tagged_words(disamb_only=False).

The IPIPAN Corpus contains tags indicating whether there is a space between two tokens. To add special “no space” markers, specify the parameter “append_no_space=True”, e.g. tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, a new pair (‘’, ‘no-space’) will be inserted for tagged data, and just ‘’ for methods without tags.

The corpus reader can also try to append spaces between words. To enable this option, specify the parameter “append_space=True”, e.g. words(append_space=True). As a result, either ‘ ‘ or (‘ ‘, ‘space’) will be inserted between tokens.

By default, xml entities like &quot; and &amp; are replaced by the corresponding characters. You can turn off this feature by specifying the parameter “replace_xmlentities=False”, e.g. words(replace_xmlentities=False).
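
A usage sketch tying the keyword arguments above together (it assumes the ipipan corpus data is installed and available as nltk.corpus.ipipan):

>>> from nltk.corpus import ipipan
>>> ipipan.channels()                                    # output omitted
>>> press = ipipan.fileids(channels='prasa')
>>> ipipan.tagged_words(press[:1], simplify_tags=True)   # part of speech instead of the full tag
>>> ipipan.words(press[:1], append_space=True)           # ' ' markers inserted between tokens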

categories(fileids=None)[source]
channels(fileids=None)[source]
domains(fileids=None)[source]
fileids(channels=None, domains=None, categories=None)[source]

Return a list of file identifiers for the fileids that make up this corpus.

paras(fileids=None, **kwargs)[source]
raw(fileids=None)[source]
sents(fileids=None, **kwargs)[source]
tagged_paras(fileids=None, **kwargs)[source]
tagged_sents(fileids=None, **kwargs)[source]
tagged_words(fileids=None, **kwargs)[source]
words(fileids=None, **kwargs)[source]
class nltk.corpus.reader.ipipan.IPIPANCorpusView(filename, startpos=0, **kwargs)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

PARAS_MODE = 2
SENTS_MODE = 1
WORDS_MODE = 0
read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream

nltk.corpus.reader.knbc module

class nltk.corpus.reader.knbc.KNBCorpusReader(root, fileids, encoding='utf8', morphs2str=<function <lambda>>)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

This class implements:
  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of list of words.
  • _tag, which takes a block and returns a list of list of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others …)
>>> from nltk.corpus.util import LazyCorpusLoader
>>> from nltk.corpus.reader.knbc import KNBCorpusReader
>>> knbc = LazyCorpusLoader(
...     'knbc/corpus1',
...     KNBCorpusReader,
...     r'.*/KN.*',
...     encoding='euc-jp',
... )
>>> len(knbc.sents()[0])
9
nltk.corpus.reader.knbc.demo()[source]
nltk.corpus.reader.knbc.test()[source]

nltk.corpus.reader.lin module

class nltk.corpus.reader.lin.LinThesaurusCorpusReader(root, badscore=0.0)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.

scored_synonyms(ngram, fileid=None)[source]

Returns a list of scored synonyms (tuples of synonyms and scores) for the current ngram

Parameters:
  • ngram (C{string}) – ngram to lookup
  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.

similarity(ngram1, ngram2, fileid=None)[source]

Returns the similarity score for two ngrams.

Parameters:
  • ngram1 (C{string}) – first ngram to compare
  • ngram2 (C{string}) – second ngram to compare
  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.

synonyms(ngram, fileid=None)[source]

Returns a list of synonyms for the current ngram.

Parameters:
  • ngram (C{string}) – ngram to lookup
  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.
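
A brief example (it assumes the Lin thesaurus data is installed as nltk.corpus.lin_thesaurus, and that ‘simN.lsp’ is the noun thesaurus fileid):

>>> from nltk.corpus import lin_thesaurus as thes
>>> thes.fileids()                                                # output omitted
>>> sorted(thes.synonyms('book', fileid='simN.lsp'))[:5]          # plain synonyms from one fileid
>>> sorted(thes.scored_synonyms('book', fileid='simN.lsp'))[:3]   # scored synonym tuples
>>> thes.similarity('book', 'novel', fileid='simN.lsp')           # a single similarity score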

nltk.corpus.reader.lin.demo()[source]

nltk.corpus.reader.mte module

A reader for corpora whose documents are in MTE format.

class nltk.corpus.reader.mte.MTECorpusReader(root=None, fileids=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset.

lemma_paras(fileids=None)[source]
param fileids:A list specifying the fileids that should be used.
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of tuples of the word and the corresponding lemma (word, lemma)
Return type:list(List(List(tuple(str, str))))
lemma_sents(fileids=None)[source]
param fileids:A list specifying the fileids that should be used.
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of tuples of the word and the corresponding lemma (word, lemma)
Return type:list(list(tuple(str, str)))
lemma_words(fileids=None)[source]
param fileids:A list specifying the fileids that should be used.
Returns:the given file(s) as a list of words, the corresponding lemmas and punctuation symbols, encoded as tuples (word, lemma)
Return type:list(tuple(str,str))
paras(fileids=None)[source]
param fileids:A list specifying the fileids that should be used.
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word string
Return type:list(list(list(str)))
raw(fileids=None)[source]
param fileids:A list specifying the fileids that should be used.
Returns:the given file(s) as a single string.
Return type:str
readme()[source]

Prints some information about this corpus.

Returns:the content of the attached README file
Return type:str

sents(fileids=None)[source]
param fileids:A list specifying the fileids that should be used.
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings
Return type:list(list(str))
tagged_paras(fileids=None, tagset='msd', tags='')[source]
param fileids:A list specifying the fileids that should be used.
Parameters:
  • tagset – The tagset that should be used in the returned object, either “universal” or “msd”, “msd” is the default
  • tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
Returns:

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word,tag) tuples

Return type:

list(list(list(tuple(str, str))))

tagged_sents(fileids=None, tagset='msd', tags='')[source]
param fileids:A list specifying the fileids that should be used.
Parameters:
  • tagset – The tagset that should be used in the returned object, either “universal” or “msd”, “msd” is the default
  • tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of (word,tag) tuples

Return type:

list(list(tuple(str, str)))

tagged_words(fileids=None, tagset='msd', tags='')[source]
param fileids:A list specifying the fileids that should be used.
Parameters:
  • tagset – The tagset that should be used in the returned object, either “universal” or “msd”, “msd” is the default
  • tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
Returns:

the given file(s) as a list of tagged words and punctuation symbols encoded as tuples (word, tag)

Return type:

list(tuple(str, str))

words(fileids=None)[source]
param fileids:A list specifying the fileids that should be used.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
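
An illustrative example (it assumes the MULTEXT-East data is installed and exposed as nltk.corpus.multext_east, with fileids such as ‘oana-en.xml’):

>>> from nltk.corpus import multext_east
>>> multext_east.words('oana-en.xml')[:10]                            # output omitted
>>> multext_east.tagged_words('oana-en.xml', tagset='universal')[:5]  # (word, universal tag) pairs
>>> multext_east.lemma_sents('oana-en.xml')[0]                        # (word, lemma) pairs for the first sentence
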
class nltk.corpus.reader.mte.MTECorpusView(fileid, tagspec, elt_handler=None)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

Class for lazy viewing the MTE Corpus.

read_block(stream, tagspec=None, elt_handler=None)[source]

Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.

class nltk.corpus.reader.mte.MTEFileReader(file_path)[source]

Bases: object

Class for loading the content of the multext-east corpus. It parses the xml files and does some tag-filtering depending on the given method parameters.

lemma_paras()[source]
lemma_sents()[source]
lemma_words()[source]
ns = {'tei': 'http://www.tei-c.org/ns/1.0', 'xml': 'http://www.w3.org/XML/1998/namespace'}
para_path = 'TEI/text/body/div/div/p'
paras()[source]
sent_path = 'TEI/text/body/div/div/p/s'
sents()[source]
tag_ns = '{http://www.tei-c.org/ns/1.0}'
tagged_paras(tagset, tags)[source]
tagged_sents(tagset, tags)[source]
tagged_words(tagset, tags)[source]
word_path = 'TEI/text/body/div/div/p/s/(w|c)'
words()[source]
xml_ns = '{http://www.w3.org/XML/1998/namespace}'
class nltk.corpus.reader.mte.MTETagConverter[source]

Bases: object

Class for converting MSD tags to universal tags; other conversion options are not currently implemented.

mapping_msd_universal = {'-': 'X', '.': '.', 'A': 'ADJ', 'C': 'CONJ', 'D': 'DET', 'M': 'NUM', 'N': 'NOUN', 'P': 'PRON', 'Q': 'PRT', 'R': 'ADV', 'S': 'ADP', 'V': 'VERB'}
static msd_to_universal(tag)[source]

This function converts an MSD tag from MULTEXT-East to the universal tagset, as described in Chapter 5 of the NLTK book.

Unknown tags will be mapped to X. Punctuation marks are not supported in MSD tags.
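
A small example, assuming the conversion keys on the leading category letter of the MSD tag (see the mapping table above):

>>> from nltk.corpus.reader.mte import MTETagConverter
>>> MTETagConverter.msd_to_universal('Ncmsn')   # an MSD tag starting with 'N' (noun)
'NOUN'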

nltk.corpus.reader.mte.xpath(root, path, ns)[source]

nltk.corpus.reader.nkjp module

class nltk.corpus.reader.nkjp.NKJPCorpusReader(root, fileids='.*')[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

HEADER_MODE = 2
RAW_MODE = 3
SENTS_MODE = 1
WORDS_MODE = 0
add_root(fileid)[source]

Add root if necessary to specified fileid.

fileids()[source]

Returns a list of file identifiers for the fileids that make up this corpus.

get_paths()[source]
header(fileids=None, **kwargs)[source]

Returns header(s) of specified fileids.

raw(fileids=None, **kwargs)[source]

Returns the raw text of the specified fileids.

sents(fileids=None, **kwargs)[source]

Returns sentences in specified fileids.

tagged_words(fileids=None, **kwargs)[source]

Call with specified tags as a list, e.g. tags=[‘subst’, ‘comp’]. Returns tagged words in specified fileids.

words(fileids=None, **kwargs)[source]

Returns words in specified fileids.
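
A usage sketch; the root path below is hypothetical and must point at a local copy of the NKJP corpus, and fileids='' is assumed to select the whole corpus:

>>> from nltk.corpus.reader.nkjp import NKJPCorpusReader
>>> nkjp = NKJPCorpusReader(root='/path/to/nkjp/', fileids='')   # hypothetical local path
>>> nkjp.header()                                  # output omitted
>>> nkjp.words()[:10]
>>> nkjp.tagged_words(tags=['subst', 'comp'])[:10]
>>> nkjp.sents()[0]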

class nltk.corpus.reader.nkjp.NKJPCorpus_Header_View(filename, **kwargs)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
handle_query()[source]
class nltk.corpus.reader.nkjp.NKJPCorpus_Morph_View(filename, **kwargs)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream backed corpus view specialized for use with ann_morphosyntax.xml files in NKJP corpus.

handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
handle_query()[source]
class nltk.corpus.reader.nkjp.NKJPCorpus_Segmentation_View(filename, **kwargs)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream backed corpus view specialized for use with ann_segmentation.xml files in NKJP corpus.

get_segm_id(example_word)[source]
get_sent_beg(beg_word)[source]
get_sent_end(end_word)[source]
get_sentences(sent_segm)[source]
handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
handle_query()[source]
remove_choice(segm)[source]
class nltk.corpus.reader.nkjp.NKJPCorpus_Text_View(filename, **kwargs)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream backed corpus view specialized for use with text.xml files in NKJP corpus.

RAW_MODE = 1
SENTS_MODE = 0
get_segm_id(elt)[source]
handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
handle_query()[source]
read_block(stream, tagspec=None, elt_handler=None)[source]

Returns text as a list of sentences.

class nltk.corpus.reader.nkjp.XML_Tool(root, filename)[source]

Bases: object

Helper class that creates a copy of an XML file with references to the nkjp: namespace removed. This is needed because the XMLCorpusView assumes that one can find short substrings of XML that are valid XML, which is not true if a namespace is declared at the top level.

build_preprocessed_file()[source]
remove_preprocessed_file()[source]

nltk.corpus.reader.nombank module

class nltk.corpus.reader.nombank.NombankChainTreePointer(pieces)[source]

Bases: nltk.corpus.reader.nombank.NombankPointer

pieces = None

A list of the pieces that make up this chain. Elements may be either NombankSplitTreePointer or NombankTreePointer pointers.

select(tree)[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.nombank.NombankCorpusReader(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

instances(baseform=None)[source]
Returns:a corpus view that acts as a list of

NombankInstance objects, one for each noun in the corpus.

lines()[source]
Returns:a corpus view that acts as a list of strings, one for

each line in the predicate-argument annotation file.

nouns()[source]
Returns:a corpus view that acts as a list of all noun lemmas

in this corpus (from the nombank.1.0.words file).

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)[source]
Returns:the xml description for the given roleset.
rolesets(baseform=None)[source]
Returns:list of xml descriptions for rolesets.
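
A brief example (it assumes the nombank corpus data is installed); the attribute names follow the NombankInstance class documented next:

>>> from nltk.corpus import nombank
>>> inst = nombank.instances()[0]
>>> inst.fileid, inst.sentnum, inst.wordnum      # where the noun instance occurs
>>> inst.baseform, inst.sensenumber              # the predicate lemma and its sense number
>>> inst.predicate                               # a NombankTreePointer into the Treebank parse
>>> inst.arguments                               # (pointer, argument label) pairs, e.g. 'ARG0'
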
class nltk.corpus.reader.nombank.NombankInstance(fileid, sentnum, wordnum, baseform, sensenumber, predicate, predid, arguments, parse_corpus=None)[source]

Bases: object

arguments = None

A list of tuples (argloc, argid), specifying the location and identifier for each of the predicate’s argument in the containing sentence. Argument identifiers are strings such as 'ARG0' or 'ARGM-TMP'. This list does not contain the predicate.

baseform = None

The baseform of the predicate.

fileid = None

The name of the file containing the parse tree for this instance’s sentence.

static parse(s, parse_fileid_xform=None, parse_corpus=None)[source]
parse_corpus = None

A corpus reader for the parse trees corresponding to the instances in this nombank corpus.

predicate = None

A NombankTreePointer indicating the position of this instance’s predicate within its containing sentence.

predid = None

Identifier of the predicate.

roleset

The name of the roleset used by this instance’s predicate. Use nombank.roleset() <NombankCorpusReader.roleset> to look up information about the roleset.

sensenumber = None

The sense number of the predicate.

sentnum = None

The sentence number of this sentence within fileid. Indexing starts from zero.

tree

The parse tree corresponding to this instance, or None if the corresponding tree is not available.

unicode_repr()

Return repr(self).

wordnum = None

The word number of this instance’s predicate within its containing sentence. Word numbers are indexed starting from zero, and include traces and other empty parse elements.

class nltk.corpus.reader.nombank.NombankPointer[source]

Bases: object

A pointer used by nombank to identify one or more constituents in a parse tree. NombankPointer is an abstract base class with three concrete subclasses:

  • NombankTreePointer is used to point to single constituents.
  • NombankSplitTreePointer is used to point to ‘split’ constituents, which consist of a sequence of two or more NombankTreePointer pointers.
  • NombankChainTreePointer is used to point to entire trace chains in a tree. It consists of a sequence of pieces, which can be NombankTreePointer or NombankSplitTreePointer pointers.
class nltk.corpus.reader.nombank.NombankSplitTreePointer(pieces)[source]

Bases: nltk.corpus.reader.nombank.NombankPointer

pieces = None

A list of the pieces that make up this chain. Elements are all NombankTreePointer pointers.

select(tree)[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.nombank.NombankTreePointer(wordnum, height)[source]

Bases: nltk.corpus.reader.nombank.NombankPointer

wordnum:height*wordnum:height*… wordnum:height,

static parse(s)[source]
select(tree)[source]
treepos(tree)[source]

Convert this pointer to a standard ‘tree position’ pointer, given that it points to the given tree.

unicode_repr()

Return repr(self).

nltk.corpus.reader.nps_chat module

class nltk.corpus.reader.nps_chat.NPSChatCorpusReader(root, fileids, wrap_etree=False, tagset=None)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

posts(fileids=None)[source]
tagged_posts(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
xml_posts(fileids=None)[source]
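
A short example (it assumes the nps_chat corpus is installed; ‘10-19-20s_706posts.xml’ is one of its fileids):

>>> from nltk.corpus import nps_chat
>>> nps_chat.posts('10-19-20s_706posts.xml')[0]             # one post as a list of tokens
>>> nps_chat.tagged_posts('10-19-20s_706posts.xml')[0][:5]  # (word, tag) pairs
>>> post = nps_chat.xml_posts('10-19-20s_706posts.xml')[0]
>>> post.get('class'), post.text                            # dialogue-act label and raw text of the post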

nltk.corpus.reader.opinion_lexicon module

CorpusReader for the Opinion Lexicon.

  • Opinion Lexicon information -
Authors: Minqing Hu and Bing Liu, 2004.
Department of Computer Science, University of Illinois at Chicago
Contact: Bing Liu, liub@cs.uic.edu
http://www.cs.uic.edu/~liub

Distributed with permission.

Related papers:

  • Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”.
    Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-04), Aug 22-25, 2004, Seattle, Washington, USA.
  • Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and
    Comparing Opinions on the Web”. Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
class nltk.corpus.reader.opinion_lexicon.IgnoreReadmeCorpusView(*args, **kwargs)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

This CorpusView is used to skip the initial readme block of the corpus.

class nltk.corpus.reader.opinion_lexicon.OpinionLexiconCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

Reader for Liu and Hu opinion lexicon. Blank lines and readme are ignored.

>>> from nltk.corpus import opinion_lexicon
>>> opinion_lexicon.words()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]

The OpinionLexiconCorpusReader provides shortcuts to retrieve positive/negative words:

>>> opinion_lexicon.negative()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]

Note that words from words() method are sorted by file id, not alphabetically:

>>> opinion_lexicon.words()[0:10]
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably',
'abominate', 'abomination', 'abort', 'aborted']
>>> sorted(opinion_lexicon.words())[0:10]
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably',
'abominate', 'abomination', 'abort']
CorpusView

alias of IgnoreReadmeCorpusView

negative()[source]

Return all negative words in alphabetical order.

Returns:a list of negative words.
Return type:list(str)
positive()[source]

Return all positive words in alphabetical order.

Returns:a list of positive words.
Return type:list(str)
words(fileids=None)[source]

Return all words in the opinion lexicon. Note that these words are not sorted in alphabetical order.

Parameters:fileids – a list or regexp specifying the ids of the files whose words have to be returned.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)

nltk.corpus.reader.panlex_lite module

CorpusReader for PanLex Lite, a stripped down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite.

class nltk.corpus.reader.panlex_lite.Meaning(mn, attr)[source]

Bases: dict

Represents a single PanLex meaning. A meaning is a translation set derived from a single source.

expressions()[source]
Returns:the meaning’s expressions as a dictionary whose keys are language variety uniform identifiers and whose values are lists of expression texts.
Return type:dict
id()[source]
Returns:the meaning’s id.
Return type:int
quality()[source]
Returns:the meaning’s source’s quality (0=worst, 9=best).
Return type:int
source()[source]
Returns:the meaning’s source id.
Return type:int
source_group()[source]
Returns:the meaning’s source group id.
Return type:int
class nltk.corpus.reader.panlex_lite.PanLexLiteCorpusReader(root)[source]

Bases: nltk.corpus.reader.api.CorpusReader

MEANING_Q = '\n SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n ORDER BY dnx2.uq DESC\n '
TRANSLATION_Q = '\n SELECT s.tt, sum(s.uq) AS trq FROM (\n SELECT ex2.tt, max(dnx.uq) AS uq\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n GROUP BY ex2.tt, dnx.ui\n ) s\n GROUP BY s.tt\n ORDER BY trq DESC, s.tt\n '
language_varieties(lc=None)[source]

Return a list of PanLex language varieties.

Parameters:lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned.
Returns:the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name.
Return type:list(tuple)
meanings(expr_uid, expr_tt)[source]

Return a list of meanings for an expression.

Parameters:
  • expr_uid – the expression’s language variety, as a seven-character uniform identifier.
  • expr_tt – the expression’s text.
Returns:

a list of Meaning objects.

Return type:

list(Meaning)

translations(from_uid, from_tt, to_uid)[source]
Return a list of translations for an expression into a single language
variety.
Parameters:
  • from_uid – the source expression’s language variety, as a seven-character uniform identifier.
  • from_tt – the source expression’s text.
  • to_uid – the target language variety, as a seven-character uniform identifier.
Returns:a list of translation tuples. The first element is the expression text and the second element is the translation quality.
Return type:list(tuple)
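
A usage sketch (it assumes the panlex_lite database has been downloaded; ‘eng-000’ and ‘spa-000’ are assumed to be the uniform identifiers for English and Spanish):

>>> from nltk.corpus import panlex_lite as plx
>>> plx.language_varieties('eng')[:3]                    # output omitted
>>> plx.meanings('eng-000', 'book')[0].expressions()     # translation sets for the expression 'book'
>>> plx.translations('eng-000', 'book', 'spa-000')[:5]   # (text, quality) tuples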

nltk.corpus.reader.pl196x module

class nltk.corpus.reader.pl196x.Pl196xCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.xmldocs.XMLCorpusReader

decode_tag(tag)[source]
head_len = 2770
paras(fileids=None, categories=None, textids=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None, textids=None)[source]
tagged_paras(fileids=None, categories=None, textids=None)[source]
tagged_sents(fileids=None, categories=None, textids=None)[source]
tagged_words(fileids=None, categories=None, textids=None)[source]
textids(fileids=None, categories=None)[source]

In the pl196x corpus each category is stored in a single file, and thus both methods provide identical functionality. In order to accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks, giving much more control to the user.

words(fileids=None, categories=None, textids=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
xml(fileids=None, categories=None)[source]
class nltk.corpus.reader.pl196x.TEICorpusView(corpus_file, tagged, group_by_sent, group_by_para, tagset=None, head_len=0, textids=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream

nltk.corpus.reader.plaintext module

A reader for corpora that consist of plaintext documents.

class nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.plaintext.PlaintextCorpusReader

A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.

paras(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None, categories=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.plaintext.EuroparlCorpusReader[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from PlaintextCorpusReader except that:

  • Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.
  • For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
  • There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
  • The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
chapters(fileids=None)[source]
Returns:the given file(s) as a list of chapters, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
class nltk.corpus.reader.plaintext.PlaintextCorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.

CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
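
A minimal sketch of reading a local plaintext corpus; the directory and file pattern below are hypothetical:

>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> reader = PlaintextCorpusReader('/path/to/my_corpus', r'.*\.txt')   # hypothetical root and pattern
>>> reader.fileids()        # output omitted
>>> reader.words()[:10]
>>> reader.sents()[:2]
>>> reader.paras()[:1]
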
class nltk.corpus.reader.plaintext.PortugueseCategorizedPlaintextCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader

nltk.corpus.reader.ppattach module

Read lines from the Prepositional Phrase Attachment Corpus.

The PP Attachment Corpus contains several files having the format:

sentence_id verb noun1 preposition noun2 attachment

For example:

42960 gives authority to administration V
46742 gives inventors of microchip N

The PP attachment is to the verb phrase (V) or noun phrase (N), i.e.:

(VP gives (NP authority) (PP to administration))
(VP gives (NP inventors (PP of microchip)))

The corpus contains the following files:

training: training set
devset: development test set, used for algorithm development.
test: test set, used to report results
bitstrings: word classes derived from Mutual Information Clustering for the Wall Street Journal.

Ratnaparkhi, Adwait (1994). A Maximum Entropy Model for Prepositional Phrase Attachment. Proceedings of the ARPA Human Language Technology Conference. [http://www.cis.upenn.edu/~adwait/papers/hlt94.ps]

The PP Attachment Corpus is distributed with NLTK with the permission of the author.

class nltk.corpus.reader.ppattach.PPAttachment(sent, verb, noun1, prep, noun2, attachment)[source]

Bases: object

unicode_repr()

Return repr(self).

class nltk.corpus.reader.ppattach.PPAttachmentCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

sentence_id verb noun1 preposition noun2 attachment

attachments(fileids)[source]
raw(fileids=None)[source]
tuples(fileids)[source]
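
A brief example (it assumes the ppattach corpus is installed); the attribute names follow the PPAttachment constructor above:

>>> from nltk.corpus import ppattach
>>> inst = ppattach.attachments('training')[0]
>>> inst.sent, inst.verb, inst.noun1, inst.prep, inst.noun2
>>> inst.attachment                    # 'V' or 'N'
>>> ppattach.tuples('training')[0]     # the same record as a plain tuple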

nltk.corpus.reader.propbank module

class nltk.corpus.reader.propbank.PropbankChainTreePointer(pieces)[source]

Bases: nltk.corpus.reader.propbank.PropbankPointer

pieces = None

A list of the pieces that make up this chain. Elements may be either PropbankSplitTreePointer or PropbankTreePointer pointers.

select(tree)[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.propbank.PropbankCorpusReader(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

instances(baseform=None)[source]
Returns:a corpus view that acts as a list of

PropbankInstance objects, one for each verb in the corpus.

lines()[source]
Returns:a corpus view that acts as a list of strings, one for

each line in the predicate-argument annotation file.

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)[source]
Returns:the xml description for the given roleset.
rolesets(baseform=None)[source]
Returns:list of xml descriptions for rolesets.
verbs()[source]
Returns:a corpus view that acts as a list of all verb lemmas

in this corpus (from the verbs.txt file).
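
A brief example (it assumes the propbank corpus data is installed); the attribute names follow the PropbankInstance class documented next:

>>> from nltk.corpus import propbank
>>> inst = propbank.instances()[0]
>>> inst.fileid, inst.sentnum, inst.wordnum
>>> inst.roleset                              # a string such as 'rotate.01'
>>> inst.predicate                            # a PropbankTreePointer into the Treebank parse
>>> inst.arguments                            # (pointer, argument label) pairs
>>> inst.inflection                           # a PropbankInflection object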

class nltk.corpus.reader.propbank.PropbankInflection(form='-', tense='-', aspect='-', person='-', voice='-')[source]

Bases: object

ACTIVE = 'a'
FINITE = 'v'
FUTURE = 'f'
GERUND = 'g'
INFINITIVE = 'i'
NONE = '-'
PARTICIPLE = 'p'
PASSIVE = 'p'
PAST = 'p'
PERFECT = 'p'
PERFECT_AND_PROGRESSIVE = 'b'
PRESENT = 'n'
PROGRESSIVE = 'o'
THIRD_PERSON = '3'
static parse(s)[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.propbank.PropbankInstance(fileid, sentnum, wordnum, tagger, roleset, inflection, predicate, arguments, parse_corpus=None)[source]

Bases: object

arguments = None

A list of tuples (argloc, argid), specifying the location and identifier for each of the predicate’s argument in the containing sentence. Argument identifiers are strings such as 'ARG0' or 'ARGM-TMP'. This list does not contain the predicate.

baseform

The baseform of the predicate.

fileid = None

The name of the file containing the parse tree for this instance’s sentence.

inflection = None

A PropbankInflection object describing the inflection of this instance’s predicate.

static parse(s, parse_fileid_xform=None, parse_corpus=None)[source]
parse_corpus = None

A corpus reader for the parse trees corresponding to the instances in this propbank corpus.

predicate = None

A PropbankTreePointer indicating the position of this instance’s predicate within its containing sentence.

predid

Identifier of the predicate.

roleset = None

The name of the roleset used by this instance’s predicate. Use propbank.roleset() <PropbankCorpusReader.roleset> to look up information about the roleset.

sensenumber

The sense number of the predicate.

sentnum = None

The sentence number of this sentence within fileid. Indexing starts from zero.

tagger = None

An identifier for the tagger who tagged this instance; or 'gold' if this is an adjudicated instance.

tree

The parse tree corresponding to this instance, or None if the corresponding tree is not available.

unicode_repr()

Return repr(self).

wordnum = None

The word number of this instance’s predicate within its containing sentence. Word numbers are indexed starting from zero, and include traces and other empty parse elements.

class nltk.corpus.reader.propbank.PropbankPointer[source]

Bases: object

A pointer used by propbank to identify one or more constituents in a parse tree. PropbankPointer is an abstract base class with three concrete subclasses:

  • PropbankTreePointer is used to point to single constituents.
  • PropbankSplitTreePointer is used to point to ‘split’ constituents, which consist of a sequence of two or more PropbankTreePointer pointers.
  • PropbankChainTreePointer is used to point to entire trace chains in a tree. It consists of a sequence of pieces, which can be PropbankTreePointer or PropbankSplitTreePointer pointers.
class nltk.corpus.reader.propbank.PropbankSplitTreePointer(pieces)[source]

Bases: nltk.corpus.reader.propbank.PropbankPointer

pieces = None

A list of the pieces that make up this chain. Elements are all PropbankTreePointer pointers.

select(tree)[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.propbank.PropbankTreePointer(wordnum, height)[source]

Bases: nltk.corpus.reader.propbank.PropbankPointer

wordnum:height*wordnum:height*… wordnum:height,

static parse(s)[source]
select(tree)[source]
treepos(tree)[source]

Convert this pointer to a standard ‘tree position’ pointer, given that it points to the given tree.

unicode_repr()

Return repr(self).

nltk.corpus.reader.pros_cons module

CorpusReader for the Pros and Cons dataset.

  • Pros and Cons dataset information -
Contact: Bing Liu, liub@cs.uic.edu
http://www.cs.uic.edu/~liub

Distributed with permission.

Related papers:

  • Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”.
    Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.
  • Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing
    Opinions on the Web”. Proceedings of the 14th international World Wide Web conference (WWW-2005), May 10-14, 2005, in Chiba, Japan.
class nltk.corpus.reader.pros_cons.ProsConsCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), encoding='utf8', **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.api.CorpusReader

Reader for the Pros and Cons sentence dataset.

>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons')
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy',
'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'],
...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

sents(fileids=None, categories=None)[source]

Return all sentences in the corpus or in the specified files/categories.

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
  • categories – a list specifying the categories whose sentences have to be returned.
Returns:

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type:

list(list(str))

words(fileids=None, categories=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files/categories.

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose words have to be returned.
  • categories – a list specifying the categories whose words have to be returned.
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)
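A small usage sketch, assuming the standard ‘Pros’ and ‘Cons’ categories of the distributed corpus:

from nltk.corpus import pros_cons

pros_cons.categories()                   # lists the available categories
pros = pros_cons.sents(categories='Pros')
cons = pros_cons.sents(categories='Cons')
len(pros), len(cons)                     # number of tokenized sentences per category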

nltk.corpus.reader.reviews module

CorpusReader for reviews corpora (syntax based on Customer Review Corpus).

  • Customer Review Corpus information -
Annotated by: Minqing Hu and Bing Liu, 2004.
Department of Computer Science, University of Illinois at Chicago
Contact: Bing Liu, liub@cs.uic.edu
http://www.cs.uic.edu/~liub

Distributed with permission.

The “product_reviews_1” and “product_reviews_2” datasets respectively contain annotated customer reviews of 5 and 9 products from amazon.com.

Related papers:

  • Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”.
    Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-04), 2004.
  • Minqing Hu and Bing Liu. “Mining Opinion Features in Customer Reviews”.
    Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), 2004.
  • Xiaowen Ding, Bing Liu and Philip S. Yu. “A Holistic Lexicon-Based Approach to
    Opinion Mining.” Proceedings of the First ACM International Conference on Web Search and Data Mining (WSDM-2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA.

Symbols used in the annotated reviews:

[t] : the title of the review. Each [t] tag starts a review.
xxxx[+|-n]: xxxx is a product feature.
[+n]: Positive opinion, n is the opinion strength: 3 strongest, and 1 weakest.
      Note that the strength is quite subjective; you may want to ignore it and consider only + and -.
[-n]: Negative opinion
##  : start of each sentence. Each line is a sentence.
[u] : the feature does not appear explicitly in the sentence.
[p] : the feature does not appear in the sentence; pronoun resolution is needed.
[s] : suggestion or recommendation.
[cc]: comparison with a competing product from a different brand.
[cs]: comparison with a competing product from the same brand.

Note: Some of the files (e.g. “ipod.txt”, “Canon PowerShot SD500.txt”) do not provide separation between different reviews. This is because the dataset was specifically designed for aspect/feature-based sentiment analysis, for which sentence-level annotation is sufficient. For document-level classification and analysis, this peculiarity should be taken into consideration.
class nltk.corpus.reader.reviews.Review(title=None, review_lines=None)[source]

Bases: object

A Review is the main block of a ReviewsCorpusReader.

add_line(review_line)[source]

Add a line (ReviewLine) to the review.

Parameters:review_line – a ReviewLine instance that belongs to the Review.
features()[source]

Return a list of features in the review. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.

Returns:all features of the review as a list of tuples (feat, score).
Return type:list(tuple)
sents()[source]

Return all tokenized sentences in the review.

Returns:all sentences of the review as lists of tokens.
Return type:list(list(str))
unicode_repr()

Return repr(self).

class nltk.corpus.reader.reviews.ReviewLine(sent, features=None, notes=None)[source]

Bases: object

A ReviewLine represents a sentence of the review, together with (optional) annotations of its features and notes about the reviewed item.

unicode_repr()

Return repr(self).

class nltk.corpus.reader.reviews.ReviewsCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the Customer Review Data dataset by Hu, Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization.

>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am',
'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'),
('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'),
('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'),
('option', '+1')]

We can also reach the same information directly from the stream:

>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]

We can compute stats for specific product features:

>>> from __future__ import division
>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> # True division (imported above) gives a float result under Python 2.7 as well
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

features(fileids=None)[source]

Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.

Parameters:fileids – a list or regexp specifying the ids of the files whose features have to be returned.
Returns:all features for the item(s) in the given file(s).
Return type:list(tuple)
raw(fileids=None)[source]
Parameters:fileids – a list or regexp specifying the fileids of the files that have to be returned as a raw string.
Returns:the given file(s) as a single string.
Return type:str
readme()[source]

Return the contents of the corpus README.txt file.

reviews(fileids=None)[source]

Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.

Parameters:fileids – a list or regexp specifying the ids of the files whose reviews have to be returned.
Returns:the given file(s) as a list of reviews.
sents(fileids=None)[source]

Return all sentences in the corpus or in the specified files.

Parameters:fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
Returns:the given file(s) as a list of sentences, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files.

Parameters:fileids – a list or regexp specifying the ids of the files whose words have to be returned.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)

nltk.corpus.reader.rte module

Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora.

The files were taken from the RTE1, RTE2 and RTE3 datasets and the files were regularized.

Filenames are of the form rte*_dev.xml and rte*_test.xml. The latter are the gold standard annotated files.

Each entailment corpus is a list of ‘text’/’hypothesis’ pairs. The following example is taken from RTE3:

<pair id="1" entailment="YES" task="IE" length="short" >

   <t>The sale was made to pay Yukos' US$ 27.5 billion tax bill,
   Yuganskneftegaz was originally sold for US$ 9.4 billion to a little known
   company Baikalfinansgroup which was later bought by the Russian
   state-owned oil company Rosneft .</t>

  <h>Baikalfinansgroup was sold to Rosneft.</h>
</pair>

In order to provide globally unique IDs for each pair, a new attribute challenge has been added to the root element entailment-corpus of each file, taking values 1, 2 or 3. The GID is formatted ‘m-n’, where ‘m’ is the challenge number and ‘n’ is the pair ID.

class nltk.corpus.reader.rte.RTECorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for corpora in RTE challenges.

This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.

pairs(fileids)[source]

Build a list of RTEPairs from a RTE corpus.

Parameters:fileids – a list of RTE corpus fileids
Type:list
Return type:list(RTEPair)
class nltk.corpus.reader.rte.RTEPair(pair, challenge=None, id=None, text=None, hyp=None, value=None, task=None, length=None)[source]

Bases: object

Container for RTE text-hypothesis pairs.

The entailment relation is signalled by the value attribute in RTE1, and by entailment in RTE2 and RTE3. These both get mapped on to the entailment attribute of this class.

unicode_repr()

Return repr(self).

nltk.corpus.reader.rte.norm(value_string)[source]

Normalize the string value in an RTE pair’s value or entailment attribute as an integer (1, 0).

Parameters:value_string (str) – the label used to classify a text/hypothesis pair
Return type:int
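A short usage sketch, assuming the development files named above have been installed with the corpus; the entailment label is available on each pair as described for RTEPair:

from nltk.corpus import rte

pairs = rte.pairs(['rte1_dev.xml', 'rte2_dev.xml', 'rte3_dev.xml'])
pair = pairs[0]
pair.gid               # globally unique id of the form 'challenge-pairid'
pair.text, pair.hyp    # the text and hypothesis strings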

nltk.corpus.reader.semcor module

Corpus reader for the SemCor Corpus.

class nltk.corpus.reader.semcor.SemcorCorpusReader(root, fileids, wordnet, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().
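For example (a minimal sketch; the exact tokens depend on the installed corpus):

from nltk.corpus import semcor

semcor.words()[:7]                    # plain word list
semcor.tagged_chunks(tag='both')[0]   # first chunk as a Tree with POS and WordNet lemma tags
semcor.tagged_sents(tag='sem')[0]     # first sentence as a list of semantically tagged chunks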

chunk_sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of chunks.
Return type:list(list(list(str)))
chunks(fileids=None)[source]
Returns:the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit.
Return type:list(list(str))
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of word strings.
Return type:list(list(str))
tagged_chunks(fileids=None, tag='pos')[source]
Returns:the given file(s) as a list of tagged chunks, represented in tree form.
Return type:list(Tree)
Parameters:tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
tagged_sents(fileids=None, tag='pos')[source]
Returns:the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form).
Return type:list(list(Tree))
Parameters:tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.semcor.SemcorSentence(num, items)[source]

Bases: list

A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).

class nltk.corpus.reader.semcor.SemcorWordView(fileid, unit, bracket_sent, pos_tag, sem_tag, wordnet)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream backed corpus view specialized for use with the SemCor corpus.

handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
handle_sent(elt)[source]
handle_word(elt)[source]

nltk.corpus.reader.senseval module

Read from the Senseval 2 Corpus.

SENSEVAL [http://www.senseval.org/] Evaluation exercises for Word Sense Disambiguation. Organized by ACL-SIGLEX [http://www.siglex.org/]

Prepared by Ted Pedersen <tpederse@umn.edu>, University of Minnesota, http://www.d.umn.edu/~tpederse/data.html Distributed with permission.

The NLTK version of the Senseval 2 files uses well-formed XML. Each instance of the ambiguous words “hard”, “interest”, “line”, and “serve” is tagged with a sense identifier, and supplied with context.
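A brief usage sketch; the fileids and attribute values shown in the comments are assumptions based on the standard distribution:

from nltk.corpus import senseval

senseval.fileids()                      # e.g. ['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']
inst = senseval.instances('hard.pos')[0]
inst.word, inst.senses, inst.position   # target word, its sense label(s), and its index in the context
inst.context[:5]                        # surrounding (word, tag) context tokens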

class nltk.corpus.reader.senseval.SensevalCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

instances(fileids=None)[source]
raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
class nltk.corpus.reader.senseval.SensevalCorpusView(fileid, encoding)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream
class nltk.corpus.reader.senseval.SensevalInstance(word, position, context, senses)[source]

Bases: object

unicode_repr()

Return repr(self).

nltk.corpus.reader.sentiwordnet module

An NLTK interface for SentiWordNet

SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity.

For details about SentiWordNet see: http://sentiwordnet.isti.cnr.it/

>>> from nltk.corpus import sentiwordnet as swn
>>> print(swn.senti_synset('breakdown.n.03'))
<breakdown.n.03: PosScore=0.0 NegScore=0.25>
>>> list(swn.senti_synsets('slow'))
[SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'),
SentiSynset('slow.v.03'), SentiSynset('slow.a.01'),
SentiSynset('slow.a.02'), SentiSynset('dense.s.04'),
SentiSynset('slow.a.04'), SentiSynset('boring.s.01'),
SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'),
SentiSynset('behind.r.03')]
>>> happy = swn.senti_synsets('happy', 'a')
>>> happy0 = list(happy)[0]
>>> happy0.pos_score()
0.875
>>> happy0.neg_score()
0.0
>>> happy0.obj_score()
0.125
class nltk.corpus.reader.sentiwordnet.SentiSynset(pos_score, neg_score, synset)[source]

Bases: object

neg_score()[source]
obj_score()[source]
pos_score()[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.sentiwordnet.SentiWordNetCorpusReader(root, fileids, encoding='utf-8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

all_senti_synsets()[source]
senti_synset(*vals)[source]
senti_synsets(string, pos=None)[source]
unicode_repr()

Return repr(self).

nltk.corpus.reader.sinica_treebank module

Sinica Treebank Corpus Sample

http://rocling.iis.sinica.edu.tw/CKIP/engversion/treebank.htm

10,000 parsed sentences, drawn from the Academia Sinica Balanced Corpus of Modern Chinese. Parse tree notation is based on Information-based Case Grammar. Tagset documentation is available at http://www.sinica.edu.tw/SinicaCorpus/modern_e_wordtype.html

Language and Knowledge Processing Group, Institute of Information Science, Academia Sinica

The data is distributed with the Natural Language Toolkit under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike License [http://creativecommons.org/licenses/by-nc-sa/2.5/].

References:

Feng-Yi Chen, Pi-Fang Tsai, Keh-Jiann Chen, and Chu-Ren Huang (1999) The Construction of Sinica Treebank. Computational Linguistics and Chinese Language Processing, 4, pp 87-104.

Huang Chu-Ren, Keh-Jiann Chen, Feng-Yi Chen, Keh-Jiann Chen, Zhao-Ming Gao, and Kuang-Yu Chen. 2000. Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface. Proceedings of 2nd Chinese Language Processing Workshop, Association for Computational Linguistics.

Chen Keh-Jiann and Yu-Ming Hsieh (2004) Chinese Treebanks and Grammar Extraction, Proceedings of IJCNLP-04, pp560-565.

class nltk.corpus.reader.sinica_treebank.SinicaTreebankCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for the sinica treebank.

nltk.corpus.reader.string_category module

Read tuples from a corpus consisting of categorized strings. For example, from the question classification corpus:

NUM:dist How far is it from Denver to Aspen ?
LOC:city What county is Modesto , California in ?
HUM:desc Who was Galileo ?
DESC:def What is an atom ?
NUM:date When did Hawaii become a state ?

class nltk.corpus.reader.string_category.StringCategoryCorpusReader(root, fileids, delimiter=' ', encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
tuples(fileids=None)[source]
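A minimal sketch using the question-classification corpus distributed with NLTK, which is read with this class; the tuple contents shown in the comment are illustrative:

from nltk.corpus import qc

qc.fileids()               # e.g. ['test.txt', 'train.txt']
qc.tuples('train.txt')[0]  # a (category, question) pair,
                           # e.g. ('DESC:manner', 'How did serfdom develop in and then leave Russia ?')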

nltk.corpus.reader.switchboard module

class nltk.corpus.reader.switchboard.SwitchboardCorpusReader(root, tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

discourses()[source]
tagged_discourses(tagset=False)[source]
tagged_turns(tagset=None)[source]
tagged_words(tagset=None)[source]
turns()[source]
words()[source]
class nltk.corpus.reader.switchboard.SwitchboardTurn(words, speaker, id)[source]

Bases: list

A specialized list object used to encode switchboard utterances. The elements of the list are the words in the utterance, and two attributes, speaker and id, are provided to retrieve the speaker identifier and utterance id. Note that utterance ids are only unique within a given discourse.

unicode_repr()

Return repr(self).

nltk.corpus.reader.tagged module

A reader for corpora whose documents contain part-of-speech-tagged words.

class nltk.corpus.reader.tagged.CategorizedTaggedCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.tagged.TaggedCorpusReader

A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.

paras(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None, categories=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, categories=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, categories=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, categories=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.tagged.MacMorphoCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.

class nltk.corpus.reader.tagged.TaggedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=RegexpTokenizer(pattern='n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator. I.e., words should have the form:

word1/tag1 word2/tag2 word3/tag3 ...

But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.
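A minimal sketch of reading your own tagged corpus with this class; the directory and fileid pattern are placeholders:

from nltk.corpus.reader import TaggedCorpusReader

reader = TaggedCorpusReader('/path/to/corpus', r'.*\.pos')
reader.words()[:10]           # plain tokens
reader.tagged_words()[:10]    # (word, tag) tuples
reader.tagged_sents()[0]      # first sentence as a list of (word, tag) tuples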

paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.tagged.TaggedCorpusView(corpus_file, encoding, tagged, group_by_sent, group_by_para, sep, word_tokenizer, sent_tokenizer, para_block_reader, tag_mapping_function=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A specialized corpus view for tagged documents. It can be customized via flags to divide the tagged corpus documents up by sentence or paragraph, and to include or omit part of speech tags. TaggedCorpusView objects are typically created by TaggedCorpusReader (not directly by nltk users).

read_block(stream)[source]

Reads one paragraph at a time.

class nltk.corpus.reader.tagged.TimitTaggedCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for tagged sentences that are included in the TIMIT corpus.

paras()[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
tagged_paras()[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))

nltk.corpus.reader.timit module

Read tokens, phonemes and audio data from the NLTK TIMIT Corpus.

This corpus contains selected portion of the TIMIT corpus.

  • 16 speakers from 8 dialect regions
  • 1 male and 1 female from each dialect region
  • total 130 sentences (10 sentences per speaker; note that some sentences are shared among speakers, in particular sa1 and sa2 are spoken by all speakers)
  • total 160 recordings of sentences (10 recordings per speaker)
  • audio format: NIST Sphere, single channel, 16kHz sampling,
16 bit sample, PCM encoding

Module contents

The timit corpus reader provides 4 functions and 4 data items.

  • utterances

    List of utterances in the corpus. There are 160 utterances in total, each of which corresponds to a unique utterance of a speaker. Here’s an example of an utterance identifier in the list:

    dr1-fvmh0/sx206
      - _----  _---
      | |  |   | |
      | |  |   | |
      | |  |   | `--- sentence number
      | |  |   `----- sentence type (a:all, i:shared, x:exclusive)
      | |  `--------- speaker ID
      | `------------ sex (m:male, f:female)
      `-------------- dialect region (1..8)
    
  • speakers

    List of speaker IDs. An example of speaker ID:

    dr1-fvmh0
    

    Note that if you split an item ID on the forward slash and take the first element of the result, you will get a speaker ID.

    >>> itemid = 'dr1-fvmh0/sx206'
    >>> spkrid , sentid = itemid.split('/')
    >>> spkrid
    'dr1-fvmh0'
    

    The second element of the result is a sentence ID.

  • dictionary()

    Phonetic dictionary of words contained in this corpus. This is a Python dictionary from words to phoneme lists.

  • spkrinfo()

    Speaker information table. It’s a Python dictionary from speaker IDs to records of 10 fields. Speaker IDs are the same as the ones in timit.speakers. Each record is a dictionary from field names to values, and the fields are as follows:

    id         speaker ID as defined in the original TIMIT speaker info table
    sex        speaker gender (M:male, F:female)
    dr         speaker dialect region (1:new england, 2:northern,
               3:north midland, 4:south midland, 5:southern, 6:new york city,
               7:western, 8:army brat (moved around))
    use        corpus type (TRN:training, TST:test)
               in this sample corpus only TRN is available
    recdate    recording date
    birthdate  speaker birth date
    ht         speaker height
    race       speaker race (WHT:white, BLK:black, AMR:american indian,
               SPN:spanish-american, ORN:oriental,???:unknown)
    edu        speaker education level (HS:high school, AS:associate degree,
               BS:bachelor's degree (BS or BA), MS:master's degree (MS or MA),
               PHD:doctorate degree (PhD,JD,MD), ??:unknown)
    comments   comments by the recorder
    

The 4 functions are as follows.

  • tokenized(sentences=items, offset=False)

    Given a list of items, returns an iterator of a list of word lists, each of which corresponds to an item (sentence). If offset is set to True, each element of the word list is a tuple of word(string), start offset and end offset, where offset is represented as a number of 16kHz samples.

  • phonetic(sentences=items, offset=False)

    Given a list of items, returns an iterator of a list of phoneme lists, each of which corresponds to an item (sentence). If offset is set to True, each element of the phoneme list is a tuple of word(string), start offset and end offset, where offset is represented as a number of 16kHz samples.

  • audiodata(item, start=0, end=None)

    Given an item, returns a chunk of audio samples formatted into a string. If start and end are omitted, the entire recording is returned. If only end is omitted, samples from the start offset to the end of the recording are returned.

  • play(data)

    Play the given audio samples. The audio samples can be obtained from the timit.audiodata function.
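A short sketch of the corresponding corpus reader methods; the utterance id in the comment is only an example:

from nltk.corpus import timit

item = timit.utteranceids()[0]   # e.g. 'dr1-fvmh0/sa1'
timit.words(item)                # tokenized words of that utterance
timit.phones(item)               # its phoneme sequence
timit.word_times(item)           # (word, start, end) tuples; offsets are in 16kHz samples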

class nltk.corpus.reader.timit.SpeakerInfo(id, sex, dr, use, recdate, birthdate, ht, race, edu, comments=None)[source]

Bases: object

unicode_repr()

Return repr(self).

class nltk.corpus.reader.timit.TimitCorpusReader(root, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:

  • timitdic.txt: dictionary of standard transcriptions
  • spkrinfo.txt: table of speaker information

In addition, the root directory should contain one subdirectory for each speaker, containing three files for each utterance:

  • <utterance-id>.txt: text content of utterances
  • <utterance-id>.wrd: tokenized text content of utterances
  • <utterance-id>.phn: phonetic transcription of utterances
  • <utterance-id>.wav: utterance sound file
audiodata(utterance, start=0, end=None)[source]
fileids(filetype=None)[source]

Return a list of file identifiers for the files that make up this corpus.

Parameters:filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.
phone_times(utterances=None)[source]

offset is represented as a number of 16kHz samples!

phone_trees(utterances=None)[source]
phones(utterances=None)[source]
play(utterance, start=0, end=None)[source]

Play the given audio sample.

Parameters:utterance – The utterance id of the sample to play
sent_times(utterances=None)[source]
sentid(utterance)[source]
sents(utterances=None)[source]
spkrid(utterance)[source]
spkrinfo(speaker)[source]
Returns:the speaker information record for the given speaker, with the fields described above.
spkrutteranceids(speaker)[source]
Returns:A list of all utterances associated with a given speaker.

transcription_dict()[source]
Returns:A dictionary giving the ‘standard’ transcription for each word.

utterance(spkrid, sentid)[source]
utteranceids(dialect=None, sex=None, spkrid=None, sent_type=None, sentid=None)[source]
Returns:A list of the utterance identifiers for all utterances in this corpus, or for the given speaker, dialect region, gender, sentence type, or sentence number, if specified.

wav(utterance, start=0, end=None)[source]
word_times(utterances=None)[source]
words(utterances=None)[source]
nltk.corpus.reader.timit.read_timit_block(stream)[source]

Block reader for timit tagged sentences, which are preceded by a sentence number that will be ignored.

nltk.corpus.reader.toolbox module

Module for reading, writing and manipulating Toolbox databases and settings fileids.

class nltk.corpus.reader.toolbox.ToolboxCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

entries(fileids, **kwargs)[source]
fields(fileids, strip=True, unwrap=True, encoding='utf8', errors='strict', unicode_fields=None)[source]
raw(fileids)[source]
words(fileids, key='lx')[source]
xml(fileids, key=None)[source]
nltk.corpus.reader.toolbox.demo()[source]

nltk.corpus.reader.twitter module

A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON.

class nltk.corpus.reader.twitter.TwitterCorpusReader(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.

Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.

Construct a new Tweet corpus reader for a set of documents located at the given root directory.

If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:

from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids='.*\.json')

However, the recommended approach is to set the relevant directory as the value of the environmental variable TWITTER, and then invoke the reader as follows:

import os
root = os.environ['TWITTER']
reader = TwitterCorpusReader(root, '.*\.json')

If you want to work directly with the raw Tweets, the json library can be used:

import json
for tweet in reader.docs():
    print(json.dumps(tweet, indent=1, sort_keys=True))
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

docs(fileids=None)[source]

Returns the full Tweet objects, as specified by Twitter documentation on Tweets

Returns:the given file(s) as a list of dictionaries deserialised from JSON.
Return type:list(dict)

raw(fileids=None)[source]

Return the corpora in their raw form.

strings(fileids=None)[source]

Returns only the text content of Tweets in the file(s)

Returns:the given file(s) as a list of Tweets.
Return type:list(str)
tokenized(fileids=None)[source]
Returns:the given file(s) as a list of the text content of Tweets, each tokenized into a list of words, screen names, hashtags, URLs and punctuation symbols.

Return type:list(list(str))

nltk.corpus.reader.udhr module

UDHR corpus reader. It mostly deals with encodings.

class nltk.corpus.reader.udhr.UdhrCorpusReader(root='udhr')[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

ENCODINGS = [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]
SKIP = {'Hungarian_Magyar-Unicode', 'Vietnamese-VIQR', 'Japanese_Nihongo-JIS', 'Magahi-UTF8', 'Esperanto-T61', 'Chinese_Mandarin-UTF8', 'Burmese_Myanmar-UTF8', 'Marathi-UTF8', 'Vietnamese-VPS', 'Navaho_Dine-Navajo-Navaho-font', 'Magahi-Agra', 'Russian_Russky-UTF8~', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117', 'Vietnamese-TCVN', 'Chinese_Mandarin-HZ', 'Burmese_Myanmar-WinResearcher', 'Lao-UTF8', 'Bhojpuri-Agra', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Amharic-Afenegus6..60375', 'Tamil-UTF8', 'Gujarati-UTF8', 'Czech-Latin2-err', 'Armenian-DallakHelv', 'Tigrinya_Tigrigna-VG2Main'}

nltk.corpus.reader.util module

class nltk.corpus.reader.util.ConcatenatedCorpusView(corpus_views)[source]

Bases: nltk.collections.AbstractLazySequence

A ‘view’ of a corpus file that joins together one or more StreamBackedCorpusView objects. At most one file handle is left open at any time.

close()[source]
iterate_from(start_tok)[source]

Return an iterator that generates the tokens in the corpus file underlying this corpus view, starting at the token number start. If start>=len(self), then this iterator will generate no tokens.

class nltk.corpus.reader.util.PickleCorpusView(fileid, delete_on_gc=False)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A stream backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump). One use case for this class is to store the result of running feature detection on a corpus to disk. This can be useful when performing feature detection is expensive (so we don’t want to repeat it); but the corpus is too large to store in memory. The following example illustrates this technique:

>>> from nltk.corpus.reader.util import PickleCorpusView
>>> from nltk.util import LazyMap
>>> feature_corpus = LazyMap(detect_features, corpus) 
>>> PickleCorpusView.write(feature_corpus, some_fileid)  
>>> pcv = PickleCorpusView(some_fileid) 
BLOCK_SIZE = 100
PROTOCOL = -1
classmethod cache_to_tempfile(sequence, delete_on_gc=True)[source]

Write the given sequence to a temporary file as a pickle corpus; and then return a PickleCorpusView view for that temporary corpus file.

Parameters:delete_on_gc – If true, then the temporary file will be deleted whenever this object gets garbage-collected.
read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream
classmethod write(sequence, output_file)[source]
class nltk.corpus.reader.util.StreamBackedCorpusView(fileid, block_reader=None, startpos=0, encoding='utf8')[source]

Bases: nltk.collections.AbstractLazySequence

A ‘view’ of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc. However, the tokens are only constructed as-needed – the entire corpus is never stored in memory at once.

The constructor to StreamBackedCorpusView takes two arguments: a corpus fileid (specified as a string or as a PathPointer); and a block reader. A “block reader” is a function that reads zero or more tokens from a stream, and returns them as a list. A very simple example of a block reader is:

>>> def simple_block_reader(stream):
...     return stream.readline().split()

This simple block reader reads a single line at a time, and returns a single token (consisting of a string) for each whitespace-separated substring on the line.
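A minimal sketch of using such a block reader with a corpus view; the file path is a placeholder for any whitespace-delimited text file:

from nltk.corpus.reader.util import StreamBackedCorpusView

def simple_block_reader(stream):
    return stream.readline().split()

view = StreamBackedCorpusView('/path/to/corpus.txt', simple_block_reader)
view[0]        # first token; only the first block (line) is read
view[100:110]  # later tokens are located via the toknum/filepos mapping
len(view)      # forces a pass over the whole file to count tokens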

When deciding how to define the block reader for a given corpus, careful consideration should be given to the size of blocks handled by the block reader. Smaller block sizes will increase the memory requirements of the corpus view’s internal data structures (by 2 integers per block). On the other hand, larger block sizes may decrease performance for random access to the corpus. (But note that larger block sizes will not decrease performance for iteration.)

Internally, CorpusView maintains a partial mapping from token index to file position, with one entry per block. When a token with a given index i is requested, the CorpusView constructs it as follows:

  1. First, it searches the toknum/filepos mapping for the token index closest to (but less than or equal to) i.
  2. Then, starting at the file position corresponding to that index, it reads one block at a time using the block reader until it reaches the requested token.

The toknum/filepos mapping is created lazily: it is initially empty, but every time a new block is read, the block’s initial token is added to the mapping. (Thus, the toknum/filepos map has one entry per block.)

In order to increase efficiency for random access patterns that have high degrees of locality, the corpus view may cache one or more blocks.

Note:

Each CorpusView object internally maintains an open file object for its underlying corpus file. This file should be automatically closed when the CorpusView is garbage collected, but if you wish to close it manually, use the close() method. If you access a CorpusView’s items after it has been closed, the file object will be automatically re-opened.

Warning:

If the contents of the file are modified during the lifetime of the CorpusView, then the CorpusView’s behavior is undefined.

Warning:

If a unicode encoding is specified when constructing a CorpusView, then the block reader may only call stream.seek() with offsets that have been returned by stream.tell(); in particular, calling stream.seek() with relative offsets, or with offsets based on string lengths, may lead to incorrect behavior.

Variables:
  • _block_reader – The function used to read a single block from the underlying file stream.
  • _toknum – A list containing the token index of each block that has been processed. In particular, _toknum[i] is the token index of the first token in block i. Together with _filepos, this forms a partial mapping between token indices and file positions.
  • _filepos – A list containing the file position of each block that has been processed. In particular, _filepos[i] is the file position of the first character in block i. Together with _toknum, this forms a partial mapping between token indices and file positions.
  • _stream – The stream used to access the underlying corpus file.
  • _len – The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known.
  • _eofpos – The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached.
  • _cache – A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block.
close()[source]

Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view). If the corpus view is accessed after it is closed, it will be automatically re-opened.

fileid

The fileid of the file that is accessed by this view.

Type:str or PathPointer
iterate_from(start_tok)[source]

Return an iterator that generates the tokens in the corpus file underlying this corpus view, starting at the token number start. If start>=len(self), then this iterator will generate no tokens.

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream
nltk.corpus.reader.util.concat(docs)[source]

Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.

nltk.corpus.reader.util.find_corpus_fileids(root, regexp)[source]
nltk.corpus.reader.util.read_alignedsent_block(stream)[source]
nltk.corpus.reader.util.read_blankline_block(stream)[source]
nltk.corpus.reader.util.read_line_block(stream)[source]
nltk.corpus.reader.util.read_regexp_block(stream, start_re, end_re=None)[source]

Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching start_re or EOF is found.

nltk.corpus.reader.util.read_sexpr_block(stream, block_size=16384, comment_char=None)[source]

Read a sequence of s-expressions from the stream, and leave the stream’s file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.

If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.

Parameters:
  • block_size – The default block size for reading. If an s-expression is longer than one block, then more than one block will be read.
  • comment_char – A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.)
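A small sketch with an in-memory stream; the result shown in the comment is illustrative:

from io import StringIO
from nltk.corpus.reader.util import read_sexpr_block

text = "; a comment line\n(NP (DT the) (NN dog))\n(VP (VBD barked))\n"
stream = StringIO(text)
read_sexpr_block(stream, comment_char=';')
# e.g. ['(NP (DT the) (NN dog))', '(VP (VBD barked))']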
nltk.corpus.reader.util.read_whitespace_block(stream)[source]
nltk.corpus.reader.util.read_wordpunct_block(stream)[source]
nltk.corpus.reader.util.tagged_treebank_para_block_reader(stream)[source]

nltk.corpus.reader.verbnet module

An NLTK interface to the VerbNet verb lexicon

For details about VerbNet see: https://verbs.colorado.edu/~mpalmer/projects/verbnet.html

class nltk.corpus.reader.verbnet.VerbnetCorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

An NLTK interface to the VerbNet verb lexicon.

From the VerbNet site: “VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), XTAG (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998).”

For details about VerbNet see: https://verbs.colorado.edu/~mpalmer/projects/verbnet.html
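A brief usage sketch, assuming the VerbNet corpus has been downloaded; the lemma is only an example:

from nltk.corpus import verbnet

cids = verbnet.classids(lemma='run')   # VerbNet classes containing the lemma 'run'
vn = verbnet.vnclass(cids[0])          # ElementTree for the first matching class
print(verbnet.pprint(vn))              # pretty-printed members, thematic roles and frames
verbnet.lemmas(cids[0])                # member verbs of that class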

classids(lemma=None, wordnetid=None, fileid=None, classid=None)[source]

Return a list of the VerbNet class identifiers. If a file identifier is specified, then return only the VerbNet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only VerbNet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified VerbNet class. If nothing is specified, return all class identifiers within VerbNet.

fileids(vnclass_ids=None)[source]

Return a list of fileids that make up this corpus. If vnclass_ids is specified, then return the fileids that make up the specified VerbNet class(es).

frames(vnclass)[source]

Given a VerbNet class, this method returns VerbNet frames

The members returned are: 1) Example 2) Description 3) Syntax 4) Semantics

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
Returns:frames - a list of frame dictionaries
lemmas(vnclass=None)[source]

Return a list of all verb lemmas that appear in any class, or in the classid if specified.

longid(shortid)[source]

Returns longid of a VerbNet class

Given a short VerbNet class identifier (eg ‘37.10’), map it to a long id (eg ‘confess-37.10’). If shortid is already a long id, then return it as-is

pprint(vnclass)[source]

Returns pretty printed version of a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree

containing the xml contents of a VerbNet class.

pprint_frames(vnclass, indent='')[source]

Returns pretty version of all frames in a VerbNet class

Return a string containing a pretty-printed representation of the list of frames within the VerbNet class.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
pprint_members(vnclass, indent='')[source]

Returns pretty printed version of members in a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s member verbs.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
pprint_subclasses(vnclass, indent='')[source]

Returns pretty printed version of subclasses of VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s subclasses.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
pprint_themroles(vnclass, indent='')[source]

Returns pretty printed version of thematic roles in a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s thematic roles.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
shortid(longid)[source]

Returns shortid of a VerbNet class

Given a long VerbNet class identifier (eg ‘confess-37.10’), map it to a short id (eg ‘37.10’). If longid is already a short id, then return it as-is.

subclasses(vnclass)[source]

Returns subclass ids, if any exist

Given a VerbNet class, this method returns subclass ids (if they exist) in a list of strings.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
Returns:list of subclasses
themroles(vnclass)[source]

Returns thematic roles participating in a VerbNet class

Members returned as part of roles are- 1) Type 2) Modifiers

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
Returns:themroles: A list of thematic roles in the VerbNet class
vnclass(fileid_or_classid)[source]

Returns VerbNet class ElementTree

Return an ElementTree containing the xml for the specified VerbNet class.

Parameters:fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as 'put-9.1.xml'), or a VerbNet class identifier (such as 'put-9.1') or a short VerbNet class identifier (such as '9.1').
wordnetids(vnclass=None)[source]

Return a list of all wordnet identifiers that appear in any class, or in classid if specified.

nltk.corpus.reader.wordlist module

class nltk.corpus.reader.wordlist.MWAPPDBCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015):

The original source of the full PPDB corpus can be found on http://www.cis.upenn.edu/~ccb/ppdb/

Returns:a list of tuples of similar lexical terms.
entries(fileids='ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs')[source]
Returns:a tuple of synonym word pairs.
mwa_ppdb_xxxl_file = 'ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs'
class nltk.corpus.reader.wordlist.NonbreakingPrefixesCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

This is a class to read the nonbreaking prefixes textfiles from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses word tokenizer.

available_langs = {'ca': 'ca', 'catalan': 'ca', 'cs': 'cs', 'czech': 'cs', 'de': 'de', 'dutch': 'nl', 'el': 'el', 'en': 'en', 'english': 'en', 'es': 'es', 'fi': 'fi', 'finnish': 'fi', 'fr': 'fr', 'french': 'fr', 'german': 'de', 'greek': 'el', 'hu': 'hu', 'hungarian': 'hu', 'icelandic': 'is', 'is': 'is', 'it': 'it', 'italian': 'it', 'latvian': 'lv', 'lv': 'lv', 'nl': 'nl', 'pl': 'pl', 'polish': 'pl', 'portuguese': 'pt', 'pt': 'pt', 'ro': 'ro', 'romanian': 'ro', 'ru': 'ru', 'russian': 'ru', 'sk': 'sk', 'sl': 'sl', 'slovak': 'sk', 'slovenian': 'sl', 'spanish': 'es', 'sv': 'sv', 'swedish': 'sv', 'ta': 'ta', 'tamil': 'ta'}
words(lang=None, fileids=None, ignore_lines_startswith='#')[source]

This method returns a list of nonbreaking prefixes for the specified language(s).

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')[:10] == [u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J']
True
>>> nbp.words('ta')[:5] == [u'அ', u'ஆ', u'இ', u'ஈ', u'உ']
True
Returns:a list of words for the specified language(s).
class nltk.corpus.reader.wordlist.SwadeshCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

entries(fileids=None)[source]
Returns:a tuple of words for the specified fileids.
class nltk.corpus.reader.wordlist.UnicharsCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

This class is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm

available_categories = ['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
chars(category=None, fileids=None)[source]

This method returns a list of characters from the Perl Unicode Properties. They are very useful when porting Perl tokenizers to Python.

>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')[:5] == [u'(', u'[', u'{', u'༺', u'༼']
True
>>> pup.chars('Currency_Symbol')[:5] == [u'$', u'¢', u'£', u'¤', u'¥']
True
>>> pup.available_categories
['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
Returns:a list of characters given the specific unicode character category
class nltk.corpus.reader.wordlist.WordListCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.
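For example, the ‘words’ corpus distributed with NLTK is read with this class; the fileids in the comment are assumptions:

from nltk.corpus import words

words.fileids()           # e.g. ['en', 'en-basic']
len(words.words('en'))    # number of entries in the English word list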

raw(fileids=None)[source]
words(fileids=None, ignore_lines_startswith='\n')[source]

nltk.corpus.reader.wordnet module

An NLTK interface for WordNet

WordNet is a lexical database of English. Using synsets, it helps find conceptual relationships between words, such as hypernyms, hyponyms, synonyms, and antonyms.

For details about WordNet see: http://wordnet.princeton.edu/

This module also allows you to find lemmas in languages other than English from the Open Multilingual Wordnet http://compling.hss.ntu.edu.sg/omw/

class nltk.corpus.reader.wordnet.Lemma(wordnet_corpus_reader, synset, name, lexname_index, lex_id, syntactic_marker)[source]

Bases: nltk.corpus.reader.wordnet._WordNetObject

The lexical entry for a single morphological form of a sense-disambiguated word.

Create a Lemma from a “<word>.<pos>.<number>.<lemma>” string where:
  • <word> is the morphological stem identifying the synset
  • <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
  • <number> is the sense number, counting from 0
  • <lemma> is the morphological form of interest

Note that <word> and <lemma> can be different, e.g. the Synset ‘salt.n.03’ has the Lemmas ‘salt.n.03.salt’, ‘salt.n.03.saltiness’ and ‘salt.n.03.salinity’.

Lemma attributes, accessible via methods with the same name:

  • name: The canonical name of this lemma.
  • synset: The synset that this lemma belongs to.
  • syntactic_marker: For adjectives, the WordNet string identifying the syntactic position relative to the modified noun. See: http://wordnet.princeton.edu/man/wninput.5WN.html#sect10 For all other parts of speech, this attribute is None.
  • count: The frequency of this lemma in wordnet.

Lemma methods:

Lemmas have the following methods for retrieving related Lemmas. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Lemmas:

  • antonyms
  • hypernyms, instance_hypernyms
  • hyponyms, instance_hyponyms
  • member_holonyms, substance_holonyms, part_holonyms
  • member_meronyms, substance_meronyms, part_meronyms
  • topic_domains, region_domains, usage_domains
  • attributes
  • derivationally_related_forms
  • entailments
  • causes
  • also_sees
  • verb_groups
  • similar_tos
  • pertainyms
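For example (a minimal sketch; antonymy is a lexical relation, so it is queried on a Lemma rather than on a Synset):

from nltk.corpus import wordnet as wn

good = wn.synset('good.a.01').lemmas()[0]   # Lemma('good.a.01.good')
good.name(), good.count()                   # canonical name and frequency count
good.antonyms()                             # e.g. [Lemma('bad.a.01.bad')]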
antonyms()[source]
count()[source]

Return the frequency count for this Lemma

frame_ids()[source]
frame_strings()[source]
key()[source]
lang()[source]
name()[source]
pertainyms()[source]
synset()[source]
syntactic_marker()[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.wordnet.Synset(wordnet_corpus_reader)[source]

Bases: nltk.corpus.reader.wordnet._WordNetObject

Create a Synset from a “<lemma>.<pos>.<number>” string where:
  • <lemma> is the word’s morphological stem
  • <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
  • <number> is the sense number, counting from 0

Synset attributes, accessible via methods with the same name:

  • name: The canonical name of this synset, formed using the first lemma of this synset. Note that this may be different from the name passed to the constructor if that string used a different lemma to identify the synset.
  • pos: The synset’s part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB.
  • lemmas: A list of the Lemma objects for this synset.
  • definition: The definition for this synset.
  • examples: A list of example strings for this synset.
  • offset: The offset in the WordNet dict file of this synset.
  • lexname: The name of the lexicographer file containing this synset.

Synset methods:

Synsets have the following methods for retrieving related Synsets. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Synsets.

  • hypernyms, instance_hypernyms
  • hyponyms, instance_hyponyms
  • member_holonyms, substance_holonyms, part_holonyms
  • member_meronyms, substance_meronyms, part_meronyms
  • attributes
  • entailments
  • causes
  • also_sees
  • verb_groups
  • similar_tos

Additionally, Synsets support the following methods specific to the hypernym relation:

  • root_hypernyms
  • common_hypernyms
  • lowest_common_hypernyms

Note that Synsets do not support the following relations because these are defined by WordNet as lexical relations:

  • antonyms
  • derivationally_related_forms
  • pertainyms
closure(rel, depth=-1)[source]

Return the transitive closure of source under the rel relationship, breadth-first

>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:s.hypernyms()
>>> list(dog.closure(hyp))
[Synset('canine.n.02'), Synset('domestic_animal.n.01'),
Synset('carnivore.n.01'), Synset('animal.n.01'),
Synset('placental.n.01'), Synset('organism.n.01'),
Synset('mammal.n.01'), Synset('living_thing.n.01'),
Synset('vertebrate.n.01'), Synset('whole.n.02'),
Synset('chordate.n.01'), Synset('object.n.01'),
Synset('physical_entity.n.01'), Synset('entity.n.01')]
common_hypernyms(other)[source]

Find all synsets that are hypernyms of this synset and the other synset.

Parameters:other (Synset) – other input synset.
Returns:The synsets that are hypernyms of both synsets.
definition()[source]
examples()[source]
frame_ids()[source]
hypernym_distances(distance=0, simulate_root=False)[source]

Get the path(s) from this synset to the root, counting the distance of each node from the initial node on the way. A set of (synset, distance) tuples is returned.

Parameters:distance (int) – the distance (number of edges) from this hypernym to the original hypernym Synset on which this method was called.
Returns:A set of (Synset, int) tuples where each Synset is a hypernym of the first Synset.
hypernym_paths()[source]

Get the path(s) from this synset to the root, where each path is a list of the synset nodes traversed on the way to the root.

Returns:A list of lists, where each list gives the node sequence connecting the initial Synset node and a root node.
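For example (a sketch; the exact paths depend on the WordNet version):

>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> paths = dog.hypernym_paths()
>>> all(path[-1] == dog for path in paths)   # every path ends at this synset
True
>>> dog.root_hypernyms()                     # and begins at a root hypernym
[Synset('entity.n.01')]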
jcn_similarity(other, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.

lch_similarity(other, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

lemma_names(lang='eng')[source]

Return all the lemma_names associated with the synset

lemmas(lang='eng')[source]

Return all the lemma objects associated with the synset

lexname()[source]
lin_similarity(other, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.

lowest_common_hypernyms(other, simulate_root=False, use_min_depth=False)[source]

Get a list of the lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False, this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned; if there are multiple such synsets at the same depth, they are all returned.

However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

By setting the use_min_depth flag to True, the behavior of NLTK2 can be preserved. This was changed in NLTK3 to give more accurate results in a small set of cases, generally with synsets concerning people. (eg: ‘chef.n.01’, ‘fireman.n.01’, etc.)

This method is an implementation of Ted Pedersen’s “Lowest Common Subsumer” method from the Perl Wordnet module. It can return either “self” or “other” if they are a hypernym of the other.

Parameters:
  • other (Synset) – other input synset
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (False by default) creates a fake root that connects all the taxonomies. Set it to True to enable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will need to be added for nouns as well.
  • use_min_depth (bool) – This setting mimics the older (v2) behavior of NLTK wordnet. If True, the min_depth function will be used to calculate the lowest common hypernyms. This is known to give strange results for some synset pairs (e.g. ‘chef.n.01’, ‘fireman.n.01’) but is retained for backwards compatibility.
Returns:

The synsets that are the lowest common hypernyms of both synsets
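For example (result shown for WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('cat.n.01'))
[Synset('carnivore.n.01')]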

max_depth()[source]
Returns:The length of the longest hypernym path from this synset to the root.

min_depth()[source]
Returns:The length of the shortest hypernym path from this synset to the root.

name()[source]
offset()[source]
path_similarity(other, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
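For example (scores shown for WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
>>> dog.path_similarity(cat)
0.2
>>> dog.path_similarity(dog)
1.0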

pos()[source]
res_similarity(other, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

root_hypernyms()[source]

Get the topmost hypernyms of this synset in WordNet.

shortest_path_distance(other, simulate_root=False)[source]

Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, None is returned. If a node is compared with itself 0 is returned.

Parameters:other (Synset) – The Synset to which the shortest path will be found.
Returns:The number of edges in the shortest path connecting the two nodes, or None if no path exists.
tree(rel, depth=-1, cut_mark=None)[source]
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:s.hypernyms()
>>> from pprint import pprint
>>> pprint(dog.tree(hyp))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'),
  [Synset('animal.n.01'),
   [Synset('organism.n.01'),
    [Synset('living_thing.n.01'),
     [Synset('whole.n.02'),
      [Synset('object.n.01'),
       [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]]]
unicode_repr()

Return repr(self).

wup_similarity(other, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, though not always for nouns.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

class nltk.corpus.reader.wordnet.WordNetCorpusReader(root, omw_reader)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader used to access wordnet or its variants.

ADJ = 'a'
ADJ_SAT = 's'
ADV = 'r'
MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}
NOUN = 'n'
VERB = 'v'
all_lemma_names(pos=None, lang='eng')[source]

Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

all_synsets(pos=None)[source]

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.
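A sketch of iterating over the whole database (the counts in the comments are approximate and version-dependent):

>>> from nltk.corpus import wordnet as wn
>>> noun_synsets = list(wn.all_synsets(wn.NOUN))        # roughly 82,000 synsets in WordNet 3.0
>>> verb_lemmas = set(wn.all_lemma_names(pos=wn.VERB))  # all distinct verb lemma names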

citation(lang='omw')[source]

Return the contents of the citation.bib file (for omw). Use lang=lang to get the citation for an individual language.

custom_lemmas(tab_file, lang)[source]

Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.

See the “Tab files” section at http://compling.hss.ntu.edu.sg/omw/ for documentation on the Multilingual WordNet tab file format.

Parameters:
  • tab_file – Tab file as a file or file-like object
  • lang (str) – ISO 639-3 code of the language of the tab file
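A hedged sketch of loading a custom tab file; my_lang.tab and the language code 'xyz' are purely illustrative placeholders:

>>> from nltk.corpus import wordnet as wn
>>> with open('my_lang.tab', encoding='utf8') as fin:   # hypothetical tab file
...     wn.custom_lemmas(fin, lang='xyz')                # 'xyz' stands in for an ISO 639-3 code
>>> synsets = wn.synsets('some_lemma', lang='xyz')       # hypothetical lemma in that language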

get_version()[source]
ic(corpus, weight_senses_equally=False, smoothing=1.0)[source]

Creates an information content lookup dictionary from a corpus.

Parameters:
  • corpus (CorpusReader) – The corpus from which we create an information content dictionary.
  • weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
  • smoothing (float) – How much do we smooth synset counts (default is 1.0)
Returns:
An information content dictionary
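For example, an IC dictionary can be built from a corpus reader and then passed to the IC-based similarity measures (a sketch; scores depend on the corpus used):

>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)
>>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
>>> score = dog.res_similarity(cat, genesis_ic)   # Resnik similarity under this IC dictionary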

jcn_similarity(synset1, synset2, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.

langs()[source]

Return a list of languages supported by the Multilingual Wordnet.

lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

lemma(name, lang='eng')[source]

Return the Lemma object that matches the name.

lemma_count(lemma)[source]

Return the frequency count for this Lemma

lemma_from_key(key)[source]
lemmas(lemma, pos=None, lang='eng')[source]

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

license(lang='eng')[source]

Return the contents of LICENSE (for omw). Use lang=lang to get the license for an individual language.

lin_similarity(synset1, synset2, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.

morphy(form, pos=None, check_exceptions=True)[source]

Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
of2ss(of)[source]

Take an id and return the corresponding synsets.

path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.

readme(lang='omw')[source]

Return the contents of README (for omw). Use lang=lang to get the readme for an individual language.

res_similarity(synset1, synset2, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

ss2of(ss, lang=None)[source]

Return the ID of the synset.

synset(name)[source]
synset_from_pos_and_offset(pos, offset)[source]
synset_from_sense_key(sense_key)[source]

Retrieves synset based on a given sense_key. Sense keys can be obtained from lemma.key()

From https://wordnet.princeton.edu/wordnet/man/senseidx.5WN.html: A sense_key is represented as:

lemma % lex_sense (e.g. ‘dog%1:18:01::’)
where lex_sense is encoded as:
ss_type:lex_filenum:lex_id:head_word:head_id

  • lemma: ASCII text of word/collocation, in lower case
  • ss_type: synset type for the sense (1 digit int). The synset type is encoded as follows: 1 NOUN, 2 VERB, 3 ADJECTIVE, 4 ADVERB, 5 ADJECTIVE SATELLITE
  • lex_filenum: name of lexicographer file containing the synset for the sense (2 digit int)
  • lex_id: when paired with lemma, uniquely identifies a sense in the lexicographer file (2 digit int)
  • head_word: lemma of the first word in the satellite’s head synset. Only used if sense is in an adjective satellite synset
  • head_id: uniquely identifies sense in a lexicographer file when paired with head_word. Only used if head_word is present (2 digit int)
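A round-trip sketch using a key obtained from a Lemma (assuming key() and this lookup agree in the installed WordNet version):

>>> from nltk.corpus import wordnet as wn
>>> key = wn.synset('dog.n.01').lemmas()[0].key()
>>> ss = wn.synset_from_sense_key(key)   # Synset('dog.n.01')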
synsets(lemma, pos=None, lang='eng', check_exceptions=True)[source]

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
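For example (results shown for WordNet 3.0):

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog', pos=wn.VERB)
[Synset('chase.v.01')]
>>> len(wn.synsets('dog'))   # noun and verb senses together
8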

words(lang='eng')[source]

Return lemmas of the given language as a list of words.

wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, though not always for nouns.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

exception nltk.corpus.reader.wordnet.WordNetError[source]

Bases: Exception

An exception class for wordnet-related errors.

class nltk.corpus.reader.wordnet.WordNetICCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for the WordNet information content corpus.

ic(icfile)[source]

Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

Parameters:icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”)
Returns:An information content dictionary
nltk.corpus.reader.wordnet.information_content(synset, ic)[source]
nltk.corpus.reader.wordnet.jcn_similarity(synset1, synset2, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.

nltk.corpus.reader.wordnet.lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

nltk.corpus.reader.wordnet.lin_similarity(synset1, synset2, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.

nltk.corpus.reader.wordnet.path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.

nltk.corpus.reader.wordnet.res_similarity(synset1, synset2, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

nltk.corpus.reader.wordnet.teardown_module(module=None)[source]
nltk.corpus.reader.wordnet.wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, though not always for nouns.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • synset1 (Synset) – The first of the two Synsets being compared.
  • synset2 (Synset) – The second of the two Synsets being compared.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

nltk.corpus.reader.xmldocs module

Corpus reader for corpora whose documents are xml files.

(note – not named ‘xml’ to avoid conflicting with the standard xml package)

class nltk.corpus.reader.xmldocs.XMLCorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for corpora whose documents are xml files.

Note that the XMLCorpusReader constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.

raw(fileids=None)[source]
words(fileid=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
xml(fileid=None)[source]
class nltk.corpus.reader.xmldocs.XMLCorpusView(fileid, tagspec, elt_handler=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A corpus view that selects out specified elements from an XML file, and provides a flat list-like interface for accessing them. (Note: XMLCorpusView is not used by XMLCorpusReader itself, but may be used by subclasses of XMLCorpusReader.)

Every XML corpus view has a “tag specification”, indicating what XML elements should be included in the view; and each (non-nested) element that matches this specification corresponds to one item in the view. Tag specifications are regular expressions over tag paths, where a tag path is a list of element tag names, separated by ‘/’, indicating the ancestry of the element. Some examples:

  • 'foo': A top-level element whose tag is foo.
  • 'foo/bar': An element whose tag is bar and whose parent is a top-level element whose tag is foo.
  • '.*/foo': An element whose tag is foo, appearing anywhere in the xml tree.
  • '.*/(foo|bar)': An element whose tag is foo or bar, appearing anywhere in the xml tree.

The view items are generated from the selected XML elements via the method handle_elt(). By default, this method returns the element as-is (i.e., as an ElementTree object); but it can be overridden, either via subclassing or via the elt_handler constructor parameter.
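A minimal sketch, assuming a hypothetical file example.xml whose <sent> elements should become the view items:

>>> from nltk.corpus.reader.xmldocs import XMLCorpusView
>>> view = XMLCorpusView('example.xml', '.*/sent')   # hypothetical file and tag specification
>>> for elt in view:                                 # each item is an ElementTree element by default
...     print(elt.tag)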

handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
read_block(stream, tagspec=None, elt_handler=None)[source]

Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.

nltk.corpus.reader.ycoe module

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. The corpus is distributed by the Oxford Text Archive: http://www.ota.ahds.ac.uk/ It is not included with NLTK.

The YCOE corpus is divided into 100 files, each representing an Old English prose text. Tags used within each text comply with the YCOE standard: http://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm

class nltk.corpus.reader.ycoe.YCOECorpusReader(root, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.

documents(fileids=None)[source]

Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.

fileids(documents=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that store the given document(s) if specified.

paras(documents=None)[source]
parsed_sents(documents=None)[source]
sents(documents=None)[source]
tagged_paras(documents=None)[source]
tagged_sents(documents=None)[source]
tagged_words(documents=None)[source]
words(documents=None)[source]
class nltk.corpus.reader.ycoe.YCOEParseCorpusReader(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Specialized version of the standard bracket parse corpus reader that strips out (CODE …) and (ID …) nodes.

class nltk.corpus.reader.ycoe.YCOETaggedCorpusReader(root, items, encoding='utf8')[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

nltk.corpus.reader.ycoe.documents = {'coadrian.o34': 'Adrian and Ritheus', 'coaelhom.o3': 'Ælfric, Supplemental Homilies', 'coaelive.o3': "Ælfric's Lives of Saints", 'coalcuin': 'Alcuin De virtutibus et vitiis', 'coalex.o23': "Alexander's Letter to Aristotle", 'coapollo.o3': 'Apollonius of Tyre', 'coaugust': 'Augustine', 'cobede.o2': "Bede's History of the English Church", 'cobenrul.o3': 'Benedictine Rule', 'coblick.o23': 'Blickling Homilies', 'coboeth.o2': "Boethius' Consolation of Philosophy", 'cobyrhtf.o3': "Byrhtferth's Manual", 'cocanedgD': 'Canons of Edgar (D)', 'cocanedgX': 'Canons of Edgar (X)', 'cocathom1.o3': "Ælfric's Catholic Homilies I", 'cocathom2.o3': "Ælfric's Catholic Homilies II", 'cochad.o24': 'Saint Chad', 'cochdrul': 'Chrodegang of Metz, Rule', 'cochristoph': 'Saint Christopher', 'cochronA.o23': 'Anglo-Saxon Chronicle A', 'cochronC': 'Anglo-Saxon Chronicle C', 'cochronD': 'Anglo-Saxon Chronicle D', 'cochronE.o34': 'Anglo-Saxon Chronicle E', 'cocura.o2': 'Cura Pastoralis', 'cocuraC': 'Cura Pastoralis (Cotton)', 'codicts.o34': 'Dicts of Cato', 'codocu1.o1': 'Documents 1 (O1)', 'codocu2.o12': 'Documents 2 (O1/O2)', 'codocu2.o2': 'Documents 2 (O2)', 'codocu3.o23': 'Documents 3 (O2/O3)', 'codocu3.o3': 'Documents 3 (O3)', 'codocu4.o24': 'Documents 4 (O2/O4)', 'coeluc1': 'Honorius of Autun, Elucidarium 1', 'coeluc2': 'Honorius of Autun, Elucidarium 1', 'coepigen.o3': "Ælfric's Epilogue to Genesis", 'coeuphr': 'Saint Euphrosyne', 'coeust': 'Saint Eustace and his companions', 'coexodusP': 'Exodus (P)', 'cogenesiC': 'Genesis (C)', 'cogregdC.o24': "Gregory's Dialogues (C)", 'cogregdH.o23': "Gregory's Dialogues (H)", 'coherbar': 'Pseudo-Apuleius, Herbarium', 'coinspolD.o34': "Wulfstan's Institute of Polity (D)", 'coinspolX': "Wulfstan's Institute of Polity (X)", 'cojames': 'Saint James', 'colacnu.o23': 'Lacnunga', 'colaece.o2': 'Leechdoms', 'colaw1cn.o3': 'Laws, Cnut I', 'colaw2cn.o3': 'Laws, Cnut II', 'colaw5atr.o3': 'Laws, Æthelred V', 'colaw6atr.o3': 'Laws, Æthelred VI', 'colawaf.o2': 'Laws, Alfred', 'colawafint.o2': "Alfred's Introduction to Laws", 'colawger.o34': 'Laws, Gerefa', 'colawine.ox2': 'Laws, Ine', 'colawnorthu.o3': 'Northumbra Preosta Lagu', 'colawwllad.o4': 'Laws, William I, Lad', 'coleofri.o4': 'Leofric', 'colsigef.o3': "Ælfric's Letter to Sigefyrth", 'colsigewB': "Ælfric's Letter to Sigeweard (B)", 'colsigewZ.o34': "Ælfric's Letter to Sigeweard (Z)", 'colwgeat': "Ælfric's Letter to Wulfgeat", 'colwsigeT': "Ælfric's Letter to Wulfsige (T)", 'colwsigeXa.o34': "Ælfric's Letter to Wulfsige (Xa)", 'colwstan1.o3': "Ælfric's Letter to Wulfstan I", 'colwstan2.o3': "Ælfric's Letter to Wulfstan II", 'comargaC.o34': 'Saint Margaret (C)', 'comargaT': 'Saint Margaret (T)', 'comart1': 'Martyrology, I', 'comart2': 'Martyrology, II', 'comart3.o23': 'Martyrology, III', 'comarvel.o23': 'Marvels of the East', 'comary': 'Mary of Egypt', 'coneot': 'Saint Neot', 'conicodA': 'Gospel of Nicodemus (A)', 'conicodC': 'Gospel of Nicodemus (C)', 'conicodD': 'Gospel of Nicodemus (D)', 'conicodE': 'Gospel of Nicodemus (E)', 'coorosiu.o2': 'Orosius', 'cootest.o3': 'Heptateuch', 'coprefcath1.o3': "Ælfric's Preface to Catholic Homilies I", 'coprefcath2.o3': "Ælfric's Preface to Catholic Homilies II", 'coprefcura.o2': 'Preface to the Cura Pastoralis', 'coprefgen.o3': "Ælfric's Preface to Genesis", 'copreflives.o3': "Ælfric's Preface to Lives of Saints", 'coprefsolilo': "Preface to Augustine's Soliloquies", 'coquadru.o23': 'Pseudo-Apuleius, Medicina de quadrupedibus', 'corood': 'History of the Holy Rood-Tree', 'cosevensl': 'Seven Sleepers', 'cosolilo': "St. Augustine's Soliloquies", 'cosolsat1.o4': 'Solomon and Saturn I', 'cosolsat2': 'Solomon and Saturn II', 'cotempo.o3': "Ælfric's De Temporibus Anni", 'coverhom': 'Vercelli Homilies', 'coverhomE': 'Vercelli Homilies (E)', 'coverhomL': 'Vercelli Homilies (L)', 'covinceB': 'Saint Vincent (Bodley 343)', 'covinsal': 'Vindicta Salvatoris', 'cowsgosp.o3': 'West-Saxon Gospels', 'cowulf.o34': "Wulfstan's Homilies"}

A mapping of all document identifiers to their titles in ycoe.

Module contents

NLTK corpus readers. The modules in this package provide functions that can be used to read corpus fileids in a variety of formats. These functions can be used to read both the corpus fileids that are distributed in the NLTK corpus package, and corpus fileids that are part of external corpora.

Corpus Reader Functions

Each corpus module defines one or more “corpus reader functions”, which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:

  • If item is one of the unique identifiers listed in the corpus module’s items variable, then the corresponding document will be loaded from the NLTK corpus package.
  • If item is a fileid, then that file will be read.

Additionally, corpus reader functions can be given lists of item names; in which case, they will return a concatenation of the corresponding documents.

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

  • words(): list of str
  • sents(): list of (list of str)
  • paras(): list of (list of (list of str))
  • tagged_words(): list of (str,str) tuple
  • tagged_sents(): list of (list of (str,str))
  • tagged_paras(): list of (list of (list of (str,str)))
  • chunked_sents(): list of (Tree with (str,str) leaves)
  • parsed_sents(): list of (Tree with str leaves)
  • parsed_paras(): list of (list of (Tree with str leaves))
  • xml(): A single xml ElementTree
  • raw(): unprocessed corpus contents

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...
class nltk.corpus.reader.CorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: object

A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its file identifier, which is the relative path to the file from the root directory.

A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as words() (for a list of words) and parsed_sents() (for a list of parsed sentences). Called with no arguments, these methods will return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such as fileids or categories, which can be used to select which portion of the corpus should be returned.

abspath(fileid)[source]

Return the absolute path for the given file.

Parameters:fileid (str) – The file identifier for the file whose path should be returned.
Return type:PathPointer
abspaths(fileids=None, include_encoding=False, include_fileid=False)[source]

Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.

Parameters:
  • fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.
  • include_encoding – If true, then return a list of (path_pointer, encoding) tuples.
Return type:

list(PathPointer)

citation()[source]

Return the contents of the corpus citation.bib file, if it exists.

encoding(file)[source]

Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.

ensure_loaded()[source]

Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).

fileids()[source]

Return a list of file identifiers for the fileids that make up this corpus.

license()[source]

Return the contents of the corpus LICENSE file, if it exists.

open(file)[source]

Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.

Parameters:file – The file identifier of the file to read.
readme()[source]

Return the contents of the corpus README file, if it exists.

root

The directory where this corpus is stored.

Type:PathPointer
unicode_repr()

Return repr(self).
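A brief sketch of these generic methods, using the Brown Corpus reader (the returned paths depend on the local installation):

>>> from nltk.corpus import brown
>>> brown.fileids()[:3]
['ca01', 'ca02', 'ca03']
>>> path = brown.abspath('ca01')   # an absolute PathPointer for that file
>>> stream = brown.open('ca01')    # a decoding stream over the file's contents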

class nltk.corpus.reader.CategorizedCorpusReader(kwargs)[source]

Bases: object

A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.

Subclasses are expected to:

  • Call __init__() to set up the mapping.
  • Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.
categories(fileids=None)[source]

Return a list of the categories that are defined for this corpus, or for the file(s) if it is given.

fileids(categories=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that make up the given category(s) if specified.
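For example, the Brown Corpus reader mixes in this class (values shown for the NLTK distribution of Brown):

>>> from nltk.corpus import brown
>>> brown.categories()[:3]
['adventure', 'belles_lettres', 'editorial']
>>> brown.fileids(categories='news')[:2]
['ca01', 'ca02']
>>> brown.words(categories='news')[:3]
['The', 'Fulton', 'County']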

class nltk.corpus.reader.PlaintextCorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.

CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
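A minimal sketch of pointing this reader at a hypothetical directory of plain .txt files:

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> reader = PlaintextCorpusReader('/path/to/texts', r'.*\.txt')   # hypothetical corpus root
>>> files = reader.fileids()
>>> words = reader.words(files[0]) if files else []                # tokens from the first document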
nltk.corpus.reader.find_corpus_fileids(root, regexp)[source]
class nltk.corpus.reader.TaggedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=RegexpTokenizer(pattern='n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator. I.e., words should have the form:

word1/tag1 word2/tag2 word3/tag3 ...

But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.
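A minimal sketch, assuming a hypothetical directory of files in the word/tag format shown above:

>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('/path/to/corpus', r'.*\.pos')   # hypothetical root and fileid pattern
>>> tagged = reader.tagged_words()     # list of (word, tag) tuples
>>> sentences = reader.tagged_sents()  # sentences as lists of (word, tag) tuples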

paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.CMUDictCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

dict()[source]
Returns:the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.

entries()[source]
Returns:the cmudict lexicon as a list of entries containing (word, transcriptions) tuples.

raw()[source]
Returns:the cmudict lexicon as a raw string.
words()[source]
Returns:a list of all words defined in the cmudict lexicon.
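For example (pronunciations shown for the NLTK distribution of cmudict):

>>> from nltk.corpus import cmudict
>>> prondict = cmudict.dict()
>>> prondict['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]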
class nltk.corpus.reader.ConllChunkCorpusReader(root, fileids, chunk_types, encoding='utf8', tagset=None, separator=None)[source]

Bases: nltk.corpus.reader.conll.ConllCorpusReader

A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.

class nltk.corpus.reader.WordListCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.

raw(fileids=None)[source]
words(fileids=None, ignore_lines_startswith='\n')[source]
class nltk.corpus.reader.PPAttachmentCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

sentence_id verb noun1 preposition noun2 attachment

attachments(fileids)[source]
raw(fileids=None)[source]
tuples(fileids)[source]
class nltk.corpus.reader.SensevalCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

instances(fileids=None)[source]
raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
class nltk.corpus.reader.IEERCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

docs(fileids=None)[source]
parsed_docs(fileids=None)[source]
raw(fileids=None)[source]
class nltk.corpus.reader.ChunkedCorpusReader(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.

chunked_paras(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(list(Tree))
chunked_sents(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(Tree)
chunked_words(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.
Return type:list(tuple(str,str) and Tree)
paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.SinicaTreebankCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for the sinica treebank.

class nltk.corpus.reader.BracketParseCorpusReader(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.

class nltk.corpus.reader.IndianCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.

raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.ToolboxCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

entries(fileids, **kwargs)[source]
fields(fileids, strip=True, unwrap=True, encoding='utf8', errors='strict', unicode_fields=None)[source]
raw(fileids)[source]
words(fileids, key='lx')[source]
xml(fileids, key=None)[source]
class nltk.corpus.reader.TimitCorpusReader(root, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:

  • timitdic.txt: dictionary of standard transcriptions
  • spkrinfo.txt: table of speaker information

In addition, the root directory should contain one subdirectory for each speaker, containing three files for each utterance:

  • <utterance-id>.txt: text content of utterances
  • <utterance-id>.wrd: tokenized text content of utterances
  • <utterance-id>.phn: phonetic transcription of utterances
  • <utterance-id>.wav: utterance sound file
audiodata(utterance, start=0, end=None)[source]
fileids(filetype=None)[source]

Return a list of file identifiers for the files that make up this corpus.

Parameters:filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.
phone_times(utterances=None)[source]

offset is represented as a number of 16kHz samples!

phone_trees(utterances=None)[source]
phones(utterances=None)[source]
play(utterance, start=0, end=None)[source]

Play the given audio sample.

Parameters:utterance – The utterance id of the sample to play
sent_times(utterances=None)[source]
sentid(utterance)[source]
sents(utterances=None)[source]
spkrid(utterance)[source]
spkrinfo(speaker)[source]
Returns:A dictionary of information about the given speaker.
spkrutteranceids(speaker)[source]
Returns:A list of all utterances associated with a given speaker.

transcription_dict()[source]
Returns:A dictionary giving the ‘standard’ transcription for each word.

utterance(spkrid, sentid)[source]
utteranceids(dialect=None, sex=None, spkrid=None, sent_type=None, sentid=None)[source]
Returns:A list of the utterance identifiers for all utterances in this corpus, or for the given speaker, dialect region, gender, sentence type, or sentence number, if specified.

wav(utterance, start=0, end=None)[source]
word_times(utterances=None)[source]
words(utterances=None)[source]
class nltk.corpus.reader.YCOECorpusReader(root, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.

documents(fileids=None)[source]

Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.

fileids(documents=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that store the given document(s) if specified.

paras(documents=None)[source]
parsed_sents(documents=None)[source]
sents(documents=None)[source]
tagged_paras(documents=None)[source]
tagged_sents(documents=None)[source]
tagged_words(documents=None)[source]
words(documents=None)[source]
class nltk.corpus.reader.MacMorphoCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.

class nltk.corpus.reader.SyntaxCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:

  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of list of words.
  • _tag, which takes a block and returns a list of list of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
parsed_sents(fileids=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.AlpinoCorpusReader(root, encoding='ISO-8859-1', tagset=None)[source]

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Reader for the Alpino Dutch Treebank. This corpus has an embedded lexical breakdown structure that is read by _parse. Unfortunately, this puts punctuation and some other words out of sentence order in the xml element tree, which is a problem for the _tag and _word methods. Therefore _tag and _word are overridden to pass a new, non-default ‘ordered’ parameter to the overridden _normalize function; the _parse function can then remain untouched.

class nltk.corpus.reader.RTECorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for corpora in RTE challenges.

This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.

pairs(fileids)[source]

Build a list of RTEPairs from a RTE corpus.

Parameters:fileids – a list of RTE corpus fileids
Type:list
Return type:list(RTEPair)
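
A short usage sketch, assuming the data has been installed with nltk.download('rte'); the fileid follows the naming used in the distributed corpus:

>>> from nltk.corpus import rte
>>> pairs = rte.pairs(['rte1_dev.xml'])
>>> pairs[0].text, pairs[0].hyp, pairs[0].value   # premise, hypothesis, gold label
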
class nltk.corpus.reader.StringCategoryCorpusReader(root, fileids, delimiter=' ', encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
tuples(fileids=None)[source]
class nltk.corpus.reader.EuroparlCorpusReader[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from PlaintextCorpusReader except that:

  • Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.
  • For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
  • There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
  • The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
chapters(fileids=None)[source]
Returns:the given file(s) as a list of chapters, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
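
A brief usage sketch, assuming the sample data has been installed with nltk.download('europarl_raw'):

>>> from nltk.corpus import europarl_raw
>>> english = europarl_raw.english      # an EuroparlCorpusReader
>>> english.chapters()[0][0]            # first sentence of the first chapter
>>> english.sents()[0]                  # sentence access works as usual
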
class nltk.corpus.reader.CategorizedBracketParseCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>

paras(fileids=None, categories=None)[source]
parsed_paras(fileids=None, categories=None)[source]
parsed_sents(fileids=None, categories=None)[source]
parsed_words(fileids=None, categories=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None)[source]
tagged_paras(fileids=None, categories=None, tagset=None)[source]
tagged_sents(fileids=None, categories=None, tagset=None)[source]
tagged_words(fileids=None, categories=None, tagset=None)[source]
words(fileids=None, categories=None)[source]
class nltk.corpus.reader.CategorizedTaggedCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.tagged.TaggedCorpusReader

A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
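
The Brown corpus shipped with NLTK is loaded through this reader class; a brief sketch, assuming nltk.download('brown'):

>>> from nltk.corpus import brown
>>> brown.categories()[:3]                       # e.g. ['adventure', 'belles_lettres', 'editorial']
>>> brown.tagged_words(categories='news')[:3]    # (word, tag) pairs restricted to one category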

paras(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None, categories=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, categories=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, categories=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, categories=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.CategorizedPlaintextCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.plaintext.PlaintextCorpusReader

A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
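
The movie_reviews corpus distributed with NLTK uses this reader; a brief sketch, assuming nltk.download('movie_reviews'):

>>> from nltk.corpus import movie_reviews
>>> movie_reviews.categories()
['neg', 'pos']
>>> movie_reviews.words(categories='pos')[:5]    # words drawn from positive reviews only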

paras(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None, categories=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None, categories=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.PortugueseCategorizedPlaintextCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader

nltk.corpus.reader.tagged_treebank_para_block_reader(stream)[source]
class nltk.corpus.reader.PropbankCorpusReader(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.
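
A brief usage sketch, assuming the PropBank annotations have been installed with nltk.download('propbank'):

>>> from nltk.corpus import propbank
>>> inst = propbank.instances()[103]
>>> inst.roleset                     # e.g. 'rise.01'
>>> inst.arguments                   # ((tree pointer, argument label), ...)
>>> propbank.roleset('rise.01')      # xml description of that roleset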

instances(baseform=None)[source]
Returns:a corpus view that acts as a list of PropBankInstance objects, one for each verb instance in the corpus.

lines()[source]
Returns:a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)[source]
Returns:the xml description for the given roleset.
rolesets(baseform=None)[source]
Returns:list of xml descriptions for rolesets.
verbs()[source]
Returns:a corpus view that acts as a list of all verb lemmas in this corpus (from the verbs.txt file).

class nltk.corpus.reader.VerbnetCorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

An NLTK interface to the VerbNet verb lexicon.

From the VerbNet site: “VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), XTAG (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998).”

For details about VerbNet see: https://verbs.colorado.edu/~mpalmer/projects/verbnet.html
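
A brief usage sketch, assuming the data has been installed with nltk.download('verbnet'); the class id shown is one of the standard VerbNet classes:

>>> from nltk.corpus import verbnet as vn
>>> vn.classids(lemma='give')            # classes that list 'give' as a member
>>> vn.lemmas('give-13.1')[:5]           # member verbs of one class
>>> print(vn.pprint('give-13.1'))        # human-readable summary of the class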

classids(lemma=None, wordnetid=None, fileid=None, classid=None)[source]

Return a list of the VerbNet class identifiers. If a file identifier is specified, then return only the VerbNet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only VerbNet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified VerbNet class. If nothing is specified, return all classids within VerbNet

fileids(vnclass_ids=None)[source]

Return a list of fileids that make up this corpus. If vnclass_ids is specified, then return the fileids that make up the specified VerbNet class(es).

frames(vnclass)[source]

Given a VerbNet class, this method returns VerbNet frames

The members returned are: 1) Example 2) Description 3) Syntax 4) Semantics

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
Returns:frames - a list of frame dictionaries
lemmas(vnclass=None)[source]

Return a list of all verb lemmas that appear in any class, or in the classid if specified.

longid(shortid)[source]

Returns longid of a VerbNet class

Given a short VerbNet class identifier (eg ‘37.10’), map it to a long id (eg ‘confess-37.10’). If shortid is already a long id, then return it as-is

pprint(vnclass)[source]

Returns pretty printed version of a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

pprint_frames(vnclass, indent='')[source]

Returns pretty version of all frames in a VerbNet class

Return a string containing a pretty-printed representation of the list of frames within the VerbNet class.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
pprint_members(vnclass, indent='')[source]

Returns pretty printed version of members in a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s member verbs.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
pprint_subclasses(vnclass, indent='')[source]

Returns pretty printed version of subclasses of VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s subclasses.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
pprint_themroles(vnclass, indent='')[source]

Returns pretty printed version of thematic roles in a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s thematic roles.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
shortid(longid)[source]

Returns shortid of a VerbNet class

Given a long VerbNet class identifier (eg ‘confess-37.10’), map it to a short id (eg ‘37.10’). If longid is already a short id, then return it as-is.

subclasses(vnclass)[source]

Returns subclass ids, if any exist

Given a VerbNet class, this method returns subclass ids (if they exist) in a list of strings.

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
Returns:list of subclasses
themroles(vnclass)[source]

Returns thematic roles participating in a VerbNet class

Members returned as part of each role are: 1) Type 2) Modifiers

Parameters:vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.
Returns:themroles: A list of thematic roles in the VerbNet class
vnclass(fileid_or_classid)[source]

Returns VerbNet class ElementTree

Return an ElementTree containing the xml for the specified VerbNet class.

Parameters:fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as 'put-9.1.xml'), or a VerbNet class identifier (such as 'put-9.1') or a short VerbNet class identifier (such as '9.1').
wordnetids(vnclass=None)[source]

Return a list of all wordnet identifiers that appear in any class, or in classid if specified.

class nltk.corpus.reader.BNCCorpusReader(root, fileids, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
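
Continuing that example, a minimal access sketch (the BNC data itself must be obtained separately, as noted above):

>>> from nltk.corpus.reader import BNCCorpusReader
>>> bnc = BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
>>> bnc.words()[:10]                     # plain word tokens
>>> bnc.tagged_sents(c5=True)[0]         # first sentence with detailed C5 tags
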
sents(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_words(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
words(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
class nltk.corpus.reader.ConllCorpusReader(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.Tree'>, tagset=None, separator=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus. By default, columns are split on consecutive whitespace; with the separator argument you can instead specify a string to split on (e.g. ' ').

@todo: Add support for reading from corpora where different parallel files contain different columns.
@todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (eg words() and parsed_sents() could both share the same grid corpus view object).
@todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (eg parsed_documents()).
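
A minimal construction sketch for a hypothetical three-column (word/POS/chunk) corpus; the root path and filename are illustrative only:

>>> from nltk.corpus.reader import ConllCorpusReader
>>> reader = ConllCorpusReader('/path/to/corpus', 'train.conll',
...                            columntypes=('words', 'pos', 'chunk'),
...                            chunk_types=('NP', 'VP', 'PP'))
>>> reader.tagged_sents()[0]     # (word, tag) pairs for the first sentence
>>> reader.chunked_sents()[0]    # the same sentence as a chunk Tree
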
CHUNK = 'chunk'

column type for chunk structures

COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')

A list of all column types supported by the conll corpus reader.

IGNORE = 'ignore'

column type for column that should be ignored

NE = 'ne'

column type for named entities

POS = 'pos'

column type for part-of-speech tags

SRL = 'srl'

column type for semantic role labels

TREE = 'tree'

column type for parse trees

WORDS = 'words'

column type for words

chunked_sents(fileids=None, chunk_types=None, tagset=None)[source]
chunked_words(fileids=None, chunk_types=None, tagset=None)[source]
iob_sents(fileids=None, tagset=None)[source]
Returns:a list of lists of word/tag/IOB tuples
Return type:list(list)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
iob_words(fileids=None, tagset=None)[source]
Returns:a list of word/tag/IOB tuples
Return type:list(tuple)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
parsed_sents(fileids=None, pos_in_tree=None, tagset=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
srl_instances(fileids=None, pos_in_tree=None, flatten=True)[source]
srl_spans(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.XMLCorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for corpora whose documents are xml files.

Note that the XMLCorpusReader constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.
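
A brief usage sketch using the shakespeare corpus shipped with NLTK, which is loaded through this reader; assumes nltk.download('shakespeare'):

>>> from nltk.corpus import shakespeare
>>> play = shakespeare.xml('merchant.xml')   # ElementTree element for one play
>>> play.tag                                 # root element name
>>> shakespeare.words('merchant.xml')[:8]    # text-node tokens only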

raw(fileids=None)[source]
words(fileid=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – ie, tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
xml(fileid=None)[source]
class nltk.corpus.reader.NPSChatCorpusReader(root, fileids, wrap_etree=False, tagset=None)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

posts(fileids=None)[source]
tagged_posts(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – ie, tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
xml_posts(fileids=None)[source]
class nltk.corpus.reader.SwadeshCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

entries(fileids=None)[source]
Returns:a tuple of words for the specified fileids.
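
A brief usage sketch, assuming nltk.download('swadesh'); fileids are language codes:

>>> from nltk.corpus import swadesh
>>> swadesh.fileids()[:5]                 # available language codes
>>> swadesh.entries(['en', 'fr'])[:3]     # aligned (English, French) pairs
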
class nltk.corpus.reader.WordNetCorpusReader(root, omw_reader)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader used to access wordnet or its variants.

ADJ = 'a'
ADJ_SAT = 's'
ADV = 'r'
MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}
NOUN = 'n'
VERB = 'v'
all_lemma_names(pos=None, lang='eng')[source]

Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

all_synsets(pos=None)[source]

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

citation(lang='omw')[source]

Return the contents of the citation.bib file (for omw); use lang=lang to get the citation for an individual language.

custom_lemmas(tab_file, lang)[source]

Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.

See the “Tab files” section at http://compling.hss.ntu.edu.sg/omw/ for documentation on the Multilingual WordNet tab file format.

Parameters:
  • tab_file – Tab file as a file or file-like object
  • lang (str) – ISO 639-3 code of the language of the tab file

get_version()[source]
ic(corpus, weight_senses_equally=False, smoothing=1.0)[source]

Creates an information content lookup dictionary from a corpus.

Parameters:
  • corpus (CorpusReader) – The corpus from which we create an information content dictionary.
  • weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
  • smoothing (float) – How much do we smooth synset counts (default is 1.0)
Returns:

An information content dictionary
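
A brief sketch of building an IC dictionary from the genesis corpus, assuming nltk.download('genesis') and nltk.download('wordnet'):

>>> from nltk.corpus import wordnet as wn, genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)   # IC dictionary computed from raw text
>>> wn.res_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'), genesis_ic)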

jcn_similarity(synset1, synset2, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.

langs()[source]

return a list of languages supported by Multilingual Wordnet

lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

lemma(name, lang='eng')[source]

Return lemma object that matches the name

lemma_count(lemma)[source]

Return the frequency count for this Lemma

lemma_from_key(key)[source]
lemmas(lemma, pos=None, lang='eng')[source]

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

license(lang='eng')[source]

Return the contents of LICENSE (for omw); use lang=lang to get the license for an individual language.

lin_similarity(synset1, synset2, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.

morphy(form, pos=None, check_exceptions=True)[source]

Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
of2ss(of)[source]

take an id and return the synsets

path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
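
For example, assuming nltk.download('wordnet'); the values in the comments are those typically obtained with WordNet 3.0:

>>> from nltk.corpus import wordnet as wn
>>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
>>> wn.path_similarity(dog, cat)     # 0.2
>>> wn.path_similarity(dog, dog)     # 1.0 (identity)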

readme(lang='omw')[source]

Return the contents of README (for omw); use lang=lang to get the readme for an individual language.

res_similarity(synset1, synset2, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

ss2of(ss, lang=None)[source]

return the ID of the synset

synset(name)[source]
synset_from_pos_and_offset(pos, offset)[source]
synset_from_sense_key(sense_key)[source]

Retrieves synset based on a given sense_key. Sense keys can be obtained from lemma.key()

From https://wordnet.princeton.edu/wordnet/man/senseidx.5WN.html: A sense_key is represented as:

lemma % lex_sense (e.g. ‘dog%1:18:01::’)

where lex_sense is encoded as:

ss_type:lex_filenum:lex_id:head_word:head_id

  • lemma: ASCII text of word/collocation, in lower case
  • ss_type: synset type for the sense (1 digit int). The synset type is encoded as follows: 1 NOUN, 2 VERB, 3 ADJECTIVE, 4 ADVERB, 5 ADJECTIVE SATELLITE
  • lex_filenum: name of lexicographer file containing the synset for the sense (2 digit int)
  • lex_id: when paired with lemma, uniquely identifies a sense in the lexicographer file (2 digit int)
  • head_word: lemma of the first word in the satellite’s head synset. Only used if the sense is in an adjective satellite synset
  • head_id: uniquely identifies sense in a lexicographer file when paired with head_word. Only used if head_word is present (2 digit int)
synsets(lemma, pos=None, lang='eng', check_exceptions=True)[source]

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
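
For example, assuming nltk.download('wordnet'); the output shown follows the standard WordNet 3.0 data:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog', pos=wn.VERB)
[Synset('chase.v.01')]
>>> len(wn.synsets('dog'))            # synsets across all parts of speech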

words(lang='eng')[source]

Return lemmas of the given language as a list of words.

wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, but not always for nouns.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

class nltk.corpus.reader.WordNetICCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for the WordNet information content corpus.

ic(icfile)[source]

Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

Parameters:icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”)
Returns:An information content dictionary
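
For example, assuming nltk.download('wordnet_ic'):

>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')      # IC counts derived from the Brown corpus
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')    # IC counts derived from SemCor
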
class nltk.corpus.reader.SwitchboardCorpusReader(root, tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

discourses()[source]
tagged_discourses(tagset=False)[source]
tagged_turns(tagset=None)[source]
tagged_words(tagset=None)[source]
turns()[source]
words()[source]
class nltk.corpus.reader.DependencyCorpusReader(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object>, sent_tokenizer=RegexpTokenizer(pattern='n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), para_block_reader=<function read_blankline_block>)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

parsed_sents(fileids=None)[source]
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.NombankCorpusReader(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.
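
A brief usage sketch, assuming the NomBank data (the nombank.1.0 package) has been installed in nltk_data:

>>> from nltk.corpus import nombank
>>> inst = nombank.instances()[0]
>>> inst.roleset, inst.arguments       # roleset id and labelled arguments
>>> nombank.roleset(inst.roleset)      # xml description of that roleset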

instances(baseform=None)[source]
Returns:a corpus view that acts as a list of NombankInstance objects, one for each noun in the corpus.

lines()[source]
Returns:a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

nouns()[source]
Returns:a corpus view that acts as a list of all noun lemmas in this corpus (from the nombank.1.0.words file).

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)[source]
Returns:the xml description for the given roleset.
rolesets(baseform=None)[source]
Returns:list of xml descriptions for rolesets.
class nltk.corpus.reader.IPIPANCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.

The corpus includes information about text domain, channel and categories. You can access possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g.: fileids(channel='prasa'), fileids(categories='publicystyczny').

The reader supports the methods words, sents, paras, and their tagged versions. You can get the part of speech instead of the full tag by passing the parameter "simplify_tags=True", e.g.: tagged_sents(simplify_tags=True).

You can also get all disambiguated tags by specifying the parameter "one_tag=False", e.g.: tagged_paras(one_tag=False).

You can get all tags that were assigned by a morphological analyzer specifying parameter “disamb_only=False”, e.g. tagged_words(disamb_only=False).

The IPIPAN Corpus contains tags indicating whether there is a space between two tokens. To add special "no space" markers, specify the parameter "append_no_space=True", e.g. tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, the pair ('', 'no-space') will be inserted for tagged data, and just '' for methods without tags.

The corpus reader can also try to append spaces between words. To enable this option, specify the parameter "append_space=True", e.g. words(append_space=True). As a result, either ' ' or (' ', 'space') will be inserted between tokens.

By default, xml entities like &quot; and &amp; are replaced by the corresponding characters. You can turn off this feature by specifying the parameter "replace_xmlentities=False", e.g. words(replace_xmlentities=False).
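
A minimal construction sketch; the root path and fileid pattern are illustrative only:

>>> from nltk.corpus.reader import IPIPANCorpusReader
>>> reader = IPIPANCorpusReader('/path/to/ipipan', r'(?!\.).*\.xml')
>>> reader.channels()                             # metadata values usable as fileid filters
>>> reader.tagged_words(disamb_only=False)[:5]    # keep all analyser tags, not only disambiguated ones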

categories(fileids=None)[source]
channels(fileids=None)[source]
domains(fileids=None)[source]
fileids(channels=None, domains=None, categories=None)[source]

Return a list of file identifiers for the fileids that make up this corpus.

paras(fileids=None, **kwargs)[source]
raw(fileids=None)[source]
sents(fileids=None, **kwargs)[source]
tagged_paras(fileids=None, **kwargs)[source]
tagged_sents(fileids=None, **kwargs)[source]
tagged_words(fileids=None, **kwargs)[source]
words(fileids=None, **kwargs)[source]
class nltk.corpus.reader.Pl196xCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.xmldocs.XMLCorpusReader

decode_tag(tag)[source]
head_len = 2770
paras(fileids=None, categories=None, textids=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None, textids=None)[source]
tagged_paras(fileids=None, categories=None, textids=None)[source]
tagged_sents(fileids=None, categories=None, textids=None)[source]
tagged_words(fileids=None, categories=None, textids=None)[source]
textids(fileids=None, categories=None)[source]

In the pl196x corpus each category is stored in a single file, and thus both methods provide identical functionality. In order to accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks, giving much more control to the user.

words(fileids=None, categories=None, textids=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – ie, tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
xml(fileids=None, categories=None)[source]
class nltk.corpus.reader.TEICorpusView(corpus_file, tagged, group_by_sent, group_by_para, tagset=None, head_len=0, textids=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream
class nltk.corpus.reader.KNBCorpusReader(root, fileids, encoding='utf8', morphs2str=<function <lambda>>)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

This class implements:
  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of list of words.
  • _tag, which takes a block and returns a list of list of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others …)
>>> from nltk.corpus.util import LazyCorpusLoader
>>> knbc = LazyCorpusLoader(
...     'knbc/corpus1',
...     KNBCorpusReader,
...     r'.*/KN.*',
...     encoding='euc-jp',
... )
>>> len(knbc.sents()[0])
9
class nltk.corpus.reader.ChasenCorpusReader(root, fileids, encoding='utf8', sent_splitter=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

paras(fileids=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_paras(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.CHILDESCorpusReader(root, fileids, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at http://childes.psy.cmu.edu/. The XML version of CHILDES is located at http://childes.psy.cmu.edu/data-xml/. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/).

For access to the file text use the usual nltk functions, words(), sents(), tagged_words() and tagged_sents().
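
A brief usage sketch, assuming the Valian portion of the English-USA CHILDES XML data has been copied into nltk_data as described above (following the NLTK CHILDES howto):

>>> import nltk
>>> from nltk.corpus.reader import CHILDESCorpusReader
>>> corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA/')
>>> valian = CHILDESCorpusReader(corpus_root, 'Valian/.*.xml')
>>> valian.words('Valian/01a.xml')[:5]             # word tokens from one transcript
>>> valian.age('Valian/01a.xml', month=True)       # target child's age in months
>>> valian.MLU('Valian/01a.xml')                   # mean length of utterance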

MLU(fileids=None, speaker='CHI')[source]
Returns:the mean length of utterance (MLU) for the given file(s), as one floating-point value per file
Return type:list(float)
age(fileids=None, speaker='CHI', month=False)[source]
Returns:the age of the target child for the given file(s), as a string or an int
Return type:list or int
Parameters:month – If true, return months instead of year-month-date
childes_url_base = 'http://childes.psy.cmu.edu/browser/index.php?url='
convert_age(age_year)[source]

Calculate age in months from a string in CHILDES format

corpus(fileids=None)[source]
Returns:the given file(s) as a dict of (corpus_property_key, value)
Return type:list(dict)
participants(fileids=None)[source]
Returns:the given file(s) as a dict of (participant_property_key, value)
Return type:list(dict)
sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
webview_file(fileid, urlbase=None)[source]

Map a corpus file to its web version on the CHILDES website, and open it in a web browser.

The complete URL to be used is:
childes.childes_url_base + urlbase + fileid.replace(‘.xml’, ‘.cha’)

If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???

The function first looks (as a special case) if “Eng-USA” is on the path consisting of <corpus root>+fileid; then if “childes”, possibly followed by “data-xml”, appears. If neither one is found, we use the unmodified fileid and hope for the best. If this is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase=’Eng-USA/Cornell’.

words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of words

Return type:

list(str)

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
class nltk.corpus.reader.AlignedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=RegexpTokenizer(pattern='n', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
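
The comtrans sample shipped with NLTK is read with this class; a brief sketch, assuming nltk.download('comtrans'):

>>> from nltk.corpus import comtrans
>>> als = comtrans.aligned_sents()[0]      # an AlignedSent object
>>> als.words[:5], als.mots[:5]            # source and target tokens
>>> als.alignment                          # the word-level alignment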

aligned_sents(fileids=None)[source]
Returns:the given file(s) as a list of AlignedSent objects.
Return type:list(AlignedSent)
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.TimitTaggedCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for tagged sentences that are included in the TIMIT corpus.

paras()[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
tagged_paras()[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
class nltk.corpus.reader.LinThesaurusCorpusReader(root, badscore=0.0)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.
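
A brief usage sketch, assuming nltk.download('lin_thesaurus'); 'simN.lsp' is the noun sub-thesaurus in the distributed data:

>>> from nltk.corpus import lin_thesaurus as thes
>>> thes.synonyms('car', fileid='simN.lsp')[:5]    # nearest neighbours of 'car'
>>> thes.similarity('car', 'automobile')           # similarity scores per thesaurus fileid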

scored_synonyms(ngram, fileid=None)[source]

Returns a list of scored synonyms (tuples of synonyms and scores) for the current ngram

Parameters:
  • ngram (C{string}) – ngram to lookup
  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.

similarity(ngram1, ngram2, fileid=None)[source]

Returns the similarity score for two ngrams.

Parameters:
  • ngram1 (C{string}) – first ngram to compare
  • ngram2 (C{string}) – second ngram to compare
  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.

synonyms(ngram, fileid=None)[source]

Returns a list of synonyms for the current ngram.

Parameters:
  • ngram (C{string}) – ngram to lookup
  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.

class nltk.corpus.reader.SemcorCorpusReader(root, fileids, wordnet, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().
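
A brief usage sketch, assuming nltk.download('semcor') and nltk.download('wordnet') (needed for the lemma objects used in semantic tags):

>>> from nltk.corpus import semcor
>>> semcor.words()[:7]                       # plain word tokens
>>> semcor.tagged_chunks(tag='sem')[:2]      # chunks tagged with WordNet lemmas
>>> semcor.tagged_sents(tag='both')[0][:3]   # first chunks of the first sentence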

chunk_sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of chunks.
Return type:list(list(list(str)))
chunks(fileids=None)[source]
Returns:the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit.
Return type:list(list(str))
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of word strings.
Return type:list(list(str))
tagged_chunks(fileids=None, tag='pos')[source]
Returns:the given file(s) as a list of tagged chunks, represented in tree form.
Return type:list(Tree)
Parameters:tag‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
tagged_sents(fileids=None, tag='pos')[source]
Returns:the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form).
Return type:list(list(Tree))
Parameters:tag‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.FramenetCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

A corpus reader for the Framenet Corpus.

>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
annotations(luNamePattern=None, exemplars=True, full_text=True)[source]

Frame annotation sets matching the specified criteria.

buildindexes()[source]

Build the internal indexes to make look-ups faster.

doc(fn_docid)[source]

Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the docs_metadata() function.

The dict that is returned from this function will contain the following keys:

  • ‘_type’ : ‘fulltextannotation’

  • ‘sentence’ : a list of sentences in the document
    • Each item in the list is a dict containing the following keys:
      • ‘ID’ : the ID number of the sentence

      • ‘_type’ : ‘sentence’

      • ‘text’ : the text of the sentence

      • ‘paragNo’ : the paragraph number

      • ‘sentNo’ : the sentence number

      • ‘docID’ : the document ID number

      • ‘corpID’ : the corpus ID number

      • ‘aPos’ : the annotation position

      • ‘annotationSet’ : a list of annotation layers for the sentence
        • Each item in the list is a dict containing the following keys:
          • ‘ID’ : the ID number of the annotation set

          • ‘_type’ : ‘annotationset’

          • ‘status’ : either ‘MANUAL’ or ‘UNANN’

          • ‘luName’ : (only if status is ‘MANUAL’)

          • ‘luID’ : (only if status is ‘MANUAL’)

          • ‘frameID’ : (only if status is ‘MANUAL’)

          • ‘frameName’: (only if status is ‘MANUAL’)

          • ‘layer’ : a list of labels for the layer
            • Each item in the layer is a dict containing the following keys:

              • ‘_type’: ‘layer’
              • ‘rank’
              • ‘name’
              • ‘label’ : a list of labels in the layer
                • Each item is a dict containing the following keys:
                  • ‘start’
                  • ‘end’
                  • ‘name’
                  • ‘feID’ (optional)
Parameters:fn_docid (int) – The Framenet id number of the document
Returns:Information about the annotated document
Return type:dict
docs(name=None)[source]

Return a list of the annotated full-text documents in FrameNet, optionally filtered by a regex to be matched against the document name.

docs_metadata(name=None)[source]

Return an index of the annotated documents in Framenet.

Details for a specific annotated document can be obtained using this class’s doc() function and pass it the value of the ‘ID’ field.

>>> from nltk.corpus import framenet as fn
>>> len(fn.docs()) in (78, 107) # FN 1.5 and 1.7, resp.
True
>>> set([x.corpname for x in fn.docs_metadata()]) >= set(['ANC', 'KBEval', 'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank'])
True
Parameters:name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”.
Returns:A list of selected (or all) annotated documents
Return type:list of dicts, where each dict object contains the following keys:
  • ’name’
  • ’ID’
  • ’corpid’
  • ’corpname’
  • ’description’
  • ’filename’
exemplars(luNamePattern=None, frame=None, fe=None, fe2=None)[source]

Lexicographic exemplar sentences, optionally filtered by LU name and/or 1-2 FEs that are realized overtly. ‘frame’ may be a name pattern, frame ID, or frame instance. ‘fe’ may be a name pattern or FE instance; if specified, ‘fe2’ may also be specified to retrieve sentences with both overt FEs (in either order).

fe_relations()[source]

Obtain a list of frame element relations.

>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels) in (10020, 12393)   # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(ferels[0], breakLines=True)
{'ID': 14642,
'_type': 'ferelation',
'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>,
'subFE': <fe ID=11370 name=Degree>,
'subFEName': 'Degree',
'subFrame': <frame ID=1904 name=Lively_place>,
'subID': 11370,
'supID': 2271,
'superFE': <fe ID=2271 name=Degree>,
'superFEName': 'Degree',
'superFrame': <frame ID=262 name=Abounding_with>,
'type': <framerelationtype ID=1 name=Inheritance>}
Returns:A list of all of the frame element relations in framenet
Return type:list(dict)
fes(name=None, frame=None)[source]

Lists frame element objects. If ‘name’ is provided, this is treated as a case-insensitive regular expression to filter by frame element name. (Case-insensitivity is because casing of frame element names is not always consistent across frames.) Specify ‘frame’ to filter by a frame name pattern, ID, or object.

>>> from nltk.corpus import framenet as fn
>>> fn.fes('Noise_maker')
[<fe ID=6043 name=Noise_maker>]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound')])
[('Cause_to_make_noise', 'Sound_maker'), ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source'), ('Sound_movement', 'Location_of_sound_source'),
 ('Sound_movement', 'Sound'), ('Sound_movement', 'Sound_source'),
 ('Sounds', 'Component_sound'), ('Sounds', 'Location_of_sound_source'),
 ('Sounds', 'Sound_source'), ('Vocalizations', 'Location_of_sound_source'),
 ('Vocalizations', 'Sound_source')]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound',r'(?i)make_noise')])
[('Cause_to_make_noise', 'Sound_maker'),
 ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source')]
>>> sorted(set(fe.name for fe in fn.fes('^sound')))
['Sound', 'Sound_maker', 'Sound_source']
>>> len(fn.fes('^sound$'))
2
Parameters:name (str) – A regular expression pattern used to match against frame element names. If ‘name’ is None, then a list of all frame elements will be returned.
Returns:A list of matching frame elements
Return type:list(AttrDict)
frame(fn_fid_or_fname, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s name or id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation')
frame (1494): Imposing_obligation...

The dict that is returned from this function will contain the following information about the Frame (a short access sketch follows this entry):

  • ‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)

  • ‘definition’ : textual definition of the Frame

  • ‘ID’ : the internal ID number of the Frame

  • ‘semTypes’ : a list of semantic types for this frame
    • Each item in the list is a dict containing the following keys:
      • ‘name’ : can be used with the semtype() function
      • ‘ID’ : can be used with the semtype() function
  • ‘lexUnit’ : a dict containing all of the LUs for this frame.

    The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)

  • ‘FE’ : a dict containing the Frame Elements that are part of this frame

    The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys

    • ‘definition’ : The definition of the FE
    • ‘name’ : The name of the FE e.g. ‘Body_system’
    • ‘ID’ : The id number
    • ‘_type’ : ‘fe’
    • ‘abbrev’ : Abbreviation e.g. ‘bod’
    • ‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”
    • ‘semType’ : if not None, a dict with the following two keys:
      • ‘name’ : name of the semantic type. can be used with
        the semtype() function
      • ‘ID’ : id number of the semantic type. can be used with
        the semtype() function
    • ‘requiresFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
    • ‘excludesFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
  • ‘frameRelation’ : a list of objects describing frame relations

  • ‘FEcoreSets’ : a list of Frame Element core sets for this frame
    • Each item in the list is a list of FE objects
Parameters:
  • fn_fid_or_fname (int or str) – The Framenet name or id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict
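A short access sketch for the structures listed above, using the frame from the earlier examples:

from nltk.corpus import framenet as fn

f = fn.frame('Medical_specialties')
print(f.name, f.ID)
# 'FE' maps Frame Element names to their records; each record carries a coreType.
for fe_name, fe in sorted(f.FE.items()):
    print(fe_name, fe.coreType)
# 'lexUnit' maps LU names (of the form 'lemma.POS') to LU records; see lu() for their contents.
print(sorted(f.lexUnit.keys())[:5])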

frame_by_id(fn_fid, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fid (int) – The Framenet id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_by_name(fn_fname, ignorekeys=[], check_cache=True)[source]

Get the details for the specified Frame using the frame’s name.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fname (str) – The name of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_ids_and_names(name=None)[source]

Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.

frame_relation_types()[source]

Obtain a list of frame relation types.

>>> from nltk.corpus import framenet as fn
>>> frts = sorted(fn.frame_relation_types(), key=itemgetter('ID'))
>>> isinstance(frts, list)
True
>>> len(frts) in (9, 10)    # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1,
 '_type': 'framerelationtype',
 'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...],
 'name': 'Inheritance',
 'subFrameName': 'Child',
 'superFrameName': 'Parent'}
Returns:A list of all of the frame relation types in framenet
Return type:list(dict)
frame_relations(frame=None, frame2=None, type=None)[source]
Parameters:
  • frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned
  • frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction
  • type – (optional) frame relation type (name or object); show only relations of this type
Returns:A list of all of the frame relations in framenet
Return type:list(dict)

>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels) in (1676, 2070)  # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(274), breakLines=True)
[<Parent=Avoiding -- Inheritance -> Child=Dodging>,
 <Parent=Avoiding -- Inheritance -> Child=Evading>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True)
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>,
<MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
frames(name=None)[source]

Obtain details for a specific frame.

>>> from nltk.corpus import framenet as fn
>>> len(fn.frames()) in (1019, 1221)    # FN 1.5 and 1.7, resp.
True
>>> x = PrettyList(fn.frames(r'(?i)crim'), maxReprSize=0, breakLines=True)
>>> x.sort(key=itemgetter('ID'))
>>> x
[<frame ID=200 name=Criminal_process>,
 <frame ID=500 name=Criminal_investigation>,
 <frame ID=692 name=Crime_scenario>,
 <frame ID=700 name=Committing_crime>]

A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et. al., 2010):

A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.

We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).

FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:

  • Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
  • Using: The child frame presupposes the parent frame as background, e.g the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
  • Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.
  • Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.
Parameters:name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned.
Returns:A list of matching Frames (or all Frames).
Return type:list(AttrDict)
frames_by_lemma(pat)[source]

Returns a list of all frames that contain LUs in which the name attribute of the LU matches the given regular expression pat. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).

Note: if you are going to be doing a lot of this type of searching, you would want to build an index that maps from lemmas to frames, because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db. (A sketch of such an index follows this entry.)

>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyList
>>> PrettyList(sorted(fn.frames_by_lemma(r'(?i)a little'), key=itemgetter('ID'))) 
[<frame ID=189 name=Quanti...>, <frame ID=2001 name=Degree>]
Returns:A list of frame objects.
Return type:list(AttrDict)
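A sketch of the suggested lemma-to-frames index, built with a single pass over the frame files (the variable names are illustrative):

from collections import defaultdict
from nltk.corpus import framenet as fn

lemma_to_frames = defaultdict(set)
for f in fn.frames():
    for lu_name in f.lexUnit:             # LU names have the form "lemma.POS"
        lemma = lu_name.rsplit('.', 1)[0]
        lemma_to_frames[lemma].add(f.name)

# Lookups then become simple dict/set operations instead of scans of the XML files.
print(sorted(lemma_to_frames['bake']))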
ft_sents(docNamePattern=None)[source]

Full-text annotation sentences, optionally filtered by document name.

help(attrname=None)[source]

Display help information summarizing the main methods.

lu(fn_luid, ignorekeys=[], luName=None, frameID=None, frameName=None)[source]

Access a lexical unit by its ID. luName, frameID, and frameName are used only in the event that the LU does not have a file in the database (which is the case for LUs with “Problem” status); in this case, a placeholder LU is created which just contains its name, ID, and frame.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> pprint(list(map(PrettyDict, fn.lu(256).lexemes)))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]
>>> fn.lu(227).exemplars[23]
exemplar sentence (352962):
[sentNo] 0
[aPos] 59699508

[LU] (227) guess.v in Coming_to_believe

[frame] (23) Coming_to_believe

[annotationSet] 2 annotation sets

[POS] 18 tags

[POS_tagset] BNC

[GF] 3 relations

[PT] 3 phrases

[Other] 1 entry

[text] + [Target] + [FE]

When he was inside the house , Culley noticed the characteristic
                                              ------------------
                                              Content

he would n't have guessed at .
--                ******* --
Co                        C1 [Evidence:INI]
 (Co=Cognizer, C1=Content)

The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:

  • ‘name’ : the name of the LU (e.g. ‘merger.n’)

  • ‘definition’ : textual definition of the LU

  • ‘ID’ : the internal ID number of the LU

  • ‘_type’ : ‘lu’

  • ‘status’ : e.g. ‘Created’

  • ‘frame’ : Frame that this LU belongs to

  • ‘POS’ : the part of speech of this LU (e.g. ‘N’)

  • ‘totalAnnotated’ : total number of examples annotated with this LU

  • ‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)

  • ‘sentenceCount’ : a dict with the following two keys:
    • ‘annotated’: number of sentences annotated with this LU
    • ‘total’ : total number of sentences with this LU
  • ‘lexemes’ : a list of dicts describing the lemma of this LU.

    Each dict in the list contains these keys:

    • ‘POS’ : part of speech e.g. ‘N’

    • ‘name’ : either single-lexeme e.g. ‘merger’ or multi-lexeme e.g. ‘a little’

    • ‘order’: the order of the lexeme in the lemma (starting from 1)

    • ‘headword’: a boolean (‘true’ or ‘false’)

    • ‘breakBefore’: Can this lexeme be separated from the previous lexeme?
      Consider: “take over.v” as in:

      Germany took over the Netherlands in 2 days. Germany took the Netherlands over in 2 days.

      In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:

      Mary takes after her grandmother.

      *Mary takes her grandmother after.

      In this case, ‘breakBefore’ would be “false” for the lexeme “after”

  • ‘lemmaID’ : Can be used to connect lemmas in different LUs

  • ‘semTypes’ : a list of semantic type objects for this LU

  • ‘subCorpus’ : a list of subcorpora
    • Each item in the list is a dict containing the following keys:
      • ‘name’ :
      • ‘sentence’ : a list of sentences in the subcorpus
        • each item in the list is a dict with the following keys:
          • ‘ID’:
          • ‘sentNo’:
          • ‘text’: the text of the sentence
          • ‘aPos’:
          • ‘annotationSet’: a list of annotation sets
            • each item in the list is a dict with the following keys:
              • ‘ID’:
              • ‘status’:
              • ‘layer’: a list of layers
                • each layer is a dict containing the following keys:
                  • ‘name’: layer name (e.g. ‘BNC’)
                  • ‘rank’:
                  • ‘label’: a list of labels for the layer
                    • each label is a dict containing the following keys:
                      • ‘start’: start pos of label in sentence ‘text’ (0-based)
                      • ‘end’: end pos of label in sentence ‘text’ (0-based)
                      • ‘name’: name of label (e.g. ‘NN1’)

Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.

Parameters:
  • fn_luid (int) – The id number of the lexical unit
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

All information about the lexical unit

Return type:

dict

lu_basic(fn_luid)[source]

Returns basic information about the LU whose id is fn_luid. This is basically just a wrapper around the lu() function with “subCorpus” info excluded.

>>> from nltk.corpus import framenet as fn
>>> lu = PrettyDict(fn.lu_basic(256), breakLines=True)
>>> # ellipses account for differences between FN 1.5 and 1.7
>>> lu 
{'ID': 256,
 'POS': 'V',
 'URL': u'https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu256.xml',
 '_type': 'lu',
 'cBy': ...,
 'cDate': '02/08/2001 01:27:50 PST Thu',
 'definition': 'COD: be aware of beforehand; predict.',
 'definitionMarkup': 'COD: be aware of beforehand; predict.',
 'frame': <frame ID=26 name=Expectation>,
 'lemmaID': 15082,
 'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}],
 'name': 'foresee.v',
 'semTypes': [],
 'sentenceCount': {'annotated': ..., 'total': ...},
 'status': 'FN1_Sent'}
Parameters:fn_luid (int) – The id number of the desired LU
Returns:Basic information about the lexical unit
Return type:dict
lu_ids_and_names(name=None)[source]

Uses the LU index, which is much faster than looking up each LU definition if only the names and IDs are needed.

lus(name=None, frame=None)[source]

Obtain details for lexical units. Optionally restrict by lexical unit name pattern, and/or to a certain frame or frames whose name matches a pattern.

>>> from nltk.corpus import framenet as fn
>>> len(fn.lus()) in (11829, 13572) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(sorted(fn.lus(r'(?i)a little'), key=itemgetter('ID')), maxReprSize=0, breakLines=True)
[<lu ID=14733 name=a little.n>,
 <lu ID=14743 name=a little.adv>,
 <lu ID=14744 name=a little bit.adv>]
>>> PrettyList(sorted(fn.lus(r'interest', r'(?i)stimulus'), key=itemgetter('ID')))
[<lu ID=14894 name=interested.a>, <lu ID=14920 name=interesting.a>]

A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et. al., 2010):

A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.

We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:

  • Apply_heat: “Michelle baked the potatoes for 45 minutes.”
  • Cooking_creation: “Michelle baked her mother a cake for her birthday.”
  • Absorb_heat: “The potatoes have to bake for more than 30 minutes.”

These constitute three different LUs, with different definitions.

Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.

Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.

Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.

In the simplest case, frame-evoking words are verbs such as “fried” in:

“Matilde fried the catfish in a heavy iron skillet.”

Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:

“…the reduction of debt levels to $665 million from $2.6 billion.”

Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:

“They were asleep for hours.”

Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.

Parameters:name (str) –

A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows it. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned. (An example of selecting LUs by POS follows this entry.)

The valid POSes are:

  • v - verb
  • n - noun
  • a - adjective
  • adv - adverb
  • prep - preposition
  • num - numbers
  • intj - interjection
  • art - article
  • c - conjunction
  • scon - subordinating conjunction
Returns:A list of selected (or all) lexical units
Return type:list of LU objects (dicts). See the lu() function for info about the specifics of LU objects.
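Because LU names end in “.POS”, the name pattern can also select LUs by part of speech; a minimal sketch (the lemma chosen here is illustrative):

from nltk.corpus import framenet as fn

# Verb LUs whose lemma is "bake"; the pattern is matched against the full LU name.
for lu in fn.lus(r'(?i)^bake\.v$'):
    print(lu.ID, lu.name, lu.frame.name)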
propagate_semtypes()[source]

Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)

>>> from nltk.corpus import framenet as fn
>>> x = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> fn.propagate_semtypes()
>>> y = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> y-x > 1000
True
readme()[source]

Return the contents of the corpus README.txt (or README) file.

semtype(key)[source]
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
Parameters:key (string or int) – The name, abbreviation, or id number of the semantic type
Returns:Information about a semantic type
Return type:dict
semtype_inherits(st, superST)[source]
semtypes()[source]

Obtain a list of semantic types.

>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes) in (73, 109) # FN 1.5 and 1.7, resp.
True
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'definitionMarkup', 'name', 'rootType', 'subTypes', 'superType']
Returns:A list of all of the semantic types in framenet
Return type:list(dict)
sents(exemplars=True, full_text=True)[source]

Annotated sentences matching the specified criteria.

warnings(v)[source]

Enable or disable warnings of data integrity issues as they are encountered. If v is truthy, warnings will be enabled.

(This is a function rather than just an attribute/property to ensure that if enabling warnings is the first action taken, the corpus reader is instantiated first.)

class nltk.corpus.reader.UdhrCorpusReader(root='udhr')[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

ENCODINGS = [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]
SKIP = {'Hungarian_Magyar-Unicode', 'Vietnamese-VIQR', 'Japanese_Nihongo-JIS', 'Magahi-UTF8', 'Esperanto-T61', 'Chinese_Mandarin-UTF8', 'Burmese_Myanmar-UTF8', 'Marathi-UTF8', 'Vietnamese-VPS', 'Navaho_Dine-Navajo-Navaho-font', 'Magahi-Agra', 'Russian_Russky-UTF8~', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117', 'Vietnamese-TCVN', 'Chinese_Mandarin-HZ', 'Burmese_Myanmar-WinResearcher', 'Lao-UTF8', 'Bhojpuri-Agra', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Amharic-Afenegus6..60375', 'Tamil-UTF8', 'Gujarati-UTF8', 'Czech-Latin2-err', 'Armenian-DallakHelv', 'Tigrinya_Tigrigna-VG2Main'}
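A minimal usage sketch with the udhr corpus distributed with NLTK; each fileid names the language and its encoding, and the reader selects the codec from ENCODINGS above:

from nltk.corpus import udhr

print(udhr.fileids()[:3])
print(udhr.words('English-Latin1')[:8])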
class nltk.corpus.reader.BNCCorpusReader(root, fileids, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
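A brief sketch of the word and tagged-word views, assuming the corpus was extracted to ‘BNC/Texts/’ as above:

from nltk.corpus.reader import BNCCorpusReader

bnc = BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
print(bnc.words()[:10])
# Simplified tags by default; pass c5=True for the detailed C5 tagset.
print(bnc.tagged_words(c5=True)[:5])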
sents(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_words(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
words(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
class nltk.corpus.reader.SentiWordNetCorpusReader(root, fileids, encoding='utf-8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

all_senti_synsets()[source]
senti_synset(*vals)[source]
senti_synsets(string, pos=None)[source]
unicode_repr()

Return repr(self).

class nltk.corpus.reader.SentiSynset(pos_score, neg_score, synset)[source]

Bases: object

neg_score()[source]
obj_score()[source]
pos_score()[source]
unicode_repr()

Return repr(self).
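A brief usage sketch; synset keys follow WordNet’s ‘lemma.pos.nn’ convention:

from nltk.corpus import sentiwordnet as swn

happy = swn.senti_synset('happy.a.01')
# Positivity, negativity, and objectivity scores sum to 1 for each synset.
print(happy.pos_score(), happy.neg_score(), happy.obj_score())
# All senti-synsets for a lemma, optionally restricted by POS.
print(list(swn.senti_synsets('slow', pos='a')))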

class nltk.corpus.reader.TwitterCorpusReader(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.

Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.

Construct a new Tweet corpus reader for a set of documents located at the given root directory.

If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:

from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids='.*\.json')

However, the recommended approach is to set the relevant directory as the value of the environmental variable TWITTER, and then invoke the reader as follows:

import os
root = os.environ['TWITTER']
reader = TwitterCorpusReader(root, '.*\.json')

If you want to work directly with the raw Tweets, the json library can be used:

import json
for tweet in reader.docs():
    print(json.dumps(tweet, indent=1, sort_keys=True))
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

docs(fileids=None)[source]

Returns the full Tweet objects, as specified by Twitter documentation on Tweets

Returns:the given file(s) as a list of dictionaries deserialised from JSON.
Return type:list(dict)

raw(fileids=None)[source]

Return the corpora in their raw form.

strings(fileids=None)[source]

Returns only the text content of Tweets in the file(s)

Returns:the given file(s) as a list of Tweets.
Return type:list(str)
tokenized(fileids=None)[source]
Returns:the given file(s) as a list of the text content of Tweets, each tokenized into a list of words, screen names, hashtags, URLs and punctuation symbols.
Return type:list(list(str))
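A short sketch combining these views, assuming the TWITTER environment variable is set as described above:

import os
from nltk.corpus import TwitterCorpusReader

reader = TwitterCorpusReader(os.environ['TWITTER'], r'.*\.json')
for text in reader.strings()[:2]:       # raw Tweet text
    print(text)
print(reader.tokenized()[0][:10])       # tokens of the first Tweet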
class nltk.corpus.reader.NKJPCorpusReader(root, fileids='.*')[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

HEADER_MODE = 2
RAW_MODE = 3
SENTS_MODE = 1
WORDS_MODE = 0
add_root(fileid)[source]

Add root if necessary to specified fileid.

fileids()[source]

Returns a list of file identifiers for the files that make up this corpus.

get_paths()[source]
header(fileids=None, **kwargs)[source]

Returns header(s) of specified fileids.

raw(fileids=None, **kwargs)[source]

Returns the raw text of the specified fileids.

sents(fileids=None, **kwargs)[source]

Returns sentences in specified fileids.

tagged_words(fileids=None, **kwargs)[source]

Call with specified tags as a list, e.g. tags=[‘subst’, ‘comp’]. Returns tagged words in specified fileids.

words(fileids=None, **kwargs)[source]

Returns words in specified fileids.

class nltk.corpus.reader.CrubadanCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader used to access the An Crubadan language n-gram files.

crubadan_to_iso(lang)[source]

Return ISO 639-3 code given internal Crubadan code

iso_to_crubadan(lang)[source]

Return internal Crubadan code based on ISO 639-3 code

lang_freq(lang)[source]

Return n-gram FreqDist for a specific language given ISO 639-3 language code

langs()[source]

Return a list of supported languages as ISO 639-3 codes
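A minimal sketch, assuming the crubadan data has been downloaded and includes English (ISO 639-3 code ‘eng’):

from nltk.corpus import crubadan

print(crubadan.langs()[:5])
eng_ngrams = crubadan.lang_freq('eng')   # an nltk FreqDist of character n-grams
print(eng_ngrams.most_common(5))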

class nltk.corpus.reader.MTECorpusReader(root=None, fileids=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

Reader for corpora following the TEI-p5 XML scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset.
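A minimal usage sketch, assuming the MULTEXT-East sample data is available as nltk.corpus.multext_east:

from nltk.corpus import multext_east

print(multext_east.words()[:8])
# MSD tags by default; the Universal tagset can be requested instead.
print(multext_east.tagged_words(tagset='universal')[:5])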

lemma_paras(fileids=None)[source]
Parameters:fileids – A list specifying the fileids that should be used.
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of tuples of the word and the corresponding lemma (word, lemma)
Return type:list(List(List(tuple(str, str))))
lemma_sents(fileids=None)[source]
Parameters:fileids – A list specifying the fileids that should be used.
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of tuples of the word and the corresponding lemma (word, lemma)
Return type:list(list(tuple(str, str)))
lemma_words(fileids=None)[source]
Parameters:fileids – A list specifying the fileids that should be used.
Returns:the given file(s) as a list of words, the corresponding lemmas and punctuation symbols, encoded as tuples (word, lemma)
Return type:list(tuple(str,str))
paras(fileids=None)[source]
Parameters:fileids – A list specifying the fileids that should be used.
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word string
Return type:list(list(list(str)))
raw(fileids=None)[source]
Parameters:fileids – A list specifying the fileids that should be used.
Returns:the given file(s) as a single string.
Return type:str
readme()[source]

Prints some information about this corpus.

Returns:the content of the attached README file
Return type:str

sents(fileids=None)[source]
Parameters:fileids – A list specifying the fileids that should be used.
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings
Return type:list(list(str))
tagged_paras(fileids=None, tagset='msd', tags='')[source]
Parameters:
  • fileids – A list specifying the fileids that should be used.
  • tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default
  • tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
Returns:

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word,tag) tuples

Return type:

list(list(list(tuple(str, str))))

tagged_sents(fileids=None, tagset='msd', tags='')[source]
Parameters:
  • fileids – A list specifying the fileids that should be used.
  • tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default
  • tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of (word,tag) tuples

Return type:

list(list(tuple(str, str)))

tagged_words(fileids=None, tagset='msd', tags='')[source]
Parameters:
  • fileids – A list specifying the fileids that should be used.
  • tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default
  • tags – An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag
Returns:

the given file(s) as a list of tagged words and punctuation symbols encoded as tuples (word, tag)

Return type:

list(tuple(str, str))

words(fileids=None)[source]
Parameters:fileids – A list specifying the fileids that should be used.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.ReviewsCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the Customer Review Data dataset by Hu and Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization.

>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am',
'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'),
('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'),
('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'),
('option', '+1')]

We can also reach the same information directly from the stream:

>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]

We can compute stats for specific product features:

>>> from __future__ import division
>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> # The __future__ import gives true division, for compatibility with Python 2.7
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

features(fileids=None)[source]

Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.

Parameters:fileids – a list or regexp specifying the ids of the files whose features have to be returned.
Returns:all features for the item(s) in the given file(s).
Return type:list(tuple)
raw(fileids=None)[source]
Parameters:fileids – a list or regexp specifying the fileids of the files that have to be returned as a raw string.
Returns:the given file(s) as a single string.
Return type:str
readme()[source]

Return the contents of the corpus README.txt file.

reviews(fileids=None)[source]

Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.

Parameters:fileids – a list or regexp specifying the ids of the files whose reviews have to be returned.
Returns:the given file(s) as a list of reviews.
sents(fileids=None)[source]

Return all sentences in the corpus or in the specified files.

Parameters:fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
Returns:the given file(s) as a list of sentences, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files.

Parameters:fileids – a list or regexp specifying the ids of the files whose words have to be returned.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.OpinionLexiconCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

Reader for Liu and Hu opinion lexicon. Blank lines and readme are ignored.

>>> from nltk.corpus import opinion_lexicon
>>> opinion_lexicon.words()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]

The OpinionLexiconCorpusReader provides shortcuts to retrieve positive/negative words:

>>> opinion_lexicon.negative()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]

Note that words from words() method are sorted by file id, not alphabetically:

>>> opinion_lexicon.words()[0:10]
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably',
'abominate', 'abomination', 'abort', 'aborted']
>>> sorted(opinion_lexicon.words())[0:10]
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably',
'abominate', 'abomination', 'abort']
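A short polarity-lookup sketch built on the two word lists (the test words are arbitrary):

from nltk.corpus import opinion_lexicon

pos_words = set(opinion_lexicon.positive())
neg_words = set(opinion_lexicon.negative())
for w in ('great', 'terrible', 'table'):
    label = 'positive' if w in pos_words else 'negative' if w in neg_words else 'unknown'
    print(w, label)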
CorpusView

alias of IgnoreReadmeCorpusView

negative()[source]

Return all negative words in alphabetical order.

Returns:a list of negative words.
Return type:list(str)
positive()[source]

Return all positive words in alphabetical order.

Returns:a list of positive words.
Return type:list(str)
words(fileids=None)[source]

Return all words in the opinion lexicon. Note that these words are not sorted in alphabetical order.

Parameters:fileids – a list or regexp specifying the ids of the files whose words have to be returned.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.ProsConsCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), encoding='utf8', **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.api.CorpusReader

Reader for the Pros and Cons sentence dataset.

>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons')
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy',
'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'],
...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

sents(fileids=None, categories=None)[source]

Return all sentences in the corpus or in the specified files/categories.

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
  • categories – a list specifying the categories whose sentences have to be returned.
Returns:

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type:

list(list(str))

words(fileids=None, categories=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files/categories.

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose words have to be returned.
  • categories – a list specifying the categories whose words have to be returned.
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

class nltk.corpus.reader.CategorizedSentencesCorpusReader(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=None, encoding='utf8', **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.api.CorpusReader

A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows.

Examples using the Subjectivity Dataset:

>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23]
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits',
'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]

Examples using the Sentence Polarity Dataset:

>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents()
[['simplistic', ',', 'silly', 'and', 'tedious', '.'], ["it's", 'so', 'laddish',
'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find',
'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

raw(fileids=None, categories=None)[source]
Parameters:
  • fileids – a list or regexp specifying the fileids that have to be returned as a raw string.
  • categories – a list specifying the categories whose files have to be returned as a raw string.
Returns:

the given file(s) as a single string.

Return type:

str

readme()[source]

Return the contents of the corpus Readme.txt file.

sents(fileids=None, categories=None)[source]

Return all sentences in the corpus or in the specified file(s).

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
  • categories – a list specifying the categories whose sentences have to be returned.
Returns:

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type:

list(list(str))

words(fileids=None, categories=None)[source]

Return all words and punctuation symbols in the corpus or in the specified file(s).

Parameters:
  • fileids – a list or regexp specifying the ids of the files whose words have to be returned.
  • categories – a list specifying the categories whose words have to be returned.
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

class nltk.corpus.reader.ComparativeSentencesCorpusReader(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>), sent_tokenizer=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).

>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly',
'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve",
'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853
CorpusView

alias of nltk.corpus.reader.util.StreamBackedCorpusView

comparisons(fileids=None)[source]

Return all comparisons in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned.
Returns:the given file(s) as a list of Comparison objects.
Return type:list(Comparison)
keywords(fileids=None)[source]

Return a set of all keywords used in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose keywords have to be returned.
Returns:the set of keywords and comparative phrases used in the corpus.
Return type:set(str)
keywords_readme()[source]

Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).

raw(fileids=None)[source]
Parameters:fileids – a list or regexp specifying the fileids that have to be returned as a raw string.
Returns:the given file(s) as a single string.
Return type:str
readme()[source]

Return the contents of the corpus readme file.

sents(fileids=None)[source]

Return all sentences in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
Returns:all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified).
Return type:list(list(str)) or list(str)
words(fileids=None)[source]

Return all words and punctuation symbols in the corpus.

Parameters:fileids – a list or regexp specifying the ids of the files whose words have to be returned.
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.PanLexLiteCorpusReader(root)[source]

Bases: nltk.corpus.reader.api.CorpusReader

MEANING_Q = '\n SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n ORDER BY dnx2.uq DESC\n '
TRANSLATION_Q = '\n SELECT s.tt, sum(s.uq) AS trq FROM (\n SELECT ex2.tt, max(dnx.uq) AS uq\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n GROUP BY ex2.tt, dnx.ui\n ) s\n GROUP BY s.tt\n ORDER BY trq DESC, s.tt\n '
language_varieties(lc=None)[source]

Return a list of PanLex language varieties.

Parameters:lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned.
Returns:the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name.
Return type:list(tuple)
meanings(expr_uid, expr_tt)[source]

Return a list of meanings for an expression.

Parameters:
  • expr_uid – the expression’s language variety, as a seven-character uniform identifier.
  • expr_tt – the expression’s text.
Returns:

a list of Meaning objects.

Return type:

list(Meaning)

translations(from_uid, from_tt, to_uid)[source]

Return a list of translations for an expression into a single language variety.

Parameters:
  • from_uid – the source expression’s language variety, as a seven-character uniform identifier.
  • from_tt – the source expression’s text.
  • to_uid – the target language variety, as a seven-character uniform identifier.
Returns:a list of translation tuples. The first element is the expression text and the second element is the translation quality.
Return type:list(tuple)
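A minimal sketch, assuming the panlex_lite database has been downloaded; uniform identifiers combine an ISO 639 alpha-3 code with a variety number (e.g. ‘eng-000’):

from nltk.corpus import panlex_lite as plx

print(plx.language_varieties('eng')[:3])
print(plx.translations('eng-000', 'dog', 'spa-000')[:5])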
class nltk.corpus.reader.NonbreakingPrefixesCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

This is a class to read the nonbreaking prefixes textfiles from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses’ word tokenizer.

available_langs = {'ca': 'ca', 'catalan': 'ca', 'cs': 'cs', 'czech': 'cs', 'de': 'de', 'dutch': 'nl', 'el': 'el', 'en': 'en', 'english': 'en', 'es': 'es', 'fi': 'fi', 'finnish': 'fi', 'fr': 'fr', 'french': 'fr', 'german': 'de', 'greek': 'el', 'hu': 'hu', 'hungarian': 'hu', 'icelandic': 'is', 'is': 'is', 'it': 'it', 'italian': 'it', 'latvian': 'lv', 'lv': 'lv', 'nl': 'nl', 'pl': 'pl', 'polish': 'pl', 'portuguese': 'pt', 'pt': 'pt', 'ro': 'ro', 'romanian': 'ro', 'ru': 'ru', 'russian': 'ru', 'sk': 'sk', 'sl': 'sl', 'slovak': 'sk', 'slovenian': 'sl', 'spanish': 'es', 'sv': 'sv', 'swedish': 'sv', 'ta': 'ta', 'tamil': 'ta'}
words(lang=None, fileids=None, ignore_lines_startswith='#')[source]

Return a list of nonbreaking prefixes for the specified language(s).

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')[:10] == [u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J']
True
>>> nbp.words('ta')[:5] == [u'அ', u'ஆ', u'இ', u'ஈ', u'உ']
True
Returns:a list of words for the specified language(s).
class nltk.corpus.reader.UnicharsCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

This class is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm

available_categories = ['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
chars(category=None, fileids=None)[source]

Return a list of characters from the Perl Unicode Properties. They are very useful when porting Perl tokenizers to Python.

>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')[:5] == [u'(', u'[', u'{', u'༺', u'༼']
True
>>> pup.chars('Currency_Symbol')[:5] == [u'$', u'¢', u'£', u'¤', u'¥']
True
>>> pup.available_categories
['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
Returns:a list of characters given the specific unicode character category
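For instance, a character category can be turned into a regular-expression character class when porting a Perl tokenizer; a minimal sketch:

import re
from nltk.corpus import perluniprops as pup

open_punct = ''.join(pup.chars('Open_Punctuation'))
pattern = re.compile('[' + re.escape(open_punct) + ']')
print(bool(pattern.search('e.g. (see above)')))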
class nltk.corpus.reader.MWAPPDBCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015):

The original source of the full PPDB corpus can be found on http://www.cis.upenn.edu/~ccb/ppdb/

Returns:a list of tuples of similar lexical terms.
entries(fileids='ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs')[source]
Returns:a list of tuples of synonym word pairs.
mwa_ppdb_xxxl_file = 'ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs'