nltk.corpus.reader package¶
Submodules¶
nltk.corpus.reader.aligned module¶
-
class
nltk.corpus.reader.aligned.
AlignedCorpusReader
(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
-
aligned_sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of AlignedSent objects. Return type: list(AlignedSent)
-
-
class
nltk.corpus.reader.aligned.
AlignedSentCorpusView
(corpus_file, encoding, aligned, group_by_sent, word_tokenizer, sent_tokenizer, alignedsent_block_reader)[source]¶ Bases:
nltk.corpus.reader.util.StreamBackedCorpusView
A specialized corpus view for aligned sentences.
AlignedSentCorpusView objects are typically created by AlignedCorpusReader (not directly by nltk users).
nltk.corpus.reader.api module¶
API for corpus readers.
-
class
nltk.corpus.reader.api.
CategorizedCorpusReader
(kwargs)[source]¶ Bases:
object
A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.
Subclasses are expected to:
- Call __init__() to set up the mapping.
- Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.
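The selection logic that a categorized reader needs can be sketched in plain Python. This is only an illustration of the idea, not NLTK's implementation: the function name `resolve_fileids`, the `CAT_MAP` table, and the file names are all hypothetical.

```python
# Hypothetical category -> fileids table (not taken from any real corpus).
CAT_MAP = {"news": ["a.txt", "b.txt"], "fiction": ["c.txt"]}


def resolve_fileids(fileids=None, categories=None, cat_map=CAT_MAP):
    """Return the fileids selected by either `fileids` or `categories`."""
    if fileids is not None and categories is not None:
        raise ValueError("Specify fileids or categories, not both")
    if categories is not None:
        if isinstance(categories, str):
            categories = [categories]
        # Union of the fileids of every requested category, sorted for stability.
        return sorted({f for c in categories for f in cat_map[c]})
    if fileids is None:
        # No restriction: every fileid known to the mapping.
        return sorted({f for fs in cat_map.values() for f in fs})
    return [fileids] if isinstance(fileids, str) else list(fileids)
```

A view method in a subclass would call a helper of this kind first and then read only the files it returns.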
-
class
nltk.corpus.reader.api.
CorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
object
A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its file identifier, which is the relative path to the file from the root directory.
A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as words() (for a list of words) and parsed_sents() (for a list of parsed sentences). Called with no arguments, these methods will return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such as fileids or categories, which can be used to select which portion of the corpus should be returned.
-
abspath
(fileid)[source]¶ Return the absolute path for the given file.
Parameters: fileid (str) – The file identifier for the file whose path should be returned. Return type: PathPointer
-
abspaths
(fileids=None, include_encoding=False, include_fileid=False)[source]¶ Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.
Parameters: - fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.
- include_encoding – If true, then return a list of (path_pointer, encoding) tuples.
Return type: list(PathPointer)
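The way the fileids argument is normalised can be illustrated with a small stand-in. This is a sketch under stated assumptions: `abspaths_sketch` is a hypothetical name, paths are plain strings rather than PathPointer objects, and one encoding is assumed for every file.

```python
import os


def abspaths_sketch(root, all_fileids, fileids=None, include_encoding=False,
                    encoding="utf8"):
    """Hypothetical stand-in for the behaviour described above."""
    if fileids is None:
        fileids = all_fileids          # None selects every file
    elif isinstance(fileids, str):
        fileids = [fileids]            # a single identifier still yields a list
    paths = [os.path.join(root, f) for f in fileids]
    if include_encoding:
        # Pair each path with its (assumed uniform) encoding.
        return [(p, encoding) for p in paths]
    return paths
```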
-
encoding
(file)[source]¶ Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.
-
ensure_loaded
()[source]¶ Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).
-
open
(file)[source]¶ Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.
Parameters: file – The file identifier of the file to read.
-
root
¶ The directory where this corpus is stored.
Type: PathPointer
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.api.
SyntaxCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:
__init__
, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files._read_block
, which reads a block from the input stream._word
, which takes a block and returns a list of list of words._tag
, which takes a block and returns a list of list of tagged words._parse
, which takes a block and returns a list of parsed sentences.
nltk.corpus.reader.bnc module¶
Corpus reader for the XML version of the British National Corpus.
-
class
nltk.corpus.reader.bnc.
BNCCorpusReader
(root, fileids, lazy=True)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for the XML version of the British National Corpus.
For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().
You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554
If you extracted the archive to a directory called BNC, then you can instantiate the reader as:
BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
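To see which relative paths a fileids pattern like the one above would select, the regular expression can be tried on a few candidate paths. The paths below are made up for illustration, and the sketch assumes the pattern has to match the whole relative path.

```python
import re

# The fileids pattern from the instantiation example above.
pattern = r'[A-K]/\w*/\w*\.xml'

# Hypothetical relative paths under the corpus root.
candidates = ["A/A0/A00.xml", "K/KS/KSW.xml", "Z/Z0/Z00.xml", "A/A0/A00.txt"]

# Keep only the paths that the pattern matches in full.
matching = [p for p in candidates if re.fullmatch(pattern, p)]
```

Here only the first two candidates are kept: the third starts outside the A-K range and the fourth does not end in .xml.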
-
sents
(fileids=None, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type: list(list(str))
Parameters: - strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
tagged_sents
(fileids=None, c5=False, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type: list(list(tuple(str,str)))
Parameters: - c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
tagged_words
(fileids=None, c5=False, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type: list(tuple(str,str))
Parameters: - c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
words
(fileids=None, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
Parameters: - strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
-
class
nltk.corpus.reader.bnc.
BNCSentence
(num, items)[source]¶ Bases:
list
A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).
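The pattern of a list that carries one extra attribute can be sketched as follows. `NumberedSentence` is a hypothetical name used for illustration; it is not the BNCSentence class itself.

```python
class NumberedSentence(list):
    """A list of words that also remembers a sentence identifier."""

    def __init__(self, num, items):
        super().__init__(items)
        self.num = num  # sentence identifier, e.g. the XML "n" attribute


sentence = NumberedSentence("12", ["Hello", "world"])
```

Because the class derives from list, indexing, slicing, and iteration behave as usual, while `sentence.num` gives access to the identifier.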
-
class
nltk.corpus.reader.bnc.
BNCWordView
(fileid, sent, tag, strip_space, stem)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusView
A stream backed corpus view specialized for use with the BNC corpus.
Author of the document.
-
editor
= None¶ Editor
-
resps
= None¶ Statement of responsibility
These tags are ignored. For their description refer to the technical documentation, for example, http://www.natcorp.ox.ac.uk/docs/URG/ref-vocal.html
-
title
= None¶ Title of the document.
nltk.corpus.reader.bracket_parse module¶
Corpus reader for corpora that consist of parenthesis-delineated parse trees.
-
class
nltk.corpus.reader.bracket_parse.
AlpinoCorpusReader
(root, encoding='ISO-8859-1', tagset=None)[source]¶ Bases:
nltk.corpus.reader.bracket_parse.BracketParseCorpusReader
Reader for the Alpino Dutch Treebank. This corpus has a lexical breakdown structure embedded, as read by _parse. Unfortunately, this puts punctuation and some other words out of sentence order in the XML element tree, which is no good for tag_ and word_. Therefore _tag and _word are overridden to pass a new, non-default parameter ‘ordered’ to the overridden _normalize function. The _parse function can then remain untouched.
-
class
nltk.corpus.reader.bracket_parse.
BracketParseCorpusReader
(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.SyntaxCorpusReader
Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.
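To make the bracketed format concrete, here is a small stand-alone parser for strings like the example above. It is a sketch, not the reader's own parsing code; the names `parse_bracketed` and `leaves` are hypothetical, and it assumes every opening parenthesis is followed by a node label.

```python
import re


def parse_bracketed(text):
    """Parse '(LABEL child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]            # assumption: a label follows every "("
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])   # a leaf word
                pos += 1
        pos += 1                       # consume ")"
        return (label, children)

    return parse_node()


def leaves(node):
    """Return the words of a parsed tree in sentence order."""
    _, children = node
    out = []
    for child in children:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


tree = parse_bracketed("(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))")
```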
-
class
nltk.corpus.reader.bracket_parse.
CategorizedBracketParseCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.bracket_parse.BracketParseCorpusReader
A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>
nltk.corpus.reader.categorized_sents module¶
CorpusReader structured for corpora that contain one instance on each row. This CorpusReader is specifically used for the Subjectivity Dataset and the Sentence Polarity Dataset.
- Subjectivity Dataset information -
Authors: Bo Pang and Lillian Lee. Url: http://www.cs.cornell.edu/people/pabo/movie-review-data
Distributed with permission.
Related papers:
- Bo Pang and Lillian Lee. “A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts”. Proceedings of the ACL, 2004.
- Sentence Polarity Dataset information -
Authors: Bo Pang and Lillian Lee. Url: http://www.cs.cornell.edu/people/pabo/movie-review-data
Related papers:
- Bo Pang and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales”. Proceedings of the ACL, 2005.
-
class
nltk.corpus.reader.categorized_sents.
CategorizedSentencesCorpusReader
(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=None, encoding='utf8', **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.api.CorpusReader
A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows.
Examples using the Subjectivity Dataset:
>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23]
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits', 'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]
Examples using the Sentence Polarity Dataset:
>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents()
[['simplistic', ',', 'silly', 'and', 'tedious', '.'], ["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
raw
(fileids=None, categories=None)[source]¶ Parameters: - fileids – a list or regexp specifying the fileids that have to be returned as a raw string.
- categories – a list specifying the categories whose files have to be returned as a raw string.
Returns: the given file(s) as a single string.
Return type: str
-
sents
(fileids=None, categories=None)[source]¶ Return all sentences in the corpus or in the specified file(s).
Parameters: - fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- categories – a list specifying the categories whose sentences have to be returned.
Returns: the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
Return type: list(list(str))
-
words
(fileids=None, categories=None)[source]¶ Return all words and punctuation symbols in the corpus or in the specified file(s).
Parameters: - fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- categories – a list specifying the categories whose words have to be returned.
Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
-
nltk.corpus.reader.chasen module¶
-
class
nltk.corpus.reader.chasen.
ChasenCorpusReader
(root, fileids, encoding='utf8', sent_splitter=None)[source]¶
-
class
nltk.corpus.reader.chasen.
ChasenCorpusView
(corpus_file, encoding, tagged, group_by_sent, group_by_para, sent_splitter=None)[source]¶ Bases:
nltk.corpus.reader.util.StreamBackedCorpusView
A specialized corpus view for ChasenReader. Similar to TaggedCorpusView, but it uses a fixed word tokenizer and sentence tokenizer.
nltk.corpus.reader.childes module¶
Corpus reader for the XML version of the CHILDES corpus.
-
class
nltk.corpus.reader.childes.
CHILDESCorpusReader
(root, fileids, lazy=True)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at http://childes.psy.cmu.edu/. The XML version of CHILDES is located at http://childes.psy.cmu.edu/data-xml/. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/).
For access to the file text use the usual nltk functions, words(), sents(), tagged_words() and tagged_sents().
-
MLU
(fileids=None, speaker='CHI')[source]¶ Returns: the mean length of utterance (MLU) of the given speaker in each of the given file(s), as a floating-point number. Return type: list(float)
-
age
(fileids=None, speaker='CHI', month=False)[source]¶ Returns: the given file(s) as string or int Return type: list or int Parameters: month – If true, return months instead of year-month-date
-
childes_url_base
= 'http://childes.psy.cmu.edu/browser/index.php?url='¶
-
corpus
(fileids=None)[source]¶ Returns: the given file(s) as a dict of (corpus_property_key, value)
Return type: list(dict)
-
participants
(fileids=None)[source]¶ Returns: the given file(s) as a dict of (participant_property_key, value)
Return type: list(dict)
-
sents
(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type: list(list(str))
Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
tagged_sents
(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type: list(list(tuple(str,str)))
Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
tagged_words
(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type: list(tuple(str,str))
Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of (stem, index, dependent_index)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
webview_file
(fileid, urlbase=None)[source]¶ Map a corpus file to its web version on the CHILDES website, and open it in a web browser.
The complete URL to be used is: childes.childes_url_base + urlbase + fileid.replace('.xml', '.cha')
If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???
The function first looks (as a special case) if “Eng-USA” is on the path consisting of <corpus root>+fileid; then if “childes”, possibly followed by “data-xml”, appears. If neither one is found, we use the unmodified fileid and hope for the best. If this is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase=’Eng-USA/Cornell’.
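The URL construction described above can be sketched without opening a browser. The helper name `webview_url` and the file name `example.xml` are hypothetical, and joining urlbase and fileid with a single "/" is an assumption of this sketch rather than something the description states.

```python
# Base URL as given by the childes_url_base attribute above.
CHILDES_URL_BASE = 'http://childes.psy.cmu.edu/browser/index.php?url='


def webview_url(fileid, urlbase):
    """Build the web address for a corpus file (sketch only)."""
    # Assumption: urlbase and fileid are joined with exactly one "/".
    path = urlbase.rstrip('/') + '/' + fileid.replace('.xml', '.cha')
    return CHILDES_URL_BASE + path


url = webview_url('example.xml', 'Eng-USA/Cornell')
```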
-
words
(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of words
Return type: list(str)
Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of (stem, index, dependent_index)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
nltk.corpus.reader.chunked module¶
A reader for corpora that contain chunked (and optionally tagged) documents.
-
class
nltk.corpus.reader.chunked.
ChunkedCorpusReader
(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.
-
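The three default steps (blank-line paragraphs, one sentence per line, string-to-chunk conversion) can be imitated in plain Python. This is a toy sketch: `read_paragraphs` and `parse_chunked_sentence` are hypothetical names, the sample text is made up, and labelling every bracketed span "NP" is an assumption of the example.

```python
import re


def read_paragraphs(text):
    """Split on blank lines, then on newlines, mirroring the default steps."""
    paras = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    return [[s for s in p.split("\n") if s.strip()] for p in paras]


def parse_chunked_sentence(sent):
    """Toy stand-in for a string-to-chunktree function.

    Bracketed spans become ("NP", [...]) chunks; other tokens stay (word, tag).
    """
    result, chunk = [], None
    for tok in sent.split():
        if tok == "[":
            chunk = []                      # open a new chunk
        elif tok == "]":
            result.append(("NP", chunk))    # close the current chunk
            chunk = None
        else:
            word, _, tag = tok.rpartition("/")
            (chunk if chunk is not None else result).append((word, tag))
    return result


text = ("[ the/DT dog/NN ] barked/VBD ./.\n"
        "[ it/PRP ] ran/VBD ./.\n"
        "\n"
        "[ cats/NNS ] sleep/VBP ./.")
paras = read_paragraphs(text)
trees = [[parse_chunked_sentence(s) for s in p] for p in paras]
```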
chunked_paras
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type: list(list(Tree))
-
chunked_sents
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type: list(Tree)
-
chunked_words
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.
Return type: list(tuple(str,str) and Tree)
-
paras
(fileids=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
-
sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_paras
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type: list(list(list(tuple(str,str))))
-
tagged_sents
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type: list(list(tuple(str,str)))
-
nltk.corpus.reader.cmudict module¶
The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6] ftp://ftp.cs.cmu.edu/project/speech/dict/ Copyright 1998 Carnegie Mellon University
File Format: Each line consists of an uppercased word, a counter (for alternative pronunciations), and a transcription. Vowels are marked for stress (1=primary, 2=secondary, 0=no stress). E.g.: NATURAL 1 N AE1 CH ER0 AH0 L
The dictionary contains 127069 entries. Of these, 119400 words are assigned a unique pronunciation, 6830 words have two pronunciations, and 839 words have three or more pronunciations. Many of these are fast-speech variants.
Phonemes: There are 39 phonemes, as shown below:
Phoneme  Example  Translation    Phoneme  Example  Translation
-------  -------  -----------    -------  -------  -----------
AA       odd      AA D           AE       at       AE T
AH       hut      HH AH T        AO       ought    AO T
AW       cow      K AW           AY       hide     HH AY D
B        be       B IY           CH       cheese   CH IY Z
D        dee      D IY           DH       thee     DH IY
EH       Ed       EH D           ER       hurt     HH ER T
EY       ate      EY T           F        fee      F IY
G        green    G R IY N       HH       he       HH IY
IH       it       IH T           IY       eat      IY T
JH       gee      JH IY          K        key      K IY
L        lee      L IY           M        me       M IY
N        knee     N IY           NG       ping     P IH NG
OW       oat      OW T           OY       toy      T OY
P        pee      P IY           R        read     R IY D
S        sea      S IY           SH       she      SH IY
T        tea      T IY           TH       theta    TH EY T AH
UH       hood     HH UH D        UW       two      T UW
V        vee      V IY           W        we       W IY
Y        yield    Y IY L D       Z        zee      Z IY
ZH       seizure  S IY ZH ER
-
class
nltk.corpus.reader.cmudict.
CMUDictCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
-
dict
()[source]¶ Returns: the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.
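The file format and the shape of the dict() result can be illustrated with a short parser. This is a sketch based only on the format description above, not the reader's own code; the helper names `parse_line`, `build_lexicon`, and `stress_pattern` are hypothetical.

```python
from collections import defaultdict


def parse_line(line):
    """Split 'WORD COUNTER PHONE ...' into (word, counter, phones)."""
    word, counter, *phones = line.split()
    return word, int(counter), phones


def build_lexicon(lines):
    """Map lowercase words to a list of pronunciations."""
    lexicon = defaultdict(list)
    for line in lines:
        word, _, phones = parse_line(line)
        lexicon[word.lower()].append(phones)
    return dict(lexicon)


def stress_pattern(phones):
    """Collect the stress digits carried by the vowels (1, 2 or 0)."""
    return [int(p[-1]) for p in phones if p[-1].isdigit()]


lines = ["NATURAL 1 N AE1 CH ER0 AH0 L"]   # the example line quoted above
lexicon = build_lexicon(lines)
```

For the example line, `stress_pattern(lexicon["natural"][0])` gives one primary-stress vowel followed by two unstressed vowels.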
-
nltk.corpus.reader.comparative_sents module¶
CorpusReader for the Comparative Sentence Dataset.
- Comparative Sentence Dataset information -
- Annotated by: Nitin Jindal and Bing Liu, 2006. Department of Computer Science, University of Illinois at Chicago.
- Contact: Nitin Jindal, njindal@cs.uic.edu; Bing Liu, liub@cs.uic.edu (http://www.cs.uic.edu/~liub)
Distributed with permission.
Related papers:
- Nitin Jindal and Bing Liu. “Identifying Comparative Sentences in Text Documents”. Proceedings of the ACM SIGIR International Conference on Information Retrieval (SIGIR-06), 2006.
- Nitin Jindal and Bing Liu. “Mining Comparative Sentences and Relations”. Proceedings of Twenty First National Conference on Artificial Intelligence (AAAI-2006), 2006.
- Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”. Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.
-
class
nltk.corpus.reader.comparative_sents.
ComparativeSentencesCorpusReader
(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).
>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly', 'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve", 'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
comparisons
(fileids=None)[source]¶ Return all comparisons in the corpus.
Parameters: fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned. Returns: the given file(s) as a list of Comparison objects. Return type: list(Comparison)
-
keywords
(fileids=None)[source]¶ Return a set of all keywords used in the corpus.
Parameters: fileids – a list or regexp specifying the ids of the files whose keywords have to be returned. Returns: the set of keywords and comparative phrases used in the corpus. Return type: set(str)
-
keywords_readme
()[source]¶ Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).
-
raw
(fileids=None)[source]¶ Parameters: fileids – a list or regexp specifying the fileids that have to be returned as a raw string. Returns: the given file(s) as a single string. Return type: str
-
sents
(fileids=None)[source]¶ Return all sentences in the corpus.
Parameters: fileids – a list or regexp specifying the ids of the files whose sentences have to be returned. Returns: all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified). Return type: list(list(str)) or list(str)
-
nltk.corpus.reader.conll module¶
Read CoNLL-style chunk fileids.
-
class
nltk.corpus.reader.conll.
ConllChunkCorpusReader
(root, fileids, chunk_types, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.conll.ConllCorpusReader
A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
-
class
nltk.corpus.reader.conll.
ConllCorpusReader
(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.Tree'>, tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus.
- @todo: Add support for reading from corpora where different parallel files contain different columns.
- @todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (e.g. words() and parsed_sents() could both share the same grid corpus view object).
- @todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (e.g. parsed_documents()).
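The grid layout described for CoNLL-style files can be read with a few lines of plain Python. This is a sketch of the idea rather than the reader's implementation; `read_conll` is a hypothetical helper and the sample text is made up.

```python
import re


def read_conll(text, columntypes):
    """Return one list per sentence, each holding one dict per word."""
    sentences = []
    # Sentences are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        # Each line is a word; each column is named by columntypes.
        sentences.append([dict(zip(columntypes, row)) for row in rows])
    return sentences


sample = "He PRP B-NP\nreckons VBZ B-VP\n\nIt PRP B-NP\nworks VBZ B-VP\n"
sents = read_conll(sample, ("words", "pos", "chunk"))
words = [[tok["words"] for tok in sent] for sent in sents]
```

Changing the `columntypes` tuple is enough to reinterpret the same grid for a corpus with a different column order.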
-
CHUNK
= 'chunk'¶ column type for chunk structures
-
COLUMN_TYPES
= ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')¶ A list of all column types supported by the conll corpus reader.
-
IGNORE
= 'ignore'¶ column type for column that should be ignored
-
NE
= 'ne'¶ column type for named entities
-
POS
= 'pos'¶ column type for part-of-speech tags
-
SRL
= 'srl'¶ column type for semantic role labels
-
TREE
= 'tree'¶ column type for parse trees
-
WORDS
= 'words'¶ column type for words
-
iob_sents
(fileids=None, tagset=None)[source]¶ Returns: a list of lists of word/tag/IOB tuples Return type: list(list) Parameters: fileids (None or str or list) – the list of fileids that make up this corpus
-
class
nltk.corpus.reader.conll.
ConllSRLInstance
(tree, verb_head, verb_stem, roleset, tagged_spans)[source]¶ Bases:
object
An SRL instance from a CoNLL corpus, which identifies and provides labels for the arguments of a single verb.
-
arguments
= None¶ A list of (argspan, argid) tuples, specifying the location and type for each of the arguments identified by this instance. argspan is a tuple (start, end), indicating that the argument consists of words[start:end].
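The span convention can be checked with ordinary list slicing. The sentence and the role labels below are made up for illustration.

```python
# Hypothetical sentence and (argspan, argid) pairs in the shape described above.
words = ["She", "turned", "on", "the", "light"]
arguments = [((0, 1), "A0"), ((3, 5), "A1")]

# Each span selects words[start:end], so the end index is exclusive.
labelled = [(argid, words[start:end]) for (start, end), argid in arguments]
```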
-
tagged_spans
= None¶ A list of
(span, id)
tuples, specifying the location and type for each of the arguments, as well as the verb pieces, that make up this instance.
-
tree
= None¶ The parse tree for the sentence containing this instance.
-
unicode_repr
()¶
-
verb
= None¶ A list of the word indices of the words that compose the verb whose arguments are identified by this instance. This will contain multiple word indices when multi-word verbs are used (e.g. ‘turn on’).
-
verb_head
= None¶ The word index of the head word of the verb whose arguments are identified by this instance. E.g., for a sentence that uses the verb ‘turn on,’
verb_head
will be the word index of the word ‘turn’.
-
words
= None¶ A list of the words in the sentence containing this instance.
-
nltk.corpus.reader.crubadan module¶
An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan.
There are multiple potential applications for the data but this reader was created with the goal of using it in the context of language identification.
For details about An Crubadan, this data, and its potential uses, see: http://borel.slu.edu/crubadan/index.html
-
class
nltk.corpus.reader.crubadan.
CrubadanCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader used to access language An Crubadan n-gram files.
nltk.corpus.reader.dependency module¶
-
class
nltk.corpus.reader.dependency.
DependencyCorpusReader
(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block>)[source]¶
nltk.corpus.reader.framenet module¶
Corpus reader for the FrameNet 1.7 lexicon and corpus.
-
class
nltk.corpus.reader.framenet.
AttrDict
(*args, **kwargs)[source]¶ Bases:
dict
A class that wraps a dict and allows accessing the keys of the dict as if they were attributes. Taken from here:
>>> foo = {'a':1, 'b':2, 'c':3}
>>> bar = AttrDict(foo)
>>> pprint(dict(bar))
{'a': 1, 'b': 2, 'c': 3}
>>> bar.b
2
>>> bar.d = 4
>>> pprint(dict(bar))
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
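One way to obtain this attribute-style access is to route attribute lookups to the dict itself. The class below is a minimal sketch under that assumption; `AttrDictSketch` is a hypothetical name and NLTK's own AttrDict may be implemented differently.

```python
class AttrDictSketch(dict):
    """A dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Store attributes as ordinary dict items.
        self[name] = value


bar = AttrDictSketch({'a': 1, 'b': 2, 'c': 3})
bar.d = 4
```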
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.framenet.
FramenetCorpusReader
(root, fileids)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
A corpus reader for the Framenet Corpus.
>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
-
annotations
(luNamePattern=None, exemplars=True, full_text=True)[source]¶ Frame annotation sets matching the specified criteria.
-
doc
(fn_docid)[source]¶ Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the docs_metadata() function. The dict that is returned from this function will contain the following keys:
‘_type’ : ‘fulltextannotation’
- ‘sentence’ : a list of sentences in the document
- Each item in the list is a dict containing the following keys:
‘ID’ : the ID number of the sentence
‘_type’ : ‘sentence’
‘text’ : the text of the sentence
‘paragNo’ : the paragraph number
‘sentNo’ : the sentence number
‘docID’ : the document ID number
‘corpID’ : the corpus ID number
‘aPos’ : the annotation position
- ‘annotationSet’ : a list of annotation layers for the sentence
- Each item in the list is a dict containing the following keys:
‘ID’ : the ID number of the annotation set
‘_type’ : ‘annotationset’
‘status’ : either ‘MANUAL’ or ‘UNANN’
‘luName’ : (only if status is ‘MANUAL’)
‘luID’ : (only if status is ‘MANUAL’)
‘frameID’ : (only if status is ‘MANUAL’)
‘frameName’: (only if status is ‘MANUAL’)
- ‘layer’ : a list of labels for the layer
Each item in the layer is a dict containing the following keys:
- ‘_type’: ‘layer’
- ‘rank’
- ‘name’
- ‘label’ : a list of labels in the layer
- Each item is a dict containing the following keys:
- ‘start’
- ‘end’
- ‘name’
- ‘feID’ (optional)
Parameters: fn_docid (int) – The Framenet id number of the document Returns: Information about the annotated document Return type: dict
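The nesting described above (sentences, then annotation sets, then layers, then labels) can be walked with four loops. The sketch below runs on a hand-made toy dict that only mimics the documented key layout; the values in it (IDs, the layer name 'PENN', the label 'NNP') are invented for illustration and are not FrameNet data. The helper name `iter_labels` is hypothetical.

```python
def iter_labels(document):
    """Yield (sentence ID, layer name, label name, start, end) for every label."""
    for sentence in document['sentence']:
        for annotation_set in sentence['annotationSet']:
            for layer in annotation_set['layer']:
                for label in layer['label']:
                    yield (sentence['ID'], layer['name'], label['name'],
                           label['start'], label['end'])


# Toy document: same key layout as documented, invented values.
toy_doc = {
    '_type': 'fulltextannotation',
    'sentence': [{
        'ID': 1, '_type': 'sentence', 'text': 'Michelle baked the potatoes .',
        'annotationSet': [{
            'ID': 10, '_type': 'annotationset', 'status': 'UNANN',
            'layer': [{
                '_type': 'layer', 'rank': 1, 'name': 'PENN',
                'label': [{'start': 0, 'end': 7, 'name': 'NNP'}],
            }],
        }],
    }],
}

labels = list(iter_labels(toy_doc))
print(labels)
```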
-
docs
(name=None)[source]¶ Return a list of the annotated full-text documents in FrameNet, optionally filtered by a regex to be matched against the document name.
-
docs_metadata
(name=None)[source]¶ Return an index of the annotated documents in Framenet.
Details for a specific annotated document can be obtained by passing the value of its ‘ID’ field to this class’s doc() function.
>>> from nltk.corpus import framenet as fn
>>> len(fn.docs()) in (78, 107) # FN 1.5 and 1.7, resp.
True
>>> set([x.corpname for x in fn.docs_metadata()])>=set(['ANC', 'KBEval', 'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank'])
True
Parameters: name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”.
Returns: A list of selected (or all) annotated documents
Return type: list of dicts, where each dict object contains the following keys:
- ‘name’
- ‘ID’
- ‘corpid’
- ‘corpname’
- ‘description’
- ‘filename’
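The file-name convention described for the name parameter (corpus name, two underscores, document name) can be taken apart with a single string split. A minimal sketch; the helper name `split_doc_filename` is hypothetical and not part of NLTK.

```python
def split_doc_filename(filename):
    """Split 'corpus__document' into (corpus name, document name)."""
    corpus_name, separator, doc_name = filename.partition('__')
    if not separator:
        raise ValueError('no double underscore in %r' % filename)
    return corpus_name, doc_name


parts = split_doc_filename('LUCorpus-v0.3__20000410_nyt-NEW.xml')
print(parts)
```

`str.partition` splits on the first "__" only, so a document name that itself contains double underscores stays intact.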
-
exemplars
(luNamePattern=None, frame=None, fe=None, fe2=None)[source]¶ Lexicographic exemplar sentences, optionally filtered by LU name and/or 1-2 FEs that are realized overtly. ‘frame’ may be a name pattern, frame ID, or frame instance. ‘fe’ may be a name pattern or FE instance; if specified, ‘fe2’ may also be specified to retrieve sentences with both overt FEs (in either order).
-
fe_relations
()[source]¶ Obtain a list of frame element relations.
>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels) in (10020, 12393) # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(ferels[0], breakLines=True)
{'ID': 14642,
 '_type': 'ferelation',
 'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>,
 'subFE': <fe ID=11370 name=Degree>,
 'subFEName': 'Degree',
 'subFrame': <frame ID=1904 name=Lively_place>,
 'subID': 11370,
 'supID': 2271,
 'superFE': <fe ID=2271 name=Degree>,
 'superFEName': 'Degree',
 'superFrame': <frame ID=262 name=Abounding_with>,
 'type': <framerelationtype ID=1 name=Inheritance>}
Returns: A list of all of the frame element relations in framenet Return type: list(dict)
-
fes
(name=None, frame=None)[source]¶ Lists frame element objects. If ‘name’ is provided, this is treated as a case-insensitive regular expression to filter by frame element name. (Case-insensitivity is because casing of frame element names is not always consistent across frames.) Specify ‘frame’ to filter by a frame name pattern, ID, or object.
>>> from nltk.corpus import framenet as fn
>>> fn.fes('Noise_maker')
[<fe ID=6043 name=Noise_maker>]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound')])
[('Cause_to_make_noise', 'Sound_maker'), ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source'), ('Sound_movement', 'Location_of_sound_source'),
 ('Sound_movement', 'Sound'), ('Sound_movement', 'Sound_source'),
 ('Sounds', 'Component_sound'), ('Sounds', 'Location_of_sound_source'),
 ('Sounds', 'Sound_source'), ('Vocalizations', 'Location_of_sound_source'),
 ('Vocalizations', 'Sound_source')]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound',r'(?i)make_noise')])
[('Cause_to_make_noise', 'Sound_maker'),
 ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source')]
>>> sorted(set(fe.name for fe in fn.fes('^sound')))
['Sound', 'Sound_maker', 'Sound_source']
>>> len(fn.fes('^sound$'))
2
Parameters: name (str) – A regular expression pattern used to match against frame element names. If ‘name’ is None, then a list of all frame elements will be returned. Returns: A list of matching frame elements Return type: list(AttrDict)
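The matching behaviour in the examples above ('sound' matching 'Sound_maker', '^sound$' matching only the exact name) is consistent with a case-insensitive regular-expression search. The sketch below reproduces that behaviour on a plain list of names; it assumes search (not full-match) semantics, and the helper `filter_names` is hypothetical rather than the reader's own code.

```python
import re


def filter_names(names, pattern=None):
    """Return the names that `pattern` matches anywhere, ignoring case."""
    if pattern is None:
        return list(names)
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if regex.search(name)]


fe_names = ['Sound', 'Sound_maker', 'Sound_source',
            'Location_of_sound_source', 'Component_sound', 'Noise_maker']

anchored = filter_names(fe_names, '^sound')
exact = filter_names(fe_names, '^sound$')
anywhere = filter_names(fe_names, 'sound')
print(anchored, exact, len(anywhere))
```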
-
frame
(fn_fid_or_fname, ignorekeys=[])[source]¶ Get the details for the specified Frame using the frame’s name or id number.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation')
frame (1494): Imposing_obligation...
The dict that is returned from this function will contain the following information about the Frame:
‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)
‘definition’ : textual definition of the Frame
‘ID’ : the internal ID number of the Frame
- ‘semTypes’ : a list of semantic types for this frame
- Each item in the list is a dict containing the following keys:
- ‘name’ : can be used with the semtype() function
- ‘ID’ : can be used with the semtype() function
- ‘lexUnit’ : a dict containing all of the LUs for this frame.
The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)
- ‘FE’ : a dict containing the Frame Elements that are part of this frame
The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys
- ‘definition’ : The definition of the FE
- ‘name’ : The name of the FE e.g. ‘Body_system’
- ‘ID’ : The id number
- ‘_type’ : ‘fe’
- ‘abbrev’ : Abbreviation e.g. ‘bod’
- ‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”
- ‘semType’ : if not None, a dict with the following two keys:
- ‘name’ : name of the semantic type. can be used with
- the semtype() function
- ‘ID’ : id number of the semantic type. can be used with
- the semtype() function
- ‘requiresFE’ : if not None, a dict with the following two keys:
- ‘name’ : the name of another FE in this frame
- ‘ID’ : the id of the other FE in this frame
- ‘excludesFE’ : if not None, a dict with the following two keys:
- ‘name’ : the name of another FE in this frame
- ‘ID’ : the id of the other FE in this frame
‘frameRelation’ : a list of objects describing frame relations
- ‘FEcoreSets’ : a list of Frame Element core sets for this frame
- Each item in the list is a list of FE objects
Parameters: Returns: Information about a frame
Return type:
-
frame_by_id
(fn_fid, ignorekeys=[])[source]¶ Get the details for the specified Frame using the frame’s id number.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Returns: Information about a frame
Return type: dict
Also see the frame() function for details about what is contained in the dict that is returned.
-
frame_by_name
(fn_fname, ignorekeys=[], check_cache=True)[source]¶ Get the details for the specified Frame using the frame’s name.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Returns: Information about a frame
Return type: dict
Also see the frame() function for details about what is contained in the dict that is returned.
-
frame_ids_and_names
(name=None)[source]¶ Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.
-
frame_relation_types
()[source]¶ Obtain a list of frame relation types.
>>> from nltk.corpus import framenet as fn
>>> frts = list(fn.frame_relation_types())
>>> isinstance(frts, list)
True
>>> len(frts) in (9, 10) # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1,
 '_type': 'framerelationtype',
 'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...],
 'name': 'Inheritance',
 'subFrameName': 'Child',
 'superFrameName': 'Parent'}
Returns: A list of all of the frame relation types in framenet Return type: list(dict)
-
frame_relations
(frame=None, frame2=None, type=None)[source]¶ Parameters: - frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned
- frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction
- type – (optional) frame relation type (name or object); show only relations of this type
Returns: A list of all of the frame relations in framenet
Return type: list(dict)
>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels) in (1676, 2070) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(274), breakLines=True)
[<Parent=Avoiding -- Inheritance -> Child=Dodging>,
 <Parent=Avoiding -- Inheritance -> Child=Evading>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True)
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
-
frames
(name=None)[source]¶ Obtain details for all frames, or for those frames whose names match a pattern.
>>> from nltk.corpus import framenet as fn
>>> len(fn.frames()) in (1019, 1221) # FN 1.5 and 1.7, resp.
True
>>> x = PrettyList(fn.frames(r'(?i)crim'), maxReprSize=0, breakLines=True)
>>> x.sort(key=lambda f: f.ID)
>>> x
[<frame ID=200 name=Criminal_process>,
 <frame ID=500 name=Criminal_investigation>,
 <frame ID=692 name=Crime_scenario>,
 <frame ID=700 name=Committing_crime>]
A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):
A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.
We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).
FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:
- Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
- Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
- Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.
- Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.
Parameters: name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned. Returns: A list of matching Frames (or all Frames). Return type: list(AttrDict)
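The relation types listed above can be pictured as typed parent-to-child links. The sketch below encodes only the examples named in the text as plain tuples and filters them by frame and relation type, loosely echoing the frame and type arguments of frame_relations(). The `Relation` tuple and `relations_for` helper are hypothetical illustrations, not NLTK objects.

```python
from collections import namedtuple

Relation = namedtuple('Relation', 'parent rel_type child')

# Only the example relations mentioned in the text above.
RELATIONS = [
    Relation('Rewards_and_punishments', 'Inheritance', 'Revenge'),
    Relation('Motion', 'Using', 'Speed'),
    Relation('Criminal_process', 'Subframe', 'Arrest'),
    Relation('Criminal_process', 'Subframe', 'Arraignment'),
    Relation('Criminal_process', 'Subframe', 'Trial'),
    Relation('Criminal_process', 'Subframe', 'Sentencing'),
    Relation('Employment_start', 'Perspective_on', 'Hiring'),
    Relation('Employment_start', 'Perspective_on', 'Get_a_job'),
]


def relations_for(frame=None, rel_type=None):
    """Relations involving `frame` as parent or child, optionally of one type."""
    return [r for r in RELATIONS
            if (frame is None or frame in (r.parent, r.child))
            and (rel_type is None or r.rel_type == rel_type)]


subframes = [r.child for r in relations_for('Criminal_process', rel_type='Subframe')]
print(subframes)
```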
-
frames_by_lemma
(pat)[source]¶ Returns a list of all frames that contain LUs in which the name attribute of the LU matches the given regular expression pat. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).
Note: if you are going to be doing a lot of this type of searching, you’d want to build an index that maps from lemmas to frames because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db.
>>> from nltk.corpus import framenet as fn
>>> fn.frames_by_lemma(r'(?i)a little')
[<frame ID=189 name=Quanti...>, <frame ID=2001 name=Degree>]
Returns: A list of frame objects. Return type: list(AttrDict)
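The note about building a lemma-to-frame index can be sketched as follows. The code works on plain (LU name, frame name) pairs so that it runs without the corpus; the pairs used are ones mentioned elsewhere on this page. In practice such pairs could plausibly be collected once from fn.lus() via each LU's name and frame.name, but that wiring is an assumption and is not shown. The helper `build_lemma_index` is hypothetical.

```python
from collections import defaultdict


def build_lemma_index(lu_frame_pairs):
    """Map each lemma to the set of frame names it evokes."""
    index = defaultdict(set)
    for lu_name, frame_name in lu_frame_pairs:
        # LU names look like 'lemma.POS'; split on the last dot only.
        lemma, _, _pos = lu_name.rpartition('.')
        index[lemma].add(frame_name)
    return index


pairs = [
    ('bake.v', 'Apply_heat'),
    ('bake.v', 'Cooking_creation'),
    ('bake.v', 'Absorb_heat'),
    ('foresee.v', 'Expectation'),
    ('guess.v', 'Coming_to_believe'),
]
index = build_lemma_index(pairs)
print(sorted(index['bake']))
```

Once built, a lookup is a single dictionary access instead of a scan over every frame file.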
-
ft_sents
(docNamePattern=None)[source]¶ Full-text annotation sentences, optionally filtered by document name.
-
lu
(fn_luid, ignorekeys=[], luName=None, frameID=None, frameName=None)[source]¶ Access a lexical unit by its ID. luName, frameID, and frameName are used only in the event that the LU does not have a file in the database (which is the case for LUs with “Problem” status); in this case, a placeholder LU is created which just contains its name, ID, and frame.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> pprint(list(map(PrettyDict, fn.lu(256).lexemes)))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]
>>> fn.lu(227).exemplars[23]
exemplar sentence (352962):
[sentNo] 0
[aPos] 59699508
[LU] (227) guess.v in Coming_to_believe
[frame] (23) Coming_to_believe
[annotationSet] 2 annotation sets
[POS] 18 tags
[POS_tagset] BNC
[GF] 3 relations
[PT] 3 phrases
[Other] 1 entry
[text] + [Target] + [FE]
When he was inside the house , Culley noticed the characteristic
                                              ------------------
                                              Content
he would n't have guessed at .
--                ******* --
Co                        C1 [Evidence:INI]
 (Co=Cognizer, C1=Content)
The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:
‘name’ : the name of the LU (e.g. ‘merger.n’)
‘definition’ : textual definition of the LU
‘ID’ : the internal ID number of the LU
‘_type’ : ‘lu’
‘status’ : e.g. ‘Created’
‘frame’ : Frame that this LU belongs to
‘POS’ : the part of speech of this LU (e.g. ‘N’)
‘totalAnnotated’ : total number of examples annotated with this LU
‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)
- ‘sentenceCount’ : a dict with the following two keys:
- ‘annotated’: number of sentences annotated with this LU
- ‘total’ : total number of sentences with this LU
- ‘lexemes’ : a list of dicts describing the lemma of this LU.
Each dict in the list contains these keys:
- ‘POS’ : part of speech e.g. ‘N’
- ‘name’ : either single-lexeme e.g. ‘merger’ or multi-lexeme e.g. ‘a little’
- ‘order’: the order of the lexeme in the lemma (starting from 1)
- ‘headword’: a boolean (‘true’ or ‘false’)
- ‘breakBefore’: Can this lexeme be separated from the previous lexeme? Consider: “take over.v” as in:
Germany took over the Netherlands in 2 days.
Germany took the Netherlands over in 2 days.
In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:
Mary takes after her grandmother.
*Mary takes her grandmother after.
In this case, ‘breakBefore’ would be “false” for the lexeme “after”
‘lemmaID’ : Can be used to connect lemmas in different LUs
‘semTypes’ : a list of semantic type objects for this LU
- ‘subCorpus’ : a list of subcorpora
- Each item in the list is a dict containing the following keys:
- ‘name’ :
- ‘sentence’ : a list of sentences in the subcorpus
- each item in the list is a dict with the following keys:
- ‘ID’:
- ‘sentNo’:
- ‘text’: the text of the sentence
- ‘aPos’:
- ‘annotationSet’: a list of annotation sets
- each item in the list is a dict with the following keys:
- ‘ID’:
- ‘status’:
- ‘layer’: a list of layers
- each layer is a dict containing the following keys:
- ‘name’: layer name (e.g. ‘BNC’)
- ‘rank’:
- ‘label’: a list of labels for the layer
- each label is a dict containing the following keys:
- ‘start’: start pos of label in sentence ‘text’ (0-based)
- ‘end’: end pos of label in sentence ‘text’ (0-based)
- ‘name’: name of label (e.g. ‘NN1’)
Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.
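The ‘lexemes’ entries described above carry enough information to rebuild a lemma and to tell whether it can be split. The sketch below uses hand-written lexeme dicts for the “take over” and “take after” examples from the text, with only the keys it needs; the two helper names are hypothetical. Note that the documented ‘breakBefore’ values are the strings 'true' and 'false', not Python booleans.

```python
def lemma_from_lexemes(lexemes):
    """Join lexeme names in their 'order' to rebuild the lemma string."""
    ordered = sorted(lexemes, key=lambda lex: lex['order'])
    return ' '.join(lex['name'] for lex in ordered)


def is_separable(lexemes):
    """True if any lexeme may be split from the one before it."""
    # Compare against the string 'true'; a non-empty 'false' would be truthy.
    return any(lex['breakBefore'] == 'true' for lex in lexemes)


# Hand-written examples following the text, deliberately listed out of order.
take_over = [
    {'name': 'over', 'order': 2, 'breakBefore': 'true'},
    {'name': 'take', 'order': 1, 'breakBefore': 'false'},
]
take_after = [
    {'name': 'take', 'order': 1, 'breakBefore': 'false'},
    {'name': 'after', 'order': 2, 'breakBefore': 'false'},
]

print(lemma_from_lexemes(take_over), is_separable(take_over))
print(lemma_from_lexemes(take_after), is_separable(take_after))
```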
Parameters: Returns: All information about the lexical unit
Return type:
-
lu_basic
(fn_luid)[source]¶ Returns basic information about the LU whose id is fn_luid. This is basically just a wrapper around the lu() function with “subCorpus” info excluded.
>>> from nltk.corpus import framenet as fn
>>> lu = PrettyDict(fn.lu_basic(256), breakLines=True)
>>> # ellipses account for differences between FN 1.5 and 1.7
>>> lu
{'ID': 256,
 'POS': 'V',
 'URL': u'https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu256.xml',
 '_type': 'lu',
 'cBy': ...,
 'cDate': '02/08/2001 01:27:50 PST Thu',
 'definition': 'COD: be aware of beforehand; predict.',
 'definitionMarkup': 'COD: be aware of beforehand; predict.',
 'frame': <frame ID=26 name=Expectation>,
 'lemmaID': 15082,
 'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}],
 'name': 'foresee.v',
 'semTypes': [],
 'sentenceCount': {'annotated': ..., 'total': ...},
 'status': 'FN1_Sent'}
Parameters: fn_luid (int) – The id number of the desired LU Returns: Basic information about the lexical unit Return type: dict
-
lu_ids_and_names
(name=None)[source]¶ Uses the LU index, which is much faster than looking up each LU definition if only the names and IDs are needed.
-
lus
(name=None, frame=None)[source]¶ Obtain details for lexical units. Optionally restrict by lexical unit name pattern, and/or to a certain frame or frames whose name matches a pattern.
>>> from nltk.corpus import framenet as fn
>>> len(fn.lus()) in (11829, 13572) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.lus(r'(?i)a little'), maxReprSize=0, breakLines=True)
[<lu ID=14744 name=a little bit.adv>,
 <lu ID=14733 name=a little.n>,
 <lu ID=14743 name=a little.adv>]
>>> fn.lus(r'interest', r'(?i)stimulus')
[<lu ID=14920 name=interesting.a>, <lu ID=14894 name=interested.a>]
A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):
A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.
We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:
- Apply_heat: “Michelle baked the potatoes for 45 minutes.”
- Cooking_creation: “Michelle baked her mother a cake for her birthday.”
- Absorb_heat: “The potatoes have to bake for more than 30 minutes.”
These constitute three different LUs, with different definitions.
Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.
Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.
Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.
In the simplest case, frame-evoking words are verbs such as “fried” in:
“Matilde fried the catfish in a heavy iron skillet.”
Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:
“...the reduction of debt levels to $665 million from $2.6 billion.”
Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:
“They were asleep for hours.”
Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.
Parameters: name (str) – A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows the dot. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned.
The valid POSes are:
v - verb
n - noun
a - adjective
adv - adverb
prep - preposition
num - numbers
intj - interjection
art - article
c - conjunction
scon - subordinating conjunction
Returns: A list of selected (or all) lexical units
Return type: list of LU objects (dicts)
See the lu() function for info about the specifics of LU objects.
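The dotted LU-name form and the POS abbreviations listed above can be combined into a small lookup. A minimal sketch; `POS_NAMES` simply transcribes the list above, and `describe_lu_name` is a hypothetical helper, not an NLTK function.

```python
# Transcribed from the list of valid POSes above.
POS_NAMES = {
    'v': 'verb', 'n': 'noun', 'a': 'adjective', 'adv': 'adverb',
    'prep': 'preposition', 'num': 'numbers', 'intj': 'interjection',
    'art': 'article', 'c': 'conjunction', 'scon': 'subordinating conjunction',
}


def describe_lu_name(lu_name):
    """Split 'lemma.POS' on the last dot and expand the POS abbreviation."""
    lemma, _, pos = lu_name.rpartition('.')
    return lemma, POS_NAMES[pos]


print(describe_lu_name('a little.adv'))
print(describe_lu_name('run.v'))
```

An abbreviation outside the list raises `KeyError`, which is a deliberate choice here to surface unexpected names early.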
-
propagate_semtypes
()[source]¶ Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)
>>> from nltk.corpus import framenet as fn
>>> x = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> fn.propagate_semtypes()
>>> y = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> y-x > 1000
True
-
semtype
(key)[source]¶
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
Parameters: key (string or int) – The name, abbreviation, or id number of the semantic type Returns: Information about a semantic type Return type: dict
-
semtypes
()[source]¶ Obtain a list of semantic types.
>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes) in (73, 109) # FN 1.5 and 1.7, resp.
True
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'definitionMarkup', 'name', 'rootType', 'subTypes', 'superType']
Returns: A list of all of the semantic types in framenet Return type: list(dict)
-
warnings
(v)[source]¶ Enable or disable warnings of data integrity issues as they are encountered. If v is truthy, warnings will be enabled.
(This is a function rather than just an attribute/property to ensure that if enabling warnings is the first action taken, the corpus reader is instantiated first.)
-
-
exception
nltk.corpus.reader.framenet.
FramenetError
[source]¶ Bases:
Exception
An exception class for framenet-related errors.
-
class
nltk.corpus.reader.framenet.
Future
(loader, *args, **kwargs)[source]¶ Bases:
object
Wraps and acts as a proxy for a value to be loaded lazily (on demand). Adapted from https://gist.github.com/sergey-miryanov/2935416
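The lazy-loading idea can be illustrated with a small proxy that calls its loader only on first use and then forwards attribute access to the loaded value. This is a simplified sketch under that assumption, not the class documented here; `LazyProxy` and `load_words` are hypothetical names.

```python
class LazyProxy:
    """Hypothetical sketch: defer calling `loader` until the value is needed."""

    def __init__(self, loader, *args, **kwargs):
        self._loader = loader
        self._args = args
        self._kwargs = kwargs
        self._value = None
        self._loaded = False

    def _resolve(self):
        # Run the loader at most once and cache its result.
        if not self._loaded:
            self._value = self._loader(*self._args, **self._kwargs)
            self._loaded = True
        return self._value

    def __getattr__(self, name):
        # Reached only for names not found on the proxy itself.
        return getattr(self._resolve(), name)


calls = []


def load_words(text):
    calls.append(text)  # record each real load
    return text.split()


proxy = LazyProxy(load_words, 'bake blanch boil')
calls_before = len(calls)
boil_count = proxy.count('boil')   # first use triggers the load
bake_index = proxy.index('bake')   # second use reuses the cached list
print(calls_before, boil_count, bake_index, len(calls))
```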
-
class
nltk.corpus.reader.framenet.
PrettyDict
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.framenet.AttrDict
Displays an abbreviated repr of values where possible. Inherits from AttrDict, so a callable value will be lazily converted to an actual value.
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.framenet.
PrettyLazyConcatenation
(list_of_lists)[source]¶ Bases:
nltk.collections.LazyConcatenation
Displays an abbreviated repr of only the first several elements, not the whole list.
-
unicode_repr
()¶ Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.
-
-
class
nltk.corpus.reader.framenet.
PrettyLazyIteratorList
(it, known_len=None)[source]¶ Bases:
nltk.collections.LazyIteratorList
Displays an abbreviated repr of only the first several elements, not the whole list.
-
unicode_repr
()¶ Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.
-
-
class
nltk.corpus.reader.framenet.
PrettyLazyMap
(function, *lists, **config)[source]¶ Bases:
nltk.collections.LazyMap
Displays an abbreviated repr of only the first several elements, not the whole list.
-
unicode_repr
()¶ Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.
-
-
class
nltk.corpus.reader.framenet.
PrettyList
(*args, **kwargs)[source]¶ Bases:
list
Displays an abbreviated repr of only the first several elements, not the whole list.
-
unicode_repr
()¶ Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.
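The truncation rule described above (a list-like representation cut short once it would exceed 60 characters) can be sketched as below. The function name `abbreviated_repr`, the bookkeeping constants and the exact cut-off handling are assumptions made for illustration; the classes above may differ in detail.

```python
def abbreviated_repr(items, max_len=60):
    """List-style repr that stops adding elements once it grows past max_len."""
    pieces = []
    length = 5  # rough allowance for the brackets and the trailing ellipsis
    for item in items:
        pieces.append(repr(item))
        length += len(pieces[-1]) + 2  # element plus ', ' separator
        if length > max_len and len(pieces) > 2:
            # Drop the element that crossed the limit and mark the cut.
            return '[%s, ...]' % ', '.join(pieces[:-1])
    return '[%s]' % ', '.join(pieces)


short_repr = abbreviated_repr([1, 2, 3])
long_repr = abbreviated_repr(range(100))
print(short_repr)
print(long_repr)
```

Because the loop returns as soon as the limit is crossed, only the first few elements of a lazy sequence are ever evaluated.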
-
nltk.corpus.reader.ieer module¶
Corpus reader for the Information Extraction and Entity Recognition Corpus.
NIST 1999 Information Extraction: Entity Recognition Evaluation http://www.itl.nist.gov/iad/894.01/tests/ie-er/er_99/er_99.htm
This corpus contains the NEWSWIRE development test data for the NIST 1999 IE-ER Evaluation. The files were taken from the subdirectory: /ie_er_99/english/devtest/newswire/*.ref.nwt and filenames were shortened.
The corpus contains the following files: APW_19980314, APW_19980424, APW_19980429, NYT_19980315, NYT_19980403, and NYT_19980407.
-
class
nltk.corpus.reader.ieer.
IEERCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶
-
class
nltk.corpus.reader.ieer.
IEERDocument
(text, docno=None, doctype=None, date_time=None, headline='')[source]¶ Bases:
object
-
unicode_repr
()¶
-
-
nltk.corpus.reader.ieer.
documents
= ['APW_19980314', 'APW_19980424', 'APW_19980429', 'NYT_19980315', 'NYT_19980403', 'NYT_19980407']¶ A list of all documents in this corpus.
-
nltk.corpus.reader.ieer.
titles
= {'APW_19980424': 'Associated Press Weekly, 24 April 1998', 'APW_19980314': 'Associated Press Weekly, 14 March 1998', 'APW_19980429': 'Associated Press Weekly, 29 April 1998', 'NYT_19980407': 'New York Times, 7 April 1998', 'NYT_19980315': 'New York Times, 15 March 1998', 'NYT_19980403': 'New York Times, 3 April 1998'}¶ A dictionary whose keys are the names of documents in this corpus; and whose values are descriptions of those documents’ contents.
nltk.corpus.reader.indian module¶
Indian Language POS-Tagged Corpus. Collected by A Kumaran, Microsoft Research, India. Distributed with permission.
- Contents:
- Bangla: IIT Kharagpur
- Hindi: Microsoft Research India
- Marathi: IIT Bombay
- Telugu: IIIT Hyderabad
-
class
nltk.corpus.reader.indian.
IndianCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
List of words, one per line. Blank lines are ignored.
nltk.corpus.reader.ipipan module¶
-
class
nltk.corpus.reader.ipipan.
IPIPANCorpusReader
(root, fileids)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.
The corpus includes information about text domain, channel and categories. You can access possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g.: fileids(channel='prasa'), fileids(categories='publicystyczny').
The reader supports methods: words, sents, paras and their tagged versions. You can get the part of speech instead of the full tag by giving the “simplify_tags=True” parameter, e.g.: tagged_sents(simplify_tags=True).
You can also get all disambiguated tags by specifying the parameter “one_tag=False”, e.g.: tagged_paras(one_tag=False).
You can get all tags that were assigned by a morphological analyzer by specifying the parameter “disamb_only=False”, e.g. tagged_words(disamb_only=False).
The IPIPAN Corpus contains tags indicating if there is a space between two tokens. To add special “no space” markers, you should specify the parameter “append_no_space=True”, e.g. tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, a new pair (‘’, ‘no-space’) will be inserted (for tagged data), and just ‘’ for methods without tags.
The corpus reader can also try to append spaces between words. To enable this option, specify the parameter “append_space=True”, e.g. words(append_space=True). As a result, either ‘ ‘ or (‘ ‘, ‘space’) will be inserted between tokens.
By default, XML entities like &quot; and &amp; are replaced by the corresponding characters. You can turn off this feature by specifying the parameter “replace_xmlentities=False”, e.g. words(replace_xmlentities=False).
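The effect of the “no space” markers can be illustrated on toy data without the corpus. The sketch below is not the reader's code: it assumes each token already carries a flag saying that no space precedes it, and the helper name, the flag and the sample tokens and tags are all made up for illustration.

```python
def with_no_space_markers(tokens, tagged=True):
    """tokens: list of (word, tag, glued); glued means no space precedes the word."""
    out = []
    for word, tag, glued in tokens:
        if glued:
            # Marker goes between the two tokens that are written together.
            out.append(('', 'no-space') if tagged else '')
        out.append((word, tag) if tagged else word)
    return out


# Invented sample: the final full stop is attached to the preceding word.
toy_tokens = [
    ('Ala', 'subst', False),
    ('ma', 'fin', False),
    ('kota', 'subst', False),
    ('.', 'interp', True),
]

tagged_out = with_no_space_markers(toy_tokens, tagged=True)
plain_out = with_no_space_markers(toy_tokens, tagged=False)
print(tagged_out)
print(plain_out)
```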
nltk.corpus.reader.knbc module¶
-
class
nltk.corpus.reader.knbc.
KNBCorpusReader
(root, fileids, encoding='utf8', morphs2str=<function <lambda>>)[source]¶ Bases:
nltk.corpus.reader.api.SyntaxCorpusReader
- This class implements:
- __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
- _read_block, which reads a block from the input stream.
- _word, which takes a block and returns a list of lists of words.
- _tag, which takes a block and returns a list of lists of tagged words.
- _parse, which takes a block and returns a list of parsed sentences.
- The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others ...)
>>> from nltk.corpus.util import LazyCorpusLoader
>>> from nltk.corpus.reader.knbc import KNBCorpusReader
>>> knbc = LazyCorpusLoader(
...     'knbc/corpus1',
...     KNBCorpusReader,
...     r'.*/KN.*',
...     encoding='euc-jp',
... )
>>> len(knbc.sents()[0])
9
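The tag layout described above can be made easier to work with by naming the positions. The following is a minimal sketch; the `KNBTags` name, the `unpack_tags` helper and the sample values are all hypothetical and are not part of the reader.

```python
from collections import namedtuple

# Field names follow the documented tag layout; trailing fields go in `others`.
KNBTags = namedtuple(
    "KNBTags",
    "surface reading lemma pos1 posid1 pos2 posid2 pos3 posid3 others",
)

def unpack_tags(tags):
    """Split a flat tags tuple into the nine named fields plus the remainder."""
    return KNBTags(*tags[:9], others=tuple(tags[9:]))

# Invented example values, not taken from the corpus.
word, tags = ("WORD", ("WORD", "reading", "lemma", "noun", "6",
                       "common", "1", "*", "0", "x", "y"))
info = unpack_tags(tags)
print(info.pos1, info.others)
```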
nltk.corpus.reader.lin module¶
-
class
nltk.corpus.reader.lin.
LinThesaurusCorpusReader
(root, badscore=0.0)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.
-
scored_synonyms
(ngram, fileid=None)[source]¶ Returns a list of scored synonyms (tuples of synonyms and scores) for the current ngram.
Parameters: - ngram (string) – ngram to look up
- fileid (string) – thesaurus fileid to search in. If None, search all fileids.
Returns: If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.
-
similarity
(ngram1, ngram2, fileid=None)[source]¶ Returns the similarity score for two ngrams.
Parameters: - ngram1 (string) – first ngram to compare
- ngram2 (string) – second ngram to compare
- fileid (string) – thesaurus fileid to search in. If None, search all fileids.
Returns: If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.
-
synonyms
(ngram, fileid=None)[source]¶ Returns a list of synonyms for the current ngram.
Parameters: - ngram (string) – ngram to look up
- fileid (string) – thesaurus fileid to search in. If None, search all fileids.
Returns: If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.
-
nltk.corpus.reader.mte module¶
A reader for corpora whose documents are in MTE format.
-
class
nltk.corpus.reader.mte.
MTECorpusReader
(root=None, fileids=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.tagged.TaggedCorpusReader
Reader for corpora following the TEI-p5 XML scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset.
-
lemma_paras
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word, lemma) tuples. Return type: list(list(list(tuple(str, str))))
-
lemma_sents
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of (word, lemma) tuples. Return type: list(list(tuple(str, str)))
-
lemma_words
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of words and punctuation symbols with their corresponding lemmas, encoded as (word, lemma) tuples. Return type: list(tuple(str, str))
-
paras
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
-
raw
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a single string. Return type: str
-
readme
()[source]¶ Prints some information about this corpus. Returns: the content of the attached README file. Return type: str
-
sents
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_paras
(fileids=None, tagset='msd', tags='')[source]¶ Parameters: - fileids – A list specifying the fileids that should be used.
- tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default.
- tags – An MSD tag used as a filter: only parts of the corpus whose tags are equal to or more precise than the given tag are kept.
Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word, tag) tuples.
Return type: list(list(list(tuple(str, str))))
-
tagged_sents
(fileids=None, tagset='msd', tags='')[source]¶ Parameters: - fileids – A list specifying the fileids that should be used.
- tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default.
- tags – An MSD tag used as a filter: only parts of the corpus whose tags are equal to or more precise than the given tag are kept.
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of (word, tag) tuples.
Return type: list(list(tuple(str, str)))
-
tagged_words
(fileids=None, tagset='msd', tags='')[source]¶ Parameters: - fileids – A list specifying the fileids that should be used.
- tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default.
- tags – An MSD tag used as a filter: only parts of the corpus whose tags are equal to or more precise than the given tag are kept.
Returns: the given file(s) as a list of tagged words and punctuation symbols, encoded as (word, tag) tuples.
Return type: list(tuple(str, str))
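One plausible reading of the tags filter is positional prefix matching, sketched below. This is an assumption about the semantics, not the reader's confirmed implementation: '-' in the filter is treated as "any value at this position", and positions past the end of the filter are left unconstrained. The `msd_matches` helper and the sample tags are hypothetical.

```python
import re

def msd_matches(filter_tag, msd_tag):
    """Return True if msd_tag is equal to or more specific than filter_tag.

    Assumption: '-' in the filter is a wildcard for one position, and
    positions beyond the end of the filter are unconstrained.
    """
    pattern = "".join("." if ch == "-" else re.escape(ch) for ch in filter_tag)
    return re.match(pattern, msd_tag) is not None

# Invented (word, tag) pairs for illustration.
tagged = [("cat", "Ncmsn"), ("runs", "Vmip3s"), ("dogs", "Ncmpn")]

# Keep common nouns of any gender in the singular.
singular_nouns = [(w, t) for (w, t) in tagged if msd_matches("Nc-s", t)]
print(singular_nouns)
```

Under this reading, an empty filter matches every tag, which agrees with the default tags=''.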
-
-
class
nltk.corpus.reader.mte.
MTECorpusView
(fileid, tagspec, elt_handler=None)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusView
Class for lazily viewing the MTE Corpus.
-
class
nltk.corpus.reader.mte.
MTEFileReader
(file_path)[source]¶ Bases:
object
Class for loading the content of the multext-east corpus. It parses the xml files and does some tag-filtering depending on the given method parameters.
-
ns
= {'tei': 'http://www.tei-c.org/ns/1.0', 'xml': 'http://www.w3.org/XML/1998/namespace'}¶
-
para_path
= 'TEI/text/body/div/div/p'¶
-
sent_path
= 'TEI/text/body/div/div/p/s'¶
-
tag_ns
= '{http://www.tei-c.org/ns/1.0}'¶
-
word_path
= 'TEI/text/body/div/div/p/s/(w|c)'¶
-
xml_ns
= '{http://www.w3.org/XML/1998/namespace}'¶
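The path and namespace attributes above can be tried out with the standard library's `xml.etree.ElementTree`. The sketch below uses a tiny invented document with the same nesting; the attribute values are made up, and the lookup shown is only an illustration, not necessarily how MTEFileReader applies these paths internally.

```python
import xml.etree.ElementTree as ET

# Namespace map copied from the `ns` attribute above.
ns = {"tei": "http://www.tei-c.org/ns/1.0",
      "xml": "http://www.w3.org/XML/1998/namespace"}

# A tiny invented document with the TEI/text/body/div/div/p/s nesting.
doc = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div><div>
    <p xml:id="p1">
      <s xml:id="s1"><w lemma="the" ana="#Dd">The</w><w lemma="cat" ana="#Ncns">cat</w><c>.</c></s>
    </p>
  </div></div></body></text>
</TEI>"""

root = ET.fromstring(doc)  # root is the TEI element itself

# Each step needs the tei prefix because the document uses a default namespace.
sent_xpath = "tei:text/tei:body/tei:div/tei:div/tei:p/tei:s"
sentences = root.findall(sent_xpath, ns)

# Words (<w>) and punctuation (<c>) are the children of each sentence.
tokens = [child.text for s in sentences for child in s]

# The xml:id attribute is stored under the expanded xml namespace.
sent_ids = [s.get("{http://www.w3.org/XML/1998/namespace}id") for s in sentences]
print(tokens, sent_ids)
```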
-
-
class
nltk.corpus.reader.mte.
MTETagConverter
[source]¶ Bases:
object
Class for converting MSD tags to universal tags. Other conversion options are not currently implemented.
-
mapping_msd_universal
= {'V': 'VERB', '-': 'X', 'Q': 'PRT', 'S': 'ADP', 'C': 'CONJ', '.': '.', 'D': 'DET', 'R': 'ADV', 'N': 'NOUN', 'P': 'PRON', 'M': 'NUM', 'A': 'ADJ'}¶
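A minimal sketch of how such a mapping might be applied, assuming the category is the first character of the MSD tag (after an optional leading '#') and that unknown categories fall back to 'X'. The `msd_to_universal` function here is written for illustration and is not claimed to match the converter's own method.

```python
# Mapping copied from mapping_msd_universal above.
MSD_TO_UNIVERSAL = {
    "A": "ADJ", "C": "CONJ", "D": "DET", "M": "NUM", "N": "NOUN",
    "P": "PRON", "Q": "PRT", "R": "ADV", "S": "ADP", "V": "VERB",
    ".": ".", "-": "X",
}

def msd_to_universal(msd_tag):
    """Map an MSD tag to a universal tag via its first character.

    Assumption: the category is the first character (after an optional
    leading '#'), and unknown or empty tags fall back to 'X'.
    """
    tag = msd_tag.lstrip("#")
    key = tag[0] if tag else "-"
    return MSD_TO_UNIVERSAL.get(key, "X")

print(msd_to_universal("Ncmsn"), msd_to_universal("#Vmip3s"))
```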
-
nltk.corpus.reader.nkjp module¶
-
class
nltk.corpus.reader.nkjp.
NKJPCorpusReader
(root, fileids='.*')[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
-
HEADER_MODE
= 2¶
-
RAW_MODE
= 3¶
-
SENTS_MODE
= 1¶
-
WORDS_MODE
= 0¶
-
-
class
nltk.corpus.reader.nkjp.
NKJPCorpus_Morph_View
(filename, **kwargs)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusView
A stream backed corpus view specialized for use with ann_morphosyntax.xml files in NKJP corpus.
-
class
nltk.corpus.reader.nkjp.
NKJPCorpus_Segmentation_View
(filename, **kwargs)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusView
A stream backed corpus view specialized for use with ann_segmentation.xml files in NKJP corpus.
-
class
nltk.corpus.reader.nkjp.
NKJPCorpus_Text_View
(filename, **kwargs)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusView
A stream backed corpus view specialized for use with text.xml files in NKJP corpus.
-
RAW_MODE
= 1¶
-
SENTS_MODE
= 0¶
-
-
class
nltk.corpus.reader.nkjp.
XML_Tool
(root, filename)[source]¶ Bases:
object
Helper class that converts an XML file into one without references to the nkjp: namespace. This is needed because XMLCorpusView assumes that one can find short substrings of XML that are valid XML, which is not true if a namespace is declared at the top level.
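The idea can be sketched with two string substitutions. This is a simplified illustration under the assumption that dropping the declaration and the prefix is sufficient; the `strip_nkjp_prefix` helper and the sample element are hypothetical, and the real helper may handle more cases.

```python
import re

def strip_nkjp_prefix(xml_text):
    """Drop the xmlns:nkjp declaration and every 'nkjp:' prefix (a rough sketch)."""
    without_decl = re.sub(r'\s+xmlns:nkjp="[^"]*"', "", xml_text)
    return without_decl.replace("nkjp:", "")

# Invented sample element; the namespace URI is a placeholder.
sample = '<seg xmlns:nkjp="http://example.org/nkjp" nkjp:nps="true">x</seg>'
cleaned = strip_nkjp_prefix(sample)
print(cleaned)
```

After this step, a fragment such as the `seg` element parses on its own without the top-level declaration.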
nltk.corpus.reader.nombank module¶
-
class
nltk.corpus.reader.nombank.
NombankChainTreePointer
(pieces)[source]¶ Bases:
nltk.corpus.reader.nombank.NombankPointer
-
pieces
= None¶ A list of the pieces that make up this chain. Elements may be either
NombankSplitTreePointer
or NombankTreePointer
pointers.
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.nombank.
NombankCorpusReader
(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as
'turn'
or 'turn_on'
, each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.
-
instances
(baseform=None)[source]¶ Returns: a corpus view that acts as a list of NombankInstance
objects, one for each noun in the corpus.
-
lines
()[source]¶ Returns: a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.
-
-
class
nltk.corpus.reader.nombank.
NombankInstance
(fileid, sentnum, wordnum, baseform, sensenumber, predicate, predid, arguments, parse_corpus=None)[source]¶ Bases:
object
-
arguments
= None¶ A list of tuples (argloc, argid), specifying the location and identifier for each of the predicate’s arguments in the containing sentence. Argument identifiers are strings such as
'ARG0'
or 'ARGM-TMP'
. This list does not contain the predicate.
-
baseform
= None¶ The baseform of the predicate.
-
fileid
= None¶ The name of the file containing the parse tree for this instance’s sentence.
-
parse_corpus
= None¶ A corpus reader for the parse trees corresponding to the instances in this nombank corpus.
-
predicate
= None¶ A
NombankTreePointer
indicating the position of this instance’s predicate within its containing sentence.
-
predid
= None¶ Identifier of the predicate.
-
roleset
¶ The name of the roleset used by this instance’s predicate. Use
nombank.roleset() <NombankCorpusReader.roleset>
to look up information about the roleset.
-
sensenumber
= None¶ The sense number of the predicate.
-
sentnum
= None¶ The sentence number of this sentence within
fileid
. Indexing starts from zero.
-
tree
¶ The parse tree corresponding to this instance, or None if the corresponding tree is not available.
-
unicode_repr
()¶
-
wordnum
= None¶ The word number of this instance’s predicate within its containing sentence. Word numbers are indexed starting from zero, and include traces and other empty parse elements.
-
-
class
nltk.corpus.reader.nombank.
NombankPointer
[source]¶ Bases:
object
A pointer used by nombank to identify one or more constituents in a parse tree.
NombankPointer
is an abstract base class with three concrete subclasses:
- NombankTreePointer is used to point to single constituents.
- NombankSplitTreePointer is used to point to ‘split’ constituents, which consist of a sequence of two or more NombankTreePointer pointers.
- NombankChainTreePointer is used to point to entire trace chains in a tree. It consists of a sequence of pieces, which can be NombankTreePointer or NombankSplitTreePointer pointers.
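The three pointer kinds can be illustrated with a small parser for the wordnum:height notation. The sketch below returns plain tuples instead of the pointer classes, and it assumes that '*' joins the pieces of a trace chain and ',' joins the pieces of a split constituent; `parse_pointer` is a hypothetical helper.

```python
def parse_pointer(s):
    """Parse a pointer string into nested tuples.

    Assumption: '*' joins the pieces of a trace chain, ',' joins the pieces
    of a split constituent, and 'wordnum:height' names a single constituent.
    """
    if "*" in s:
        return ("chain", [parse_pointer(p) for p in s.split("*")])
    if "," in s:
        return ("split", [parse_pointer(p) for p in s.split(",")])
    wordnum, height = s.split(":")
    return ("tree", int(wordnum), int(height))

print(parse_pointer("3:0"))
print(parse_pointer("2:0,4:1*7:0"))
```

Checking for '*' before ',' reflects the nesting described above: a chain may contain split pointers, but a split pointer contains only single-constituent pointers.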
-
class
nltk.corpus.reader.nombank.
NombankSplitTreePointer
(pieces)[source]¶ Bases:
nltk.corpus.reader.nombank.NombankPointer
-
pieces
= None¶ A list of the pieces that make up this chain. Elements are all
NombankTreePointer
pointers.
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.nombank.
NombankTreePointer
(wordnum, height)[source]¶ Bases:
nltk.corpus.reader.nombank.NombankPointer
wordnum:height*wordnum:height*... wordnum:height,
-
treepos
(tree)[source]¶ Convert this pointer to a standard ‘tree position’ pointer, given that it points to the given tree.
-
unicode_repr
()¶
-
nltk.corpus.reader.nps_chat module¶
nltk.corpus.reader.opinion_lexicon module¶
CorpusReader for the Opinion Lexicon.
- Opinion Lexicon information -
- Authors: Minqing Hu and Bing Liu, 2004.
- Department of Computer Science, University of Illinois at Chicago
- Contact: Bing Liu, liub@cs.uic.edu
- http://www.cs.uic.edu/~liub
Distributed with permission.
Related papers:
- Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-04), Aug 22-25, 2004, Seattle, Washington, USA.
- Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web”. Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
-
class
nltk.corpus.reader.opinion_lexicon.
IgnoreReadmeCorpusView
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.util.StreamBackedCorpusView
This CorpusView is used to skip the initial readme block of the corpus.
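A minimal sketch of the skipping idea, assuming the readme is the leading block of lines that ends at the first blank line. The helper names and the sample content are hypothetical, and the actual view works on corpus streams rather than on an in-memory string.

```python
import io

def skip_readme(stream):
    """Advance past the first block of lines and the blank line that ends it."""
    for line in stream:
        if not line.strip():
            break
    return stream

def read_words(stream):
    """Return the non-empty lines that follow the readme block."""
    return [line.strip() for line in skip_readme(stream) if line.strip()]

# Invented file content: a two-line readme, a blank line, then word entries.
sample = io.StringIO("; lexicon readme\n; more notes\n\nabnormal\nabolish\n\nabort\n")
words = read_words(sample)
print(words)
```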
-
class
nltk.corpus.reader.opinion_lexicon.
OpinionLexiconCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
Reader for Liu and Hu opinion lexicon. Blank lines and readme are ignored.
>>> from nltk.corpus import opinion_lexicon
>>> opinion_lexicon.words()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]
The OpinionLexiconCorpusReader provides shortcuts to retrieve positive/negative words:
>>> opinion_lexicon.negative()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]
Note that the words returned by the words() method are sorted by file id, not alphabetically:
>>> opinion_lexicon.words()[0:10]
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
>>> sorted(opinion_lexicon.words())[0:10]
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort']
-
CorpusView
¶ alias of
IgnoreReadmeCorpusView
-
negative
()[source]¶ Return all negative words in alphabetical order.
Returns: a list of negative words. Return type: list(str)
-
positive
()[source]¶ Return all positive words in alphabetical order.
Returns: a list of positive words. Return type: list(str)
-
words
(fileids=None)[source]¶ Return all words in the opinion lexicon. Note that these words are not sorted in alphabetical order.
Parameters: fileids – a list or regexp specifying the ids of the files whose words have to be returned. Returns: the given file(s) as a list of words and punctuation symbols. Return type: list(str)
-
nltk.corpus.reader.panlex_lite module¶
CorpusReader for PanLex Lite, a stripped down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite.
-
class
nltk.corpus.reader.panlex_lite.
Meaning
(mn, attr)[source]¶ Bases:
dict
Represents a single PanLex meaning. A meaning is a translation set derived from a single source.
-
class
nltk.corpus.reader.panlex_lite.
PanLexLiteCorpusReader
(root)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
-
MEANING_Q
= '\n SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n ORDER BY dnx2.uq DESC\n '¶
-
TRANSLATION_Q
= '\n SELECT s.tt, sum(s.uq) AS trq FROM (\n SELECT ex2.tt, max(dnx.uq) AS uq\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n GROUP BY ex2.tt, dnx.ui\n ) s\n GROUP BY s.tt\n ORDER BY trq DESC, s.tt\n '¶
-
language_varieties
(lc=None)[source]¶ Return a list of PanLex language varieties.
Parameters: lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned. Returns: the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name. Return type: list(tuple)
-
meanings
(expr_uid, expr_tt)[source]¶ Return a list of meanings for an expression.
Parameters: - expr_uid – the expression’s language variety, as a seven-character uniform identifier.
- expr_tt – the expression’s text.
Returns: a list of Meaning objects.
Return type: list(Meaning)
-
translations
(from_uid, from_tt, to_uid)[source]¶ Return a list of translations for an expression into a single language variety.
Parameters: - from_uid – the source expression’s language variety, as a seven-character uniform identifier.
- from_tt – the source expression’s text.
- to_uid – the target language variety, as a seven-character uniform identifier.
Returns: a list of translation tuples. The first element is the expression text and the second element is the translation quality.
Return type: list(tuple)
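To see how the translation query ranks results, it can be run against a tiny in-memory SQLite database. The table layout below is an assumption inferred only from the columns the query touches, the rows are invented, and the integer language-variety ids (10 and 20) stand in for whatever the reader derives from the uniform identifiers.

```python
import sqlite3

# Query text copied from TRANSLATION_Q above.
TRANSLATION_Q = """
    SELECT s.tt, sum(s.uq) AS trq FROM (
        SELECT ex2.tt, max(dnx.uq) AS uq
        FROM dnx
        JOIN ex ON (ex.ex = dnx.ex)
        JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
        JOIN ex ex2 ON (ex2.ex = dnx2.ex)
        WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?
        GROUP BY ex2.tt, dnx.ui
    ) s
    GROUP BY s.tt
    ORDER BY trq DESC, s.tt
"""

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns used by the query.
conn.executescript("""
    CREATE TABLE ex  (ex INTEGER, lv INTEGER, tt TEXT);
    CREATE TABLE dnx (mn INTEGER, ex INTEGER, uq INTEGER, ui INTEGER);
""")
conn.executemany("INSERT INTO ex VALUES (?, ?, ?)",
                 [(1, 10, "book"), (2, 20, "livre"), (3, 20, "bouquin")])
# Three meanings (mn), each from a different source (ui), linking expressions.
conn.executemany("INSERT INTO dnx VALUES (?, ?, ?, ?)",
                 [(100, 1, 5, 1), (100, 2, 5, 1),
                  (101, 1, 3, 2), (101, 3, 3, 2),
                  (102, 1, 4, 3), (102, 2, 4, 3)])

translations = conn.execute(TRANSLATION_Q, (10, "book", 20)).fetchall()
print(translations)
```

In this toy data, "livre" is supported by two sources and so accumulates a higher summed quality than "bouquin", which is why it is listed first.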
-
nltk.corpus.reader.pl196x module¶
-
class
nltk.corpus.reader.pl196x.
Pl196xCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.xmldocs.XMLCorpusReader
-
head_len
= 2770¶
-
textids
(fileids=None, categories=None)[source]¶ In the pl196x corpus each category is stored in a single file, and thus both methods provide identical functionality. In order to accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks, giving much more control to the user.
-
nltk.corpus.reader.plaintext module¶
A reader for corpora that consist of plaintext documents.
-
class
nltk.corpus.reader.plaintext.
CategorizedPlaintextCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.plaintext.PlaintextCorpusReader
A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
-
class
nltk.corpus.reader.plaintext.
EuroparlCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.plaintext.PlaintextCorpusReader
Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from
PlaintextCorpusReader
except that:
- Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespace.
- For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
- There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
- The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
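The splitting rules listed above can be sketched in plain Python, independent of the reader itself: blank lines separate chapters, each line is one sentence, and words are separated by whitespace. The `read_chapters` helper and the sample text are hypothetical.

```python
def read_chapters(text):
    """Split pre-tokenized text into chapters -> sentences -> words."""
    chapters = []
    for block in text.split("\n\n"):  # chapters are separated by blank lines
        lines = [ln for ln in block.split("\n") if ln.strip()]  # one sentence per line
        if lines:
            chapters.append([ln.split() for ln in lines])  # words split on whitespace
    return chapters

# Invented sample: two chapters of pre-tokenized text.
sample = ("Resumption of the session\n"
          "I declare the session resumed .\n"
          "\n"
          "Agenda\n"
          "The next item is the vote .\n")
chapters = read_chapters(sample)
print(len(chapters), chapters[1])
```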
-
class
nltk.corpus.reader.plaintext.
PlaintextCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the
CorpusView
class variable.
-
CorpusView
¶ The corpus view class used by this reader. Subclasses of
PlaintextCorpusReader
may specify alternative corpus view classes (e.g., to skip the preface sections of documents).
alias of
StreamBackedCorpusView
-
paras
(fileids=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
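As a rough approximation of the paragraph and word levels, the sketch below splits text at blank lines and tokenizes with the word pattern shown in the constructor signature. Sentence splitting is left out because the default sentence tokenizer is a trained Punkt model rather than a simple rule, so the result has one nesting level fewer than paras(). The `paragraphs_as_words` helper is hypothetical.

```python
import re

# Pattern shown for WordPunctTokenizer in the constructor signature.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def paragraphs_as_words(text):
    """Split text into paragraphs at blank lines, then tokenize each paragraph."""
    paras = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [WORD_PUNCT.findall(p) for p in paras]

sample = "Hello, world!\nSecond line.\n\nNew paragraph: done."
result = paragraphs_as_words(sample)
print(result)
```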
-
-
class
nltk.corpus.reader.plaintext.
PortugueseCategorizedPlaintextCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
nltk.corpus.reader.ppattach module¶
Read lines from the Prepositional Phrase Attachment Corpus.
The PP Attachment Corpus contains several files having the format:
sentence_id verb noun1 preposition noun2 attachment
For example:
42960 gives authority to administration V
46742 gives inventors of microchip N
The PP attachment is to the verb phrase (V) or noun phrase (N), i.e.:
(VP gives (NP authority) (PP to administration))
(VP gives (NP inventors (PP of microchip)))
The corpus contains the following files:
training: training set
devset: development test set, used for algorithm development.
test: test set, used to report results
bitstrings: word classes derived from Mutual Information Clustering for the Wall Street Journal.
Ratnaparkhi, Adwait (1994). A Maximum Entropy Model for Prepositional Phrase Attachment. Proceedings of the ARPA Human Language Technology Conference. [http://www.cis.upenn.edu/~adwait/papers/hlt94.ps]
The PP Attachment Corpus is distributed with NLTK with the permission of the author.
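The line format described above is simple enough to parse by hand. The following sketch is independent of the NLTK reader; the `PPAttachment` tuple and `parse_ppattach` function are hypothetical names chosen for illustration.

```python
from collections import namedtuple

# One record per line: sentence_id verb noun1 preposition noun2 attachment
PPAttachment = namedtuple("PPAttachment", "sent verb noun1 prep noun2 attachment")

def parse_ppattach(text):
    """Parse whitespace-separated PP attachment lines, skipping blank lines."""
    return [PPAttachment(*line.split()) for line in text.splitlines() if line.strip()]

# The two example lines from the format description.
sample = ("42960 gives authority to administration V\n"
          "46742 gives inventors of microchip N\n")
records = parse_ppattach(sample)

# Select the instances where the PP attaches to the noun phrase.
noun_attached = [r for r in records if r.attachment == "N"]
print(noun_attached)
```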
nltk.corpus.reader.propbank module¶
-
class
nltk.corpus.reader.propbank.
PropbankChainTreePointer
(pieces)[source]¶ Bases:
nltk.corpus.reader.propbank.PropbankPointer
-
pieces
= None¶ A list of the pieces that make up this chain. Elements may be either
PropbankSplitTreePointer
or PropbankTreePointer
pointers.
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.propbank.
PropbankCorpusReader
(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as
'turn'
or 'turn_on'
, each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.
-
instances
(baseform=None)[source]¶ Returns: a corpus view that acts as a list of PropbankInstance
objects, one for each verb in the corpus.
-
-
class
nltk.corpus.reader.propbank.
PropbankInflection
(form='-', tense='-', aspect='-', person='-', voice='-')[source]¶ Bases:
object
-
ACTIVE
= 'a'¶
-
FINITE
= 'v'¶
-
FUTURE
= 'f'¶
-
GERUND
= 'g'¶
-
INFINITIVE
= 'i'¶
-
NONE
= '-'¶
-
PARTICIPLE
= 'p'¶
-
PASSIVE
= 'p'¶
-
PAST
= 'p'¶
-
PERFECT
= 'p'¶
-
PERFECT_AND_PROGRESSIVE
= 'b'¶
-
PRESENT
= 'n'¶
-
PROGRESSIVE
= 'o'¶
-
THIRD_PERSON
= '3'¶
-
unicode_repr
()¶
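Several of the constants above share the code 'p' (participle, past, perfect, passive), so a code is only meaningful together with its position. The sketch below decodes a five-character string on the assumption that the positions follow the constructor's parameter order (form, tense, aspect, person, voice); the grouping of constants by field and the `decode_inflection` helper are illustrative guesses, not the class's own parser.

```python
# Codes copied from the constants above, grouped by field (assumed grouping).
FIELDS = [
    ("form",   {"i": "infinitive", "g": "gerund", "p": "participle", "v": "finite"}),
    ("tense",  {"f": "future", "p": "past", "n": "present"}),
    ("aspect", {"p": "perfect", "o": "progressive", "b": "perfect and progressive"}),
    ("person", {"3": "third person"}),
    ("voice",  {"a": "active", "p": "passive"}),
]

def decode_inflection(code):
    """Decode a five-character inflection string; '-' means 'none'."""
    if len(code) != 5:
        raise ValueError("expected exactly five characters: %r" % code)
    decoded = {}
    for ch, (name, table) in zip(code, FIELDS):
        if ch != "-" and ch not in table:
            raise ValueError("bad %s code %r" % (name, ch))
        decoded[name] = table.get(ch)  # None when the field is '-'
    return decoded

print(decode_inflection("vp--a"))
```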
-
-
class
nltk.corpus.reader.propbank.
PropbankInstance
(fileid, sentnum, wordnum, tagger, roleset, inflection, predicate, arguments, parse_corpus=None)[source]¶ Bases:
object
-
arguments
= None¶ A list of tuples (argloc, argid), specifying the location and identifier for each of the predicate’s arguments in the containing sentence. Argument identifiers are strings such as
'ARG0'
or 'ARGM-TMP'
. This list does not contain the predicate.
-
baseform
¶ The baseform of the predicate.
-
fileid
= None¶ The name of the file containing the parse tree for this instance’s sentence.
-
inflection
= None¶ A
PropbankInflection
object describing the inflection of this instance’s predicate.
-
parse_corpus
= None¶ A corpus reader for the parse trees corresponding to the instances in this propbank corpus.
-
predicate
= None¶ A
PropbankTreePointer
indicating the position of this instance’s predicate within its containing sentence.
-
predid
¶ Identifier of the predicate.
-
roleset
= None¶ The name of the roleset used by this instance’s predicate. Use
propbank.roleset() <PropbankCorpusReader.roleset>
to look up information about the roleset.
-
sensenumber
¶ The sense number of the predicate.
-
sentnum
= None¶ The sentence number of this sentence within
fileid
. Indexing starts from zero.
-
tagger
= None¶ An identifier for the tagger who tagged this instance; or
'gold'
if this is an adjudicated instance.
-
tree
¶ The parse tree corresponding to this instance, or None if the corresponding tree is not available.
-
unicode_repr
()¶
-
wordnum
= None¶ The word number of this instance’s predicate within its containing sentence. Word numbers are indexed starting from zero, and include traces and other empty parse elements.
-
-
class
nltk.corpus.reader.propbank.
PropbankPointer
[source]¶ Bases:
object
A pointer used by propbank to identify one or more constituents in a parse tree.
PropbankPointer
is an abstract base class with three concrete subclasses:
- PropbankTreePointer is used to point to single constituents.
- PropbankSplitTreePointer is used to point to ‘split’ constituents, which consist of a sequence of two or more PropbankTreePointer pointers.
- PropbankChainTreePointer is used to point to entire trace chains in a tree. It consists of a sequence of pieces, which can be PropbankTreePointer or PropbankSplitTreePointer pointers.
-
class
nltk.corpus.reader.propbank.
PropbankSplitTreePointer
(pieces)[source]¶ Bases:
nltk.corpus.reader.propbank.PropbankPointer
-
pieces
= None¶ A list of the pieces that make up this chain. Elements are all
PropbankTreePointer
pointers.
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.propbank.
PropbankTreePointer
(wordnum, height)[source]¶ Bases:
nltk.corpus.reader.propbank.PropbankPointer
wordnum:height*wordnum:height*... wordnum:height,
-
treepos
(tree)[source]¶ Convert this pointer to a standard ‘tree position’ pointer, given that it points to the given tree.
-
unicode_repr
()¶
-
nltk.corpus.reader.pros_cons module¶
CorpusReader for the Pros and Cons dataset.
- Pros and Cons dataset information -
- Contact: Bing Liu, liub@cs.uic.edu
- http://www.cs.uic.edu/~liub
Distributed with permission.
Related papers:
- Murthy Ganapathibhotla and Bing Liu. “Mining Opinions in Comparative Sentences”. Proceedings of the 22nd International Conference on Computational Linguistics (Coling-2008), Manchester, 18-22 August, 2008.
- Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web”. Proceedings of the 14th international World Wide Web conference (WWW-2005), May 10-14, 2005, in Chiba, Japan.
-
class
nltk.corpus.reader.pros_cons.
ProsConsCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), encoding='utf8', **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.api.CorpusReader
Reader for the Pros and Cons sentence dataset.
>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons')
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy', 'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'], ...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
sents
(fileids=None, categories=None)[source]¶ Return all sentences in the corpus or in the specified files/categories.
Parameters: - fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- categories – a list specifying the categories whose sentences have to be returned.
Returns: the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
Return type: list(list(str))
-
words
(fileids=None, categories=None)[source]¶ Return all words and punctuation symbols in the corpus or in the specified files/categories.
Parameters: - fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- categories – a list specifying the categories whose words have to be returned.
Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
-
nltk.corpus.reader.reviews module¶
CorpusReader for reviews corpora (syntax based on Customer Review Corpus).
- Customer Review Corpus information -
- Annotated by: Minqing Hu and Bing Liu, 2004.
- Department of Computer Science, University of Illinois at Chicago
- Contact: Bing Liu, liub@cs.uic.edu
- http://www.cs.uic.edu/~liub
Distributed with permission.
The “product_reviews_1” and “product_reviews_2” datasets respectively contain annotated customer reviews of 5 and 9 products from amazon.com.
Related papers:
- Minqing Hu and Bing Liu. “Mining and summarizing customer reviews”. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-04), 2004.
- Minqing Hu and Bing Liu. “Mining Opinion Features in Customer Reviews”. Proceedings of Nineteenth National Conference on Artificial Intelligence (AAAI-2004), 2004.
- Xiaowen Ding, Bing Liu and Philip S. Yu. “A Holistic Lexicon-Based Approach to Opinion Mining.” Proceedings of First ACM International Conference on Web Search and Data Mining (WSDM-2008), Feb 11-12, 2008, Stanford University, Stanford, California, USA.
Symbols used in the annotated reviews:
[t] : the title of the review. Each [t] tag starts a review.
xxxx[+|-n]: xxxx is a product feature.
[+n]: Positive opinion; n is the opinion strength: 3 is strongest and 1 is weakest. Note that the strength is quite subjective. You may want to ignore it and consider only + and -.
[-n]: Negative opinion.
## : start of each sentence. Each line is a sentence.
[u] : feature does not appear in the sentence.
[p] : feature does not appear in the sentence. Pronoun resolution is needed.
[s] : suggestion or recommendation.
[cc]: comparison with a competing product from a different brand.
[cs]: comparison with a competing product from the same brand.
Note: Some of the files (e.g. “ipod.txt”, “Canon PowerShot SD500.txt”) do not provide separation between different reviews. This is because the dataset was specifically designed for aspect/feature-based sentiment analysis, for which sentence-level annotation is sufficient. For document-level classification and analysis, this peculiarity should be taken into consideration.
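The annotation symbols can be pulled out of a line with two regular expressions, as in the sketch below. The patterns, the `parse_review_line` helper and the sample line are written for illustration and are not the reader's own; title lines marked with [t] are not handled here.

```python
import re

# A feature name followed by [+n] or [-n].
FEATURE = re.compile(r"([\w ]*\w)\[([+-]\d)\]")
# Notes such as [u], [p], [s], [cc], [cs].
NOTE = re.compile(r"\[(u|p|s|cc|cs)\]")

def parse_review_line(line):
    """Split one annotated line into (features, notes, sentence)."""
    annotations, _, sentence = line.partition("##")
    features = [(name.strip(), score) for name, score in FEATURE.findall(annotations)]
    notes = NOTE.findall(annotations)
    return features, notes, sentence.strip()

# Invented example line in the annotated format.
line = "picture quality[+2], battery[-1][u]##the pictures are sharp but it drains fast ."
features, notes, sentence = parse_review_line(line)
print(features, notes, sentence)
```

Only the text before the ## marker is searched for annotations, so bracketed text inside the sentence itself is left alone.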
-
class
nltk.corpus.reader.reviews.
Review
(title=None, review_lines=None)[source]¶ Bases:
object
A Review is the main block of a ReviewsCorpusReader.
-
add_line
(review_line)[source]¶ Add a line (ReviewLine) to the review.
Parameters: review_line – a ReviewLine instance that belongs to the Review.
-
features
()[source]¶ Return a list of features in the review. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.
Returns: all features of the review as a list of tuples (feat, score). Return type: list(tuple)
-
sents
()[source]¶ Return all tokenized sentences in the review.
Returns: all sentences of the review as lists of tokens. Return type: list(list(str))
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.reviews.
ReviewLine
(sent, features=None, notes=None)[source]¶ Bases:
object
A ReviewLine represents a sentence of the review, together with (optional) annotations of its features and notes about the reviewed item.
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.reviews.
ReviewsCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for the Customer Review Data dataset by Hu, Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization.
>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am', 'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'), ('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'), ('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'), ('option', '+1')]
We can also reach the same information directly from the stream:
>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]
We can compute stats for specific product features:
>>> from __future__ import division
>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> # The __future__ import above makes / perform true division under Python 2.7
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
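The same statistics can be gathered for every feature in a single pass. The following is a minimal sketch that works on plain (feat, score) tuples of the shape returned by features(); the helper name mean_scores and the sample list are made up for illustration and are not part of NLTK.

```python
from collections import defaultdict


def mean_scores(features):
    """Map each feature name to (count, total, mean) of its opinion scores."""
    totals = defaultdict(lambda: [0, 0])  # feature -> [count, total]
    for feat, score in features:
        totals[feat][0] += 1
        totals[feat][1] += int(score)  # scores are strings such as '+2' or '-1'
    return {feat: (n, tot, tot / n) for feat, (n, tot) in totals.items()}


# Hypothetical sample in the (feat, score) format shown above.
sample = [('picture', '+2'), ('use', '+2'), ('picture', '+1'), ('use', '-1')]
stats = mean_scores(sample)
print(stats['picture'])
```

With a real corpus, the sample list would be replaced by the result of product_reviews_1.features(...), which avoids reading the file twice.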
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
features
(fileids=None)[source]¶ Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.
Parameters: fileids – a list or regexp specifying the ids of the files whose features have to be returned. Returns: all features for the item(s) in the given file(s). Return type: list(tuple)
-
raw
(fileids=None)[source]¶ Parameters: fileids – a list or regexp specifying the fileids of the files that have to be returned as a raw string. Returns: the given file(s) as a single string. Return type: str
-
reviews
(fileids=None)[source]¶ Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.
Parameters: fileids – a list or regexp specifying the ids of the files whose reviews have to be returned. Returns: the given file(s) as a list of reviews.
-
sents
(fileids=None)[source]¶ Return all sentences in the corpus or in the specified files.
Parameters: fileids – a list or regexp specifying the ids of the files whose sentences have to be returned. Returns: the given file(s) as a list of sentences, each encoded as a list of word strings. Return type: list(list(str))
-
words
(fileids=None)[source]¶ Return all words and punctuation symbols in the corpus or in the specified files.
Parameters: fileids – a list or regexp specifying the ids of the files whose words have to be returned. Returns: the given file(s) as a list of words and punctuation symbols. Return type: list(str)
-
nltk.corpus.reader.rte module¶
Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora.
The files were taken from the RTE1, RTE2 and RTE3 datasets and then regularized.
Filenames are of the form rte*_dev.xml and rte*_test.xml. The latter are the gold standard annotated files.
Each entailment corpus is a list of ‘text’/‘hypothesis’ pairs. The following example is taken from RTE3:
<pair id="1" entailment="YES" task="IE" length="short" >
<t>The sale was made to pay Yukos' US$ 27.5 billion tax bill,
Yuganskneftegaz was originally sold for US$ 9.4 billion to a little known
company Baikalfinansgroup which was later bought by the Russian
state-owned oil company Rosneft .</t>
<h>Baikalfinansgroup was sold to Rosneft.</h>
</pair>
In order to provide globally unique IDs for each pair, a new attribute challenge has been added to the root element entailment-corpus of each file, taking values 1, 2 or 3. The GID is formatted ‘m-n’, where ‘m’ is the challenge number and ‘n’ is the pair ID.
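The GID construction can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the reader's own code: the miniature document below is hypothetical (with a shortened text element), and pair_gids is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus file; real files contain many <pair> elements.
SAMPLE = """
<entailment-corpus challenge="3">
  <pair id="1" entailment="YES" task="IE" length="short">
    <t>Yuganskneftegaz was sold to Baikalfinansgroup, later bought by Rosneft.</t>
    <h>Baikalfinansgroup was sold to Rosneft.</h>
  </pair>
</entailment-corpus>
"""


def pair_gids(xml_text):
    """Return (gid, entailment) for every pair, with gid formatted 'm-n'."""
    root = ET.fromstring(xml_text.strip())
    challenge = root.get("challenge")  # attribute on the root element
    return [
        ("%s-%s" % (challenge, pair.get("id")), pair.get("entailment"))
        for pair in root.iter("pair")
    ]


print(pair_gids(SAMPLE))
```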
-
class
nltk.corpus.reader.rte.
RTECorpusReader
(root, fileids, wrap_etree=False)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for corpora in RTE challenges.
This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.
-
class
nltk.corpus.reader.rte.
RTEPair
(pair, challenge=None, id=None, text=None, hyp=None, value=None, task=None, length=None)[source]¶ Bases:
object
Container for RTE text-hypothesis pairs.
The entailment relation is signalled by the value attribute in RTE1, and by entailment in RTE2 and RTE3. These both get mapped on to the entailment attribute of this class.
-
unicode_repr
()¶
-
nltk.corpus.reader.semcor module¶
Corpus reader for the SemCor Corpus.
-
class
nltk.corpus.reader.semcor.
SemcorCorpusReader
(root, fileids, wordnet, lazy=True)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().
-
chunk_sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of chunks. Return type: list(list(list(str)))
-
chunks
(fileids=None)[source]¶ Returns: the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit. Return type: list(list(str))
-
sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_chunks
(fileids=None, tag='pos')[source]¶ Returns: the given file(s) as a list of tagged chunks, represented in tree form. Return type: list(Tree) Parameters: tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
-
tagged_sents
(fileids=None, tag='pos')[source]¶ Returns: the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form). Return type: list(list(Tree)) Parameters: tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
-
-
class
nltk.corpus.reader.semcor.
SemcorSentence
(num, items)[source]¶ Bases:
list
A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).
-
class
nltk.corpus.reader.semcor.
SemcorWordView
(fileid, unit, bracket_sent, pos_tag, sem_tag, wordnet)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusView
A stream backed corpus view specialized for use with the SemCor corpus.
nltk.corpus.reader.senseval module¶
Read from the Senseval 2 Corpus.
SENSEVAL [http://www.senseval.org/] Evaluation exercises for Word Sense Disambiguation. Organized by ACL-SIGLEX [http://www.siglex.org/]
Prepared by Ted Pedersen <tpederse@umn.edu>, University of Minnesota, http://www.d.umn.edu/~tpederse/data.html Distributed with permission.
The NLTK version of the Senseval 2 files uses well-formed XML. Each instance of the ambiguous words “hard”, “interest”, “line”, and “serve” is tagged with a sense identifier, and supplied with context.
-
class
nltk.corpus.reader.senseval.
SensevalCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶
nltk.corpus.reader.sentiwordnet module¶
An NLTK interface for SentiWordNet
SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity.
For details about SentiWordNet see: http://sentiwordnet.isti.cnr.it/
>>> from nltk.corpus import sentiwordnet as swn
>>> print(swn.senti_synset('breakdown.n.03'))
<breakdown.n.03: PosScore=0.0 NegScore=0.25>
>>> list(swn.senti_synsets('slow'))
[SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'),
SentiSynset('slow.v.03'), SentiSynset('slow.a.01'),
SentiSynset('slow.a.02'), SentiSynset('dense.s.04'),
SentiSynset('slow.a.04'), SentiSynset('boring.s.01'),
SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'),
SentiSynset('behind.r.03')]
>>> happy = swn.senti_synsets('happy', 'a')
>>> happy0 = list(happy)[0]
>>> happy0.pos_score()
0.875
>>> happy0.neg_score()
0.0
>>> happy0.obj_score()
0.125
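The three scores in the example above sum to 1. The sketch below assumes, consistently with that example, that objectivity is whatever remains after positivity and negativity; TinySentiSynset is a hypothetical stand-in written for illustration, not the NLTK class.

```python
class TinySentiSynset:
    """Hypothetical stand-in holding the sentiment scores of one synset."""

    def __init__(self, name, pos_score, neg_score):
        self.name = name
        self._pos = pos_score
        self._neg = neg_score

    def pos_score(self):
        return self._pos

    def neg_score(self):
        return self._neg

    def obj_score(self):
        # Assumption: objectivity = 1 - (positivity + negativity).
        return 1.0 - (self._pos + self._neg)


happy = TinySentiSynset('happy.a.01', 0.875, 0.0)
print(happy.pos_score(), happy.neg_score(), happy.obj_score())
```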
nltk.corpus.reader.sinica_treebank module¶
Sinica Treebank Corpus Sample
http://rocling.iis.sinica.edu.tw/CKIP/engversion/treebank.htm
10,000 parsed sentences, drawn from the Academia Sinica Balanced Corpus of Modern Chinese. Parse tree notation is based on Information-based Case Grammar. Tagset documentation is available at http://www.sinica.edu.tw/SinicaCorpus/modern_e_wordtype.html
Language and Knowledge Processing Group, Institute of Information Science, Academia Sinica
It is distributed with the Natural Language Toolkit under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike License [http://creativecommons.org/licenses/by-nc-sa/2.5/].
References:
Feng-Yi Chen, Pi-Fang Tsai, Keh-Jiann Chen, and Chu-Ren Huang (1999) The Construction of Sinica Treebank. Computational Linguistics and Chinese Language Processing, 4, pp 87-104.
Huang Chu-Ren, Keh-Jiann Chen, Feng-Yi Chen, Keh-Jiann Chen, Zhao-Ming Gao, and Kuang-Yu Chen. 2000. Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface. Proceedings of 2nd Chinese Language Processing Workshop, Association for Computational Linguistics.
Chen Keh-Jiann and Yu-Ming Hsieh (2004) Chinese Treebanks and Grammar Extraction, Proceedings of IJCNLP-04, pp560-565.
-
class
nltk.corpus.reader.sinica_treebank.
SinicaTreebankCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.SyntaxCorpusReader
Reader for the Sinica Treebank.
nltk.corpus.reader.string_category module¶
Read tuples from a corpus consisting of categorized strings. For example, from the question classification corpus:
NUM:dist How far is it from Denver to Aspen ?
LOC:city What county is Modesto , California in ?
HUM:desc Who was Galileo ?
DESC:def What is an atom ?
NUM:date When did Hawaii become a state ?
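Lines in this format can be turned into tuples by splitting each one at its first space. The following is a minimal sketch; the function name read_tuples, the default delimiter, and the (category, text) tuple order are assumptions made for illustration rather than a description of the reader's implementation.

```python
def read_tuples(lines, delimiter=" "):
    """Split each non-empty line at the first delimiter into (category, text)."""
    tuples = []
    for line in lines:
        line = line.strip()
        if line:
            category, text = line.split(delimiter, 1)
            tuples.append((category, text))
    return tuples


sample = [
    "NUM:dist How far is it from Denver to Aspen ?",
    "HUM:desc Who was Galileo ?",
]
print(read_tuples(sample))
```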
nltk.corpus.reader.switchboard module¶
-
class
nltk.corpus.reader.switchboard.
SwitchboardTurn
(words, speaker, id)[source]¶ Bases:
list
A specialized list object used to encode switchboard utterances. The elements of the list are the words in the utterance; and two attributes, speaker and id, are provided to retrieve the speaker identifier and utterance id. Note that utterance ids are only unique within a given discourse.
-
unicode_repr
()¶
-
nltk.corpus.reader.tagged module¶
A reader for corpora whose documents contain part-of-speech-tagged words.
-
class
nltk.corpus.reader.tagged.
CategorizedTaggedCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.tagged.TaggedCorpusReader
A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
-
class
nltk.corpus.reader.tagged.
MacMorphoCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.tagged.TaggedCorpusReader
A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.
-
class
nltk.corpus.reader.tagged.
TaggedCorpusReader
(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator, i.e., words should have the form:
    word1/tag1 word2/tag2 word3/tag3 ...
But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.
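The word/tag convention described above can be sketched without NLTK. The helper split_tagged below is an illustrative stand-in for nltk.tag.str2tuple, not that function itself; splitting at the last separator and returning None for untagged tokens are assumptions of this sketch.

```python
def split_tagged(token, sep="/"):
    """Split 'word/tag' at the last separator and upper-case the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)  # no separator: treat the token as untagged
    return (word, tag.upper())


def parse_sentence(line, sep="/"):
    """Tokenize on whitespace, then split each token into a (word, tag) pair."""
    return [split_tagged(tok, sep) for tok in line.split()]


print(parse_sentence("The/at dog/nn barked/vbd ./."))
print(parse_sentence("fly_vb", sep="_"))  # custom separator
```

Splitting at the last separator keeps tokens such as 1/2/cd intact as the word 1/2 with tag CD.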
-
paras
(fileids=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
-
sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_paras
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples. Return type: list(list(list(tuple(str,str))))
-
tagged_sents
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples. Return type: list(list(tuple(str,str)))
-
-
class
nltk.corpus.reader.tagged.
TaggedCorpusView
(corpus_file, encoding, tagged, group_by_sent, group_by_para, sep, word_tokenizer, sent_tokenizer, para_block_reader, tag_mapping_function=None)[source]¶ Bases:
nltk.corpus.reader.util.StreamBackedCorpusView
A specialized corpus view for tagged documents. It can be customized via flags to divide the tagged corpus documents up by sentence or paragraph, and to include or omit part of speech tags.
TaggedCorpusView objects are typically created by TaggedCorpusReader (not directly by nltk users).
nltk.corpus.reader.timit module¶
Read tokens, phonemes and audio data from the NLTK TIMIT Corpus.
This corpus contains a selected portion of the TIMIT corpus.
- 16 speakers from 8 dialect regions
- 1 male and 1 female from each dialect region
- 130 sentences in total (10 sentences per speaker; some sentences are shared among speakers, in particular sa1 and sa2, which are spoken by all speakers)
- 160 recordings of sentences in total (10 recordings per speaker)
- audio format: NIST Sphere, single channel, 16kHz sampling, 16 bit sample, PCM encoding
Module contents¶
The timit corpus reader provides 4 functions and 4 data items.
utterances
List of utterances in the corpus. There are 160 utterances in total, each of which corresponds to a unique utterance of a speaker. Here’s an example of an utterance identifier in the list:
    dr1-fvmh0/sx206
      - _----  _---
      | |  |   | |
      | |  |   | |
      | |  |   | `--- sentence number
      | |  |   `----- sentence type (a:all, i:shared, x:exclusive)
      | |  `--------- speaker ID
      | `------------ sex (m:male, f:female)
      `-------------- dialect region (1..8)
speakers
List of speaker IDs. An example of speaker ID:
    dr1-fvmh0
Note that if you split an item ID at the slash and take the first element of the result, you will get a speaker ID.
>>> itemid = 'dr1-fvmh0/sx206'
>>> spkrid , sentid = itemid.split('/')
>>> spkrid
'dr1-fvmh0'
The second element of the result is a sentence ID.
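The identifier layout described under utterances can also be unpacked field by field. The sketch below is illustrative only: the regular expression is inferred from the single example identifier, and the names UTTERANCE_RE and parse_utterance_id are assumptions, not part of the corpus reader.

```python
import re

# Pattern inferred from the example: dr<region>-<sex><speaker>/s<type><number>.
UTTERANCE_RE = re.compile(
    r"^dr(?P<region>[1-8])-(?P<sex>[mf])(?P<speaker>\w+)/"
    r"s(?P<type>[aix])(?P<number>\d+)$"
)


def parse_utterance_id(itemid):
    """Split an utterance identifier into the fields named in the description."""
    match = UTTERANCE_RE.match(itemid)
    if match is None:
        raise ValueError("not a TIMIT-style utterance id: %r" % itemid)
    fields = match.groupdict()
    # The two halves around the slash are the speaker ID and the sentence ID.
    fields["spkrid"], fields["sentid"] = itemid.split("/")
    return fields


info = parse_utterance_id("dr1-fvmh0/sx206")
print(info["region"], info["sex"], info["type"], info["number"], info["spkrid"])
```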
dictionary()
Phonetic dictionary of words contained in this corpus. This is a Python dictionary from words to phoneme lists.
spkrinfo()
Speaker information table. It’s a Python dictionary from speaker IDs to records of 10 fields. Speaker IDs are the same as the ones in timit.speakers. Each record is a dictionary from field names to values, and the fields are as follows:
- id: speaker ID as defined in the original TIMIT speaker info table
- sex: speaker gender (M:male, F:female)
- dr: speaker dialect region (1:new england, 2:northern, 3:north midland, 4:south midland, 5:southern, 6:new york city, 7:western, 8:army brat (moved around))
- use: corpus type (TRN:training, TST:test); in this sample corpus only TRN is available
- recdate: recording date
- birthdate: speaker birth date
- ht: speaker height
- race: speaker race (WHT:white, BLK:black, AMR:american indian, SPN:spanish-american, ORN:oriental, ???:unknown)
- edu: speaker education level (HS:high school, AS:associate degree, BS:bachelor's degree (BS or BA), MS:master's degree (MS or MA), PHD:doctorate degree (PhD,JD,MD), ??:unknown)
- comments: comments by the recorder
The 4 functions are as follows.
tokenized(sentences=items, offset=False)
Given a list of items, returns an iterator of a list of word lists, each of which corresponds to an item (sentence). If offset is set to True, each element of the word list is a tuple of word(string), start offset and end offset, where offset is represented as a number of 16kHz samples.
phonetic(sentences=items, offset=False)
Given a list of items, returns an iterator of a list of phoneme lists, each of which corresponds to an item (sentence). If offset is set to True, each element of the phoneme list is a tuple of phoneme(string), start offset and end offset, where offset is represented as a number of 16kHz samples.
audiodata(item, start=0, end=None)
Given an item, returns a chunk of audio samples formatted into a string. When the function is called, if start and end are omitted, all samples of the recording will be returned. If only end is omitted, samples from the start offset to the end of the recording will be returned.
play(data)
Play the given audio samples. The audio samples can be obtained from the timit.audiodata function.
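Because the offsets returned with offset=True are counted in 16kHz samples, converting them to seconds is a division by the sampling rate. A minimal sketch follows; the function name and the sample data are hypothetical.

```python
SAMPLE_RATE = 16000  # samples per second, as stated in the corpus description


def offsets_to_seconds(items, sample_rate=SAMPLE_RATE):
    """Convert (label, start, end) sample offsets to (label, start_s, end_s)."""
    return [(label, start / sample_rate, end / sample_rate)
            for label, start, end in items]


# Hypothetical word-level offsets in the (word, start, end) form described above.
words = [("she", 2000, 4000), ("had", 4000, 8000)]
print(offsets_to_seconds(words))
```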
-
class
nltk.corpus.reader.timit.
SpeakerInfo
(id, sex, dr, use, recdate, birthdate, ht, race, edu, comments=None)[source]¶ Bases:
object
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.timit.
TimitCorpusReader
(root, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:
- timitdic.txt: dictionary of standard transcriptions
- spkrinfo.txt: table of speaker information
In addition, the root directory should contain one subdirectory for each speaker, containing four files for each utterance:
- <utterance-id>.txt: text content of utterances
- <utterance-id>.wrd: tokenized text content of utterances
- <utterance-id>.phn: phonetic transcription of utterances
- <utterance-id>.wav: utterance sound file
-
fileids
(filetype=None)[source]¶ Return a list of file identifiers for the files that make up this corpus.
Parameters: filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.
-
play
(utterance, start=0, end=None)[source]¶ Play the given audio sample.
Parameters: utterance – The utterance id of the sample to play
-
spkrutteranceids
(speaker)[source]¶ Returns: A list of all utterances associated with a given speaker.
-
transcription_dict
()[source]¶ Returns: A dictionary giving the ‘standard’ transcription for each word.
nltk.corpus.reader.toolbox module¶
Module for reading, writing and manipulating Toolbox databases and settings files.
-
class
nltk.corpus.reader.toolbox.
ToolboxCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶
nltk.corpus.reader.twitter module¶
A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON.
-
class
nltk.corpus.reader.twitter.
TwitterCorpusReader
(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.
Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.
Construct a new Tweet corpus reader for a set of documents located at the given root directory.
If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:
    from nltk.corpus import TwitterCorpusReader
    reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids=r'.*\.json')
However, the recommended approach is to set the relevant directory as the value of the environment variable TWITTER, and then invoke the reader as follows:
    import os
    root = os.environ['TWITTER']
    reader = TwitterCorpusReader(root, r'.*\.json')
If you want to work directly with the raw Tweets, the json library can be used:
    import json
    for tweet in reader.docs():
        print(json.dumps(tweet, indent=1, sort_keys=True))
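For readers unfamiliar with line-delimited JSON, the format can be parsed with the standard library alone. This sketch uses an in-memory stream with two made-up records; read_json_lines is an illustrative helper, not a method of the reader.

```python
import io
import json


def read_json_lines(stream):
    """Yield one deserialised object per non-blank line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-Tweet file; real Tweet objects carry many more fields.
fake_file = io.StringIO(
    '{"id": 1, "text": "hello world"}\n'
    '{"id": 2, "text": "line-delimited JSON"}\n'
)
tweets = list(read_json_lines(fake_file))
print([tweet["text"] for tweet in tweets])
```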
-
CorpusView
¶ The corpus view class used by this reader.
alias of
StreamBackedCorpusView
-
docs
(fileids=None)[source]¶ Returns the full Tweet objects, as specified by Twitter documentation on Tweets.
Returns: the given file(s) as a list of dictionaries deserialised from JSON. Return type: list(dict)
-
nltk.corpus.reader.udhr module¶
UDHR corpus reader. It mostly deals with encodings.
-
class
nltk.corpus.reader.udhr.
UdhrCorpusReader
(root='udhr')[source]¶ Bases:
nltk.corpus.reader.plaintext.PlaintextCorpusReader
-
ENCODINGS
= [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]¶
-
SKIP
= {'Burmese_Myanmar-UTF8', 'Amharic-Afenegus6..60375', 'Vietnamese-VIQR', 'Esperanto-T61', 'Burmese_Myanmar-WinResearcher', 'Armenian-DallakHelv', 'Magahi-UTF8', 'Magahi-Agra', 'Hungarian_Magyar-Unicode', 'Gujarati-UTF8', 'Bhojpuri-Agra', 'Russian_Russky-UTF8~', 'Tigrinya_Tigrigna-VG2Main', 'Chinese_Mandarin-UTF8', 'Marathi-UTF8', 'Chinese_Mandarin-HZ', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Vietnamese-TCVN', 'Japanese_Nihongo-JIS', 'Navaho_Dine-Navajo-Navaho-font', 'Lao-UTF8', 'Czech-Latin2-err', 'Vietnamese-VPS', 'Tamil-UTF8', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117'}¶
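The two class attributes above can be read as a pattern table and an exclusion set. The sketch below assumes first-match-wins lookup with re.match and assumes that skipped fileids are simply left out; both are assumptions about how such tables might be applied, and only a few entries are copied for brevity.

```python
import re

# A few (pattern, encoding) pairs copied from the ENCODINGS list above.
ENCODINGS = [
    ('.*-Latin1$', 'latin-1'),
    ('.*-UTF8$', 'utf-8'),
    ('Japanese_Nihongo-EUC', 'EUC-JP'),
]
# A small subset of the SKIP set above.
SKIP = {'Lao-UTF8', 'Tamil-UTF8'}


def usable_fileids(fileids):
    """Drop the fileids listed in SKIP, keeping the original order."""
    return [fileid for fileid in fileids if fileid not in SKIP]


def encoding_for(fileid, default=None):
    """Return the encoding of the first pattern that matches the fileid."""
    for pattern, encoding in ENCODINGS:
        if re.match(pattern, fileid):
            return encoding
    return default


for fileid in usable_fileids(['Lao-UTF8', 'French_Francais-Latin1']):
    print(fileid, encoding_for(fileid))
```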
-
nltk.corpus.reader.util module¶
-
class
nltk.corpus.reader.util.
ConcatenatedCorpusView
(corpus_views)[source]¶ Bases:
nltk.collections.AbstractLazySequence
A ‘view’ of a corpus file that joins together one or more StreamBackedCorpusViews. At most one file handle is left open at any time.
-
class
nltk.corpus.reader.util.
PickleCorpusView
(fileid, delete_on_gc=False)[source]¶ Bases:
nltk.corpus.reader.util.StreamBackedCorpusView
A stream backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump). One use case for this class is to store the result of running feature detection on a corpus to disk. This can be useful when performing feature detection is expensive (so we don’t want to repeat it); but the corpus is too large to store in memory. The following example illustrates this technique:
>>> from nltk.corpus.reader.util import PickleCorpusView
>>> from nltk.util import LazyMap
>>> feature_corpus = LazyMap(detect_features, corpus)
>>> PickleCorpusView.write(feature_corpus, some_fileid)
>>> pcv = PickleCorpusView(some_fileid)
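The underlying file format, a sequence of pickles written back to back, can be demonstrated with the pickle module directly. This is a sketch of the storage idea only, not of PickleCorpusView; the helper names are hypothetical and an in-memory buffer stands in for a file on disk.

```python
import io
import pickle


def write_pickles(sequence, stream, protocol=-1):
    """Append each item of the sequence to the stream as its own pickle."""
    for item in sequence:
        pickle.dump(item, stream, protocol)


def read_pickles(stream):
    """Lazily yield the pickled objects stored back to back in the stream."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:  # raised when no further pickle can be read
            return


buffer = io.BytesIO()  # stands in for a file on disk
write_pickles(({"length": len(w)} for w in ["corpus", "view"]), buffer)
buffer.seek(0)
print(list(read_pickles(buffer)))
```

Because both the writer and the reader work one object at a time, neither side needs the whole sequence in memory.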
-
BLOCK_SIZE
= 100¶
-
PROTOCOL
= -1¶
-
classmethod
cache_to_tempfile
(sequence, delete_on_gc=True)[source]¶ Write the given sequence to a temporary file as a pickle corpus; and then return a PickleCorpusView view for that temporary corpus file.
Parameters: delete_on_gc – If true, then the temporary file will be deleted whenever this object gets garbage-collected.
-
-
class
nltk.corpus.reader.util.
StreamBackedCorpusView
(fileid, block_reader=None, startpos=0, encoding='utf8')[source]¶ Bases:
nltk.collections.AbstractLazySequence
A ‘view’ of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc. However, the tokens are only constructed as-needed – the entire corpus is never stored in memory at once.
The constructor to StreamBackedCorpusView takes two arguments: a corpus fileid (specified as a string or as a PathPointer); and a block reader. A “block reader” is a function that reads zero or more tokens from a stream, and returns them as a list. A very simple example of a block reader is:
>>> def simple_block_reader(stream):
...     return stream.readline().split()
This simple block reader reads a single line at a time, and returns a single token (consisting of a string) for each whitespace-separated substring on the line.
When deciding how to define the block reader for a given corpus, careful consideration should be given to the size of blocks handled by the block reader. Smaller block sizes will increase the memory requirements of the corpus view’s internal data structures (by 2 integers per block). On the other hand, larger block sizes may decrease performance for random access to the corpus. (But note that larger block sizes will not decrease performance for iteration.)
Internally, CorpusView maintains a partial mapping from token index to file position, with one entry per block. When a token with a given index i is requested, the CorpusView constructs it as follows:
- First, it searches the toknum/filepos mapping for the token index closest to (but less than or equal to) i.
- Then, starting at the file position corresponding to that index, it reads one block at a time using the block reader until it reaches the requested token.
The toknum/filepos mapping is created lazily: it is initially empty, but every time a new block is read, the block’s initial token is added to the mapping. (Thus, the toknum/filepos map has one entry per block.)
In order to increase efficiency for random access patterns that have high degrees of locality, the corpus view may cache one or more blocks.
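The lookup procedure and the lazily built toknum/filepos mapping can be sketched in a few lines. TinyCorpusView below is an illustrative stand-in written for this explanation, not NLTK's StreamBackedCorpusView: it has no cache, no length support, and works on an in-memory text stream.

```python
import bisect
import io


def simple_block_reader(stream):
    """Read one line and return its whitespace-separated tokens."""
    return stream.readline().split()


class TinyCorpusView:
    """Illustrative stand-in: index tokens in a stream without loading it all."""

    def __init__(self, stream, block_reader):
        self._stream = stream
        self._block_reader = block_reader
        self._toknum = [0]   # token index of the first token of each known block
        self._filepos = [0]  # stream position where each known block starts
        stream.seek(0, io.SEEK_END)
        self._eofpos = stream.tell()

    def __getitem__(self, i):
        if i < 0:
            raise IndexError(i)
        # Step 1: find the last known block starting at or before token i.
        block = bisect.bisect_right(self._toknum, i) - 1
        toknum, filepos = self._toknum[block], self._filepos[block]
        # Step 2: read forward one block at a time until token i is reached.
        while filepos < self._eofpos:
            self._stream.seek(filepos)
            tokens = self._block_reader(self._stream)
            next_toknum = toknum + len(tokens)
            next_filepos = self._stream.tell()
            if next_filepos > self._filepos[-1]:
                # First visit: remember where the following block starts.
                self._toknum.append(next_toknum)
                self._filepos.append(next_filepos)
            if i < next_toknum:
                return tokens[i - toknum]
            toknum, filepos = next_toknum, next_filepos
        raise IndexError(i)


view = TinyCorpusView(io.StringIO("a b c\nd e\n\nf g h i\n"), simple_block_reader)
print(view[7], view[4], view._toknum)
```

The first access reads every block up to the requested token; the second one starts from the recorded position of the second line instead of the beginning of the stream.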
Note: Each CorpusView object internally maintains an open file object for its underlying corpus file. This file should be automatically closed when the CorpusView is garbage collected, but if you wish to close it manually, use the close() method. If you access a CorpusView’s items after it has been closed, the file object will be automatically re-opened.
Warning: If the contents of the file are modified during the lifetime of the CorpusView, then the CorpusView’s behavior is undefined.
Warning: If a unicode encoding is specified when constructing a CorpusView, then the block reader may only call stream.seek() with offsets that have been returned by stream.tell(); in particular, calling stream.seek() with relative offsets, or with offsets based on string lengths, may lead to incorrect behavior.
Variables:
- _block_reader – The function used to read a single block from the underlying file stream.
- _toknum – A list containing the token index of each block that has been processed. In particular, _toknum[i] is the token index of the first token in block i. Together with _filepos, this forms a partial mapping between token indices and file positions.
- _filepos – A list containing the file position of each block that has been processed. In particular, _filepos[i] is the file position of the first character in block i. Together with _toknum, this forms a partial mapping between token indices and file positions.
- _stream – The stream used to access the underlying corpus file.
- _len – The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known.
- _eofpos – The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached.
- _cache – A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block.
-
close
()[source]¶ Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view). If the corpus view is accessed after it is closed, it will be automatically re-opened.
-
fileid
¶ The fileid of the file that is accessed by this view.
Type: str or PathPointer
-
nltk.corpus.reader.util.
concat
(docs)[source]¶ Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.
-
nltk.corpus.reader.util.
read_regexp_block
(stream, start_re, end_re=None)[source]¶ Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching start_re or EOF is found.
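The grouping rule can be illustrated on a list of lines. This sketch is a simplified stand-in, not read_regexp_block itself: it processes a whole list rather than one block of a stream, and it drops the line matching end_re from the token, which is an assumption of the sketch.

```python
import re


def group_by_start(lines, start_re, end_re=None):
    """Group lines into tokens that begin at lines matching start_re."""
    tokens, current = [], None
    for line in lines:
        if current is None:
            if re.match(start_re, line):
                current = [line]        # first line of a new token
        elif end_re is not None and re.match(end_re, line):
            tokens.append(current)      # end line closes the token (and is dropped)
            current = None
        elif end_re is None and re.match(start_re, line):
            tokens.append(current)      # next start line closes the token
            current = [line]
        else:
            current.append(line)
    if current is not None:
        tokens.append(current)          # EOF closes the last open token
    return tokens


print(group_by_start(["# a", "1", "2", "# b", "3"], r"#"))
print(group_by_start(["junk", "BEGIN", "x", "END", "BEGIN", "y"], r"BEGIN", r"END"))
```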
-
nltk.corpus.reader.util.
read_sexpr_block
(stream, block_size=16384, comment_char=None)[source]¶ Read a sequence of s-expressions from the stream, and leave the stream’s file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.
If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.
Parameters: - block_size – The default block size for reading. If an s-expression is longer than one block, then more than one block will be read.
- comment_char – A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.)
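The behaviour described above (complete s-expressions, a trailing incomplete one at end of file, and comment lines) can be sketched by counting parenthesis depth. The function below is a simplified illustration on a string, not the stream-based implementation; it ignores block sizes and only recognises parenthesised expressions, not bare atoms.

```python
def split_sexprs(text, comment_char=None):
    """Return the top-level s-expressions in text; a trailing partial one is kept."""
    if comment_char is not None:
        # Drop lines that begin with the comment character (no leading whitespace).
        text = "".join(line for line in text.splitlines(keepends=True)
                       if not line.startswith(comment_char))
    sexprs, depth, start = [], 0, None
    for pos, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = pos             # a new top-level expression opens here
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                sexprs.append(text[start:pos + 1])
                start = None
    if start is not None:
        sexprs.append(text[start:].rstrip())  # incomplete s-expression at EOF
    return sexprs


print(split_sexprs("(a (b c)) (d)\n"))
print(split_sexprs("(a b)\n(c (d"))
```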
nltk.corpus.reader.verbnet module¶
An NLTK interface to the VerbNet verb lexicon
For details about VerbNet see: http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
-
class
nltk.corpus.reader.verbnet.
VerbnetCorpusReader
(root, fileids, wrap_etree=False)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
An NLTK interface to the VerbNet verb lexicon.
From the VerbNet site: “VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), Xtag (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998).”
For details about VerbNet see: http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
-
classids
(lemma=None, wordnetid=None, fileid=None, classid=None)[source]¶ Return a list of the verbnet class identifiers. If a file identifier is specified, then return only the verbnet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only verbnet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified verbnet class.
-
fileids
(vnclass_ids=None)[source]¶ Return a list of fileids that make up this corpus. If vnclass_ids is specified, then return the fileids that make up the specified verbnet class(es).
-
lemmas
(classid=None)[source]¶ Return a list of all verb lemmas that appear in any class, or in the classid if specified.
-
longid
(shortid)[source]¶ Given a short verbnet class identifier (eg ‘37.10’), map it to a long id (eg ‘confess-37.10’). If shortid is already a long id, then return it as-is.
-
pprint
(vnclass)[source]¶ Return a string containing a pretty-printed representation of the given verbnet class.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
pprint_description
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame description.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_frame
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_members
(vnclass, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet class’s member verbs.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
pprint_semantics
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame semantics.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_subclasses
(vnclass, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet class’s subclasses.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
pprint_syntax
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame syntax.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_themroles
(vnclass, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet class’s thematic roles.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
shortid
(longid)[source]¶ Given a long verbnet class identifier (e.g. ‘confess-37.10’), map it to a short id (e.g. ‘37.10’). If longid is already a short id, then return it as-is.
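The relationship between short and long identifiers can be illustrated with a small standalone sketch. It is not the reader's implementation: the regular expression and the `KNOWN_CLASSES` table are assumptions made for illustration, whereas a real reader would derive its table from the VerbNet XML files.

```python
import re

# Hypothetical table of long ids; a real reader would build this from the corpus.
KNOWN_CLASSES = ["confess-37.10", "put-9.1"]

# Assumed shape of a long id: a name, a hyphen, then a dotted/hyphenated number.
LONGID_RE = re.compile(r"^(?P<name>[^\s.]+?)-(?P<num>\d+(?:[.-]\d+)*)$")


def shortid(classid):
    """Return the numeric part of a long id; pass short ids through unchanged."""
    match = LONGID_RE.match(classid)
    return match.group("num") if match else classid


def longid(classid):
    """Return the long id for a short id; pass long ids through unchanged."""
    if LONGID_RE.match(classid):
        return classid
    for known in KNOWN_CLASSES:
        if shortid(known) == classid:
            return known
    raise ValueError("unknown class id: %r" % classid)


print(shortid("confess-37.10"))
print(longid("37.10"))
```

Both helpers are idempotent, which mirrors the "return it as-is" behaviour described for the two methods.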
-
vnclass
(fileid_or_classid)[source]¶ Return an ElementTree containing the xml for the specified verbnet class.
Parameters: fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as 'put-9.1.xml'), or a verbnet class identifier (such as 'put-9.1') or a short verbnet class identifier (such as '9.1').
-
nltk.corpus.reader.wordlist module¶
-
class
nltk.corpus.reader.wordlist.
MWAPPDBCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015).
The original source of the full PPDB corpus can be found on http://www.cis.upenn.edu/~ccb/ppdb/
Returns: a list of tuples of similar lexical terms. -
entries
(fileids='ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs')[source]¶ Returns: a tuple of synonym word pairs.
-
mwa_ppdb_xxxl_file
= 'ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs'¶
-
-
class
nltk.corpus.reader.wordlist.
NonbreakingPrefixesCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
This is a class to read the nonbreaking prefixes textfiles from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses’ word tokenizer.
-
available_langs
= {'lv': 'lv', 'de': 'de', 'tamil': 'ta', 'spanish': 'es', 'hu': 'hu', 'latvian': 'lv', 'slovenian': 'sl', 'sv': 'sv', 'finnish': 'fi', 'ca': 'ca', 'icelandic': 'is', 'french': 'fr', 'greek': 'el', 'english': 'en', 'fr': 'fr', 'slovak': 'sk', 'is': 'is', 'pt': 'pt', 'czech': 'cs', 'pl': 'pl', 'en': 'en', 'polish': 'pl', 'sl': 'sl', 'italian': 'it', 'catalan': 'ca', 'ro': 'ro', 'fi': 'fi', 'portuguese': 'pt', 'el': 'el', 'cs': 'cs', 'sk': 'sk', 'hungarian': 'hu', 'russian': 'ru', 'swedish': 'sv', 'ru': 'ru', 'romanian': 'ro', 'ta': 'ta', 'dutch': 'nl', 'es': 'es', 'nl': 'nl', 'german': 'de', 'it': 'it'}¶
-
words
(lang=None, fileids=None, ignore_lines_startswith='#')[source]¶ This method returns a list of nonbreaking prefixes for the specified language(s).
>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')[:10] == [u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J']
True
>>> nbp.words('ta')[:5] == [u'அ', u'ஆ', u'இ', u'ஈ', u'உ']
True
Returns: a list of words for the specified language(s).
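As a rough illustration of what the `ignore_lines_startswith` parameter suggests, the sketch below reads a word list from a file-like object and skips comment lines. The helper name `read_wordlist` and the decision to drop blank lines are assumptions for this example, not a description of the reader's internals.

```python
import io


def read_wordlist(stream, ignore_lines_startswith="#"):
    """Return one entry per line, skipping blank lines and comment lines."""
    words = []
    for line in stream:
        line = line.strip()
        # Assumption: blank lines carry no entry and are dropped.
        if not line or line.startswith(ignore_lines_startswith):
            continue
        words.append(line)
    return words


# A made-up sample in the spirit of a nonbreaking-prefix file.
sample = io.StringIO("# single letters\nA\nB\n\n# titles\nMr\n")
print(read_wordlist(sample))
```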
-
-
class
nltk.corpus.reader.wordlist.
SwadeshCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶
-
class
nltk.corpus.reader.wordlist.
UnicharsCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
This class is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm
-
available_categories
= ['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'Open_Punctuation']¶
-
chars
(category=None, fileids=None)[source]¶ This method returns a list of characters from the Perl Unicode Properties. They are very useful when porting Perl tokenizers to Python.
>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')[:5] == [u'(', u'[', u'{', u'༺', u'༼']
True
>>> pup.chars('Currency_Symbol')[:5] == [u'$', u'¢', u'£', u'¤', u'¥']
True
>>> pup.available_categories
['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'Open_Punctuation']
Returns: a list of characters given the specific unicode character category
-
nltk.corpus.reader.wordnet module¶
An NLTK interface for WordNet
WordNet is a lexical database of English. Using synsets, it helps find conceptual relationships between words, such as hypernyms, hyponyms, synonyms and antonyms.
For details about WordNet see: http://wordnet.princeton.edu/
This module also allows you to find lemmas in languages other than English from the Open Multilingual Wordnet http://compling.hss.ntu.edu.sg/omw/
-
class
nltk.corpus.reader.wordnet.
Lemma
(wordnet_corpus_reader, synset, name, lexname_index, lex_id, syntactic_marker)[source]¶ Bases:
nltk.corpus.reader.wordnet._WordNetObject
The lexical entry for a single morphological form of a sense-disambiguated word.
Create a Lemma from a “<word>.<pos>.<number>.<lemma>” string where:
- <word> is the morphological stem identifying the synset
- <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
- <number> is the sense number, counting from 0.
- <lemma> is the morphological form of interest
Note that <word> and <lemma> can be different, e.g. the Synset ‘salt.n.03’ has the Lemmas ‘salt.n.03.salt’, ‘salt.n.03.saltiness’ and ‘salt.n.03.salinity’.
Lemma attributes, accessible via methods with the same name:
- name: The canonical name of this lemma.
- synset: The synset that this lemma belongs to.
- syntactic_marker: For adjectives, the WordNet string identifying the syntactic position relative to the modified noun. See: http://wordnet.princeton.edu/man/wninput.5WN.html#sect10 For all other parts of speech, this attribute is None.
- count: The frequency of this lemma in wordnet.
Lemma methods:
Lemmas have the following methods for retrieving related Lemmas. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Lemmas:
- antonyms
- hypernyms, instance_hypernyms
- hyponyms, instance_hyponyms
- member_holonyms, substance_holonyms, part_holonyms
- member_meronyms, substance_meronyms, part_meronyms
- topic_domains, region_domains, usage_domains
- attributes
- derivationally_related_forms
- entailments
- causes
- also_sees
- verb_groups
- similar_tos
- pertainyms
-
unicode_repr
()¶
-
class
nltk.corpus.reader.wordnet.
Synset
(wordnet_corpus_reader)[source]¶ Bases:
nltk.corpus.reader.wordnet._WordNetObject
Create a Synset from a “<lemma>.<pos>.<number>” string where:
- <lemma> is the word’s morphological stem
- <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
- <number> is the sense number, counting from 0.
Synset attributes, accessible via methods with the same name:
- name: The canonical name of this synset, formed using the first lemma of this synset. Note that this may be different from the name passed to the constructor if that string used a different lemma to identify the synset.
- pos: The synset’s part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB.
- lemmas: A list of the Lemma objects for this synset.
- definition: The definition for this synset.
- examples: A list of example strings for this synset.
- offset: The offset in the WordNet dict file of this synset.
- lexname: The name of the lexicographer file containing this synset.
Synset methods:
Synsets have the following methods for retrieving related Synsets. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Synsets.
- hypernyms, instance_hypernyms
- hyponyms, instance_hyponyms
- member_holonyms, substance_holonyms, part_holonyms
- member_meronyms, substance_meronyms, part_meronyms
- attributes
- entailments
- causes
- also_sees
- verb_groups
- similar_tos
Additionally, Synsets support the following methods specific to the hypernym relation:
- root_hypernyms
- common_hypernyms
- lowest_common_hypernyms
Note that Synsets do not support the following relations because these are defined by WordNet as lexical relations:
- antonyms
- derivationally_related_forms
- pertainyms
-
closure
(rel, depth=-1)[source]¶ Return the transitive closure of source under the rel relationship, breadth-first.
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:s.hypernyms()
>>> list(dog.closure(hyp))
[Synset('canine.n.02'), Synset('domestic_animal.n.01'), Synset('carnivore.n.01'), Synset('animal.n.01'), Synset('placental.n.01'), Synset('organism.n.01'), Synset('mammal.n.01'), Synset('living_thing.n.01'), Synset('vertebrate.n.01'), Synset('whole.n.02'), Synset('chordate.n.01'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]
-
common_hypernyms
(other)[source]¶ Find all synsets that are hypernyms of this synset and the other synset.
Parameters: other (Synset) – other input synset. Returns: The synsets that are hypernyms of both synsets.
-
hypernym_distances
(distance=0, simulate_root=False)[source]¶ Get the path(s) from this synset to the root, counting the distance of each node from the initial node on the way. A set of (synset, distance) tuples is returned.
Parameters: distance (int) – the distance (number of edges) from this hypernym to the original hypernym Synset on which this method was called.
Returns: A set of (Synset, int) tuples where each Synset is a hypernym of the first Synset.
-
hypernym_paths
()[source]¶ Get the path(s) from this synset to the root, where each path is a list of the synset nodes traversed on the way to the root.
Returns: A list of lists, where each list gives the node sequence connecting the initial Synset node and a root node.
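The idea of enumerating every path to a root can be sketched on a toy taxonomy. The graph, the node names and the root-first ordering of each path are all assumptions made for this example; the sketch does not use WordNet data.

```python
# Toy hypernym graph (hypothetical names): each node maps to its direct hypernyms.
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["animal"],
    "animal": [],
}


def hypernym_paths(node, hypernyms):
    """Return every path from a root down to node, one list per path."""
    parents = hypernyms.get(node, [])
    if not parents:
        # A node without hypernyms is a root: the only path is the node itself.
        return [[node]]
    paths = []
    for parent in parents:
        for path in hypernym_paths(parent, hypernyms):
            paths.append(path + [node])
    return paths


for path in hypernym_paths("dog", HYPERNYMS):
    print(" > ".join(path))
```

Because "dog" has two direct hypernyms in this toy graph, two distinct paths are returned, which is why the method yields a list of lists rather than a single list.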
-
jcn_similarity
(other, ic, verbose=False)[source]¶ Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
Returns: A float score denoting the similarity of the two Synset objects.
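The Jiang-Conrath equation quoted above can be evaluated directly once the three information content values are known. In this sketch the function name `jcn_score` and the handling of a zero denominator are assumptions, and the IC values are made up.

```python
def jcn_score(ic1, ic2, ic_lcs):
    """Evaluate 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)) for given IC values."""
    denominator = ic1 + ic2 - 2 * ic_lcs
    if denominator == 0:
        # Assumption: identical information content is treated as maximal similarity.
        return float("inf")
    return 1 / denominator


# Made-up IC values: 1 / (5 + 7 - 2 * 4) = 0.25
print(jcn_score(5.0, 7.0, 4.0))
```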
-
lch_similarity
(other, verbose=False, simulate_root=True)[source]¶ Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.
Parameters:
- other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
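The Leacock-Chodorow relationship -log(p/2d) is easy to check numerically. The sketch below assumes that p is counted so that a sense compared with itself has p = 1, which is consistent with the statement that the maximum score depends on the taxonomy depth; the function name `lch_score` is hypothetical.

```python
import math


def lch_score(path_length, taxonomy_depth):
    """Evaluate -log(p / (2 * d)) for path length p and taxonomy depth d."""
    return -math.log(path_length / (2.0 * taxonomy_depth))


# With d = 10, the maximum score (p = 1) is log(20).
print(lch_score(1, 10))
# Longer paths give lower scores.
print(lch_score(5, 10))
```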
-
lin_similarity
(other, ic, verbose=False)[source]¶ Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
Returns: A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
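The Lin equation can likewise be evaluated from three information content values. The function name `lin_score`, the made-up inputs and the choice to return 0.0 when both senses have zero information content are assumptions for this sketch.

```python
def lin_score(ic1, ic2, ic_lcs):
    """Evaluate 2 * IC(lcs) / (IC(s1) + IC(s2)) for given IC values."""
    total = ic1 + ic2
    if total == 0:
        # Assumption: with no information content at all, report no similarity.
        return 0.0
    return 2.0 * ic_lcs / total


# Made-up IC values: 2 * 3 / (5 + 7) = 0.5
print(lin_score(5.0, 7.0, 3.0))
```

When the least common subsumer is as informative as both senses, the score reaches its upper bound of 1.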
-
lowest_common_hypernyms
(other, simulate_root=False, use_min_depth=False)[source]¶ Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False, this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned; if there are multiple such synsets at the same depth, they are all returned.
However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.
By setting the use_min_depth flag to True, the behavior of NLTK2 can be preserved. This was changed in NLTK3 to give more accurate results in a small set of cases, generally with synsets concerning people. (eg: ‘chef.n.01’, ‘fireman.n.01’, etc.)
This method is an implementation of Ted Pedersen’s “Lowest Common Subsumer” method from the Perl Wordnet module. It can return either “self” or “other” if they are a hypernym of the other.
Parameters: - other (Synset) – other input synset
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (False by default) creates a fake root that connects all the taxonomies. Set it to True to enable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will need to be added for nouns as well.
- use_min_depth (bool) – This setting mimics older (v2) behavior of NLTK wordnet. If True, will use the min_depth function to calculate the lowest common hypernyms. This is known to give strange results for some synset pairs (e.g. ‘chef.n.01’, ‘fireman.n.01’) but is retained for backwards compatibility.
Returns: The synsets that are the lowest common hypernyms of both synsets
-
path_similarity
(other, verbose=False, simulate_root=True)[source]¶ Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.
Parameters:
- other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
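One scoring rule that matches the behaviour described here (1 for identity, values approaching 0 for distant senses, None when no path exists) is 1 / (d + 1), where d is the number of edges on the shortest path. The text above does not state the formula, so treat it and the name `path_score` as assumptions.

```python
def path_score(distance):
    """Map a shortest-path distance in edges to a score in (0, 1]."""
    if distance is None:
        # No connecting path: propagate None instead of inventing a score.
        return None
    return 1.0 / (distance + 1)


print(path_score(0))     # identical senses
print(path_score(3))     # three edges apart
print(path_score(None))  # unconnected senses
```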
-
res_similarity
(other, ic, verbose=False)[source]¶ Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
Returns: A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
-
shortest_path_distance
(other, simulate_root=False)[source]¶ Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, None is returned. If a node is compared with itself 0 is returned.
Parameters: other (Synset) – The Synset to which the shortest path will be found. Returns: The number of edges in the shortest path connecting the two nodes, or None if no path exists.
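The procedure described above (record each node's ancestors with their distances, then pick the common ancestor with the smallest combined distance) can be sketched on a toy graph. The graph and helper names are hypothetical, and the sketch does not model the simulate_root option.

```python
# Toy hypernym graph with two unconnected taxonomies (hypothetical names).
HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": [],
    "run": ["move"],
    "move": [],
}


def ancestor_distances(node, hypernyms):
    """Map node and each of its ancestors to the minimum number of edges away."""
    best = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in hypernyms.get(current, []):
                if parent not in best:  # first visit is the shortest (level order)
                    best[parent] = best[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return best


def shortest_path_distance(a, b, hypernyms):
    """Return the fewest edges linking a and b via a common ancestor, or None."""
    dist_a = ancestor_distances(a, hypernyms)
    dist_b = ancestor_distances(b, hypernyms)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    return min(dist_a[n] + dist_b[n] for n in common)


print(shortest_path_distance("dog", "cat", HYPERNYMS))
print(shortest_path_distance("dog", "run", HYPERNYMS))
```

Since each node is included in its own ancestor table at distance 0, comparing a node with itself yields 0, and comparing a node with one of its ancestors yields the direct distance.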
-
tree
(rel, depth=-1, cut_mark=None)[source]¶
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:s.hypernyms()
>>> from pprint import pprint
>>> pprint(dog.tree(hyp))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'),
  [Synset('animal.n.01'),
   [Synset('organism.n.01'),
    [Synset('living_thing.n.01'),
     [Synset('whole.n.02'),
      [Synset('object.n.01'),
       [Synset('physical_entity.n.01'),
        [Synset('entity.n.01')]]]]]]]]]
-
unicode_repr
()¶
-
wup_similarity
(other, verbose=False, simulate_root=True)[source]¶ Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although those for nouns do not always agree.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
Parameters:
- other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
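The description above does not spell out the formula. The commonly cited Wu-Palmer form is 2 * depth(lcs) / (depth(s1) + depth(s2)), and the sketch below evaluates it for given depths. How depths are counted, and the name `wup_score`, are assumptions here; the sketch says nothing about how the LCS is chosen.

```python
def wup_score(depth1, depth2, depth_lcs):
    """Evaluate 2 * depth(lcs) / (depth(s1) + depth(s2)) for given depths."""
    return 2.0 * depth_lcs / (depth1 + depth2)


# Made-up depths: 2 * 3 / (5 + 7) = 0.5
print(wup_score(5, 7, 3))
```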
-
class
nltk.corpus.reader.wordnet.
WordNetCorpusReader
(root, omw_reader)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader used to access wordnet or its variants.
-
ADJ
= 'a'¶
-
ADJ_SAT
= 's'¶
-
ADV
= 'r'¶
-
MORPHOLOGICAL_SUBSTITUTIONS
= {'s': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}¶
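Each entry in this table pairs a suffix with its replacement for one part of speech. The sketch below shows one way such pairs could drive suffix stripping, using the noun ('n') entries only. The `LEXICON` and `EXCEPTIONS` stand-ins are hypothetical, and the loop is a simplified illustration rather than the reader's own morphy implementation.

```python
# Noun entries copied from the table above: (suffix, replacement) pairs.
NOUN_SUBSTITUTIONS = [
    ("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
    ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y"),
]

# Hypothetical stand-ins for the WordNet lexicon and its exception list.
LEXICON = {"dog", "church", "aardwolf", "abacus"}
EXCEPTIONS = {"abaci": "abacus"}


def morphy_noun(form):
    """Return a base form found in LEXICON, or None if nothing matches."""
    if form in EXCEPTIONS:
        return EXCEPTIONS[form]
    if form in LEXICON:
        return form
    candidates = [form]
    while candidates:
        next_round = []
        for candidate in candidates:
            for suffix, replacement in NOUN_SUBSTITUTIONS:
                if candidate.endswith(suffix):
                    stripped = candidate[: len(candidate) - len(suffix)] + replacement
                    if stripped in LEXICON:
                        return stripped
                    # Keep the stripped form so another suffix can be removed later.
                    next_round.append(stripped)
        candidates = next_round
    return None


for word in ["dogs", "churches", "aardwolves", "abaci", "hardrock"]:
    print(word, "->", morphy_noun(word))
```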
-
NOUN
= 'n'¶
-
VERB
= 'v'¶
-
all_lemma_names
(pos=None, lang='eng')[source]¶ Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.
-
all_synsets
(pos=None)[source]¶ Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.
-
citation
(lang='omw')[source]¶ Return the contents of the citation.bib file (for omw). Use lang=lang to get the citation for an individual language.
-
custom_lemmas
(tab_file, lang)[source]¶ Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.
See the “Tab files” section at http://compling.hss.ntu.edu.sg/omw/ for documentation on the Multilingual WordNet tab file format.
Parameters:
- tab_file – Tab file as a file or file-like object.
- lang (str) – ISO 639-3 code of the language of the tab file.
-
ic
(corpus, weight_senses_equally=False, smoothing=1.0)[source]¶ Creates an information content lookup dictionary from a corpus.
Parameters:
- corpus (CorpusReader) – The corpus from which we create an information content dictionary.
- weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
- smoothing (float) – How much to smooth synset counts (default is 1.0).
Returns: An information content dictionary
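The effect of weight_senses_equally and smoothing on the per-sense counts can be sketched with a toy sense inventory. The function `sense_counts`, the word-to-sense table and the sense names are hypothetical, and the sketch ignores any propagation of counts up the taxonomy that a full implementation may perform.

```python
from collections import defaultdict


def sense_counts(words, senses_of, weight_senses_equally=False, smoothing=1.0):
    """Accumulate a smoothed count for every sense of every word occurrence."""
    counts = defaultdict(lambda: smoothing)  # each sense starts at the smoothing value
    for word in words:
        senses = senses_of.get(word, [])
        if not senses:
            continue
        # Either a full count per sense, or one count split across the senses.
        weight = 1.0 if weight_senses_equally else 1.0 / len(senses)
        for sense in senses:
            counts[sense] += weight
    return dict(counts)


# Hypothetical sense inventory: "bank" has 3 senses, "dog" has 1.
SENSES = {"bank": ["bank.1", "bank.2", "bank.3"], "dog": ["dog.1"]}
corpus_words = ["bank", "dog", "bank"]

print(sense_counts(corpus_words, SENSES))
print(sense_counts(corpus_words, SENSES, weight_senses_equally=True))
```

With the default setting, each of the two occurrences of "bank" adds one third to every sense; with equal weighting, each occurrence adds a full count to every sense.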
-
jcn_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
Returns: A float score denoting the similarity of the two Synset objects.
-
lch_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.
Parameters:
- other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
-
lemmas
(lemma, pos=None, lang='eng')[source]¶ Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.
-
license
(lang='eng')[source]¶ Return the contents of LICENSE (for omw). Use lang=lang to get the license for an individual language.
-
lin_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
Returns: A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
-
morphy
(form, pos=None, check_exceptions=True)[source]¶ Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.
>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
-
path_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.
Parameters:
- other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
-
readme
(lang='omw')[source]¶ Return the contents of README (for omw). Use lang=lang to get the readme for an individual language.
-
res_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
Returns: A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
-
synsets
(lemma, pos=None, lang='eng', check_exceptions=True)[source]¶ Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
-
wup_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although those for nouns do not always agree.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
Parameters:
- other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
-
-
exception
nltk.corpus.reader.wordnet.
WordNetError
[source]¶ Bases:
Exception
An exception class for wordnet-related errors.
-
class
nltk.corpus.reader.wordnet.
WordNetICCorpusReader
(root, fileids)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader for the WordNet information content corpus.
-
ic
(icfile)[source]¶ Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.
Parameters: icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”) Returns: An information content dictionary
-
-
nltk.corpus.reader.wordnet.
jcn_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
Returns: A float score denoting the similarity of the two Synset objects.
-
nltk.corpus.reader.wordnet.
lch_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.
Parameters:
- other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
-
nltk.corpus.reader.wordnet.
lin_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
Returns: A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
-
nltk.corpus.reader.wordnet.
path_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.
Parameters: - other (Synset) – The
Synset
that thisSynset
is being compared to. - simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two
Synset
objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if aSynset
is compared with itself.
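The behaviour described above (1 for identity, None when no path exists) can be reproduced on a toy graph. This is a minimal sketch, assuming the score is 1 / (d + 1) where d is the number of edges on the shortest path; that formula, the helper names, and the toy taxonomy are assumptions and are not stated in the text above.

```python
from collections import deque


def shortest_path_length(graph, start, goal):
    """Breadth-first search over an undirected is-a graph; None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_score(graph, a, b):
    """Return 1 / (d + 1), or None when the two nodes are not connected."""
    d = shortest_path_length(graph, a, b)
    return None if d is None else 1.0 / (d + 1)


# Hypothetical taxonomy: "run" sits in a separate, unconnected component.
toy = {
    "animal": ["dog", "cat"],
    "dog": ["animal"],
    "cat": ["animal"],
    "run": [],
}
```

Connecting every component to a shared artificial root, as the simulate_root flag does, would make the None case disappear in this toy graph as well.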
-
nltk.corpus.reader.wordnet.
res_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
Parameters: Returns: A float score denoting the similarity of the two
Synset
objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
-
nltk.corpus.reader.wordnet.
wup_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although those for nouns still do not always agree.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
Parameters: - other (Synset) – The
Synset
that thisSynset
is being compared to. - simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A float score denoting the similarity of the two
Synset
objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
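The text above describes the inputs to the score (the depths of the two senses and of their LCS) without spelling out the arithmetic. A minimal sketch, assuming the usual Wu-Palmer formulation 2 * depth(lcs) / (depth(s1) + depth(s2)); the function name `wup_score` and the sample depths are hypothetical.

```python
def wup_score(depth_s1, depth_s2, depth_lcs):
    """Return 2 * depth(lcs) / (depth(s1) + depth(s2)), or None if undefined."""
    total = depth_s1 + depth_s2
    if total == 0:
        return None
    return 2.0 * depth_lcs / total


# Hypothetical depths: senses at depth 4 and 6 whose LCS sits at depth 3.
example = wup_score(4, 6, 3)
```

A deeper LCS, that is, a more specific shared ancestor, pushes the score towards 1.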
nltk.corpus.reader.xmldocs module¶
Corpus reader for corpora whose documents are xml files.
(note – not named ‘xml’ to avoid conflicting w/ standard xml package)
-
class
nltk.corpus.reader.xmldocs.
XMLCorpusReader
(root, fileids, wrap_etree=False)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for corpora whose documents are xml files.
Note that the
XMLCorpusReader
constructor does not take anencoding
argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.-
words
(fileid=None)[source]¶ Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.
Returns: the given file’s text nodes as a list of words and punctuation symbols Return type: list(str)
-
-
class
nltk.corpus.reader.xmldocs.
XMLCorpusView
(fileid, tagspec, elt_handler=None)[source]¶ Bases:
nltk.corpus.reader.util.StreamBackedCorpusView
A corpus view that selects out specified elements from an XML file, and provides a flat list-like interface for accessing them. (Note: XMLCorpusView is not used by XMLCorpusReader itself, but may be used by subclasses of XMLCorpusReader.)
Every XML corpus view has a “tag specification”, indicating what XML elements should be included in the view; and each (non-nested) element that matches this specification corresponds to one item in the view. Tag specifications are regular expressions over tag paths, where a tag path is a list of element tag names, separated by ‘/’, indicating the ancestry of the element. Some examples:
- 'foo': A top-level element whose tag is foo.
- 'foo/bar': An element whose tag is bar and whose parent is a top-level element whose tag is foo.
- '.*/foo': An element whose tag is foo, appearing anywhere in the xml tree.
- '.*/(foo|bar)': An element whose tag is foo or bar, appearing anywhere in the xml tree.
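The tag specifications listed above can be tried out with Python's `re` module. This is a minimal sketch, assuming a specification has to match the entire tag path; the helper name `matches_tagspec` and the sample paths are hypothetical.

```python
import re


def matches_tagspec(tagspec, tag_path):
    """Return True if the whole tag path matches the tag specification."""
    return re.fullmatch(tagspec, tag_path) is not None


# Tag paths are element names joined by '/', outermost element first.
checks = [
    matches_tagspec("foo", "foo"),                # top-level foo
    matches_tagspec("foo", "doc/foo"),            # nested foo is not top-level
    matches_tagspec("foo/bar", "foo/bar"),        # bar directly under top-level foo
    matches_tagspec(".*/foo", "doc/sec/foo"),     # foo below other elements
    matches_tagspec(".*/(foo|bar)", "doc/bar"),   # either tag below other elements
]
```

Under this assumption the second check is the only one that fails, because 'foo' on its own describes a top-level element only.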
The view items are generated from the selected XML elements via the method
handle_elt()
. By default, this method returns the element as-is (i.e., as an ElementTree object); but it can be overridden, either via subclassing or via theelt_handler
constructor parameter.-
handle_elt
(elt, context)[source]¶ Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
Returns: The view value corresponding to elt.
Parameters:
- elt (ElementTree) – The element that should be converted.
- context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
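To illustrate the (elt, context) calling convention, the sketch below defines a handler of that shape and calls it by hand on a standalone element. The function name `text_of_element` and the sample XML are hypothetical; only the two-argument signature comes from the description above.

```python
import xml.etree.ElementTree as ET


def text_of_element(elt, context):
    """Hypothetical handler: return the context path and the element's text."""
    return context, (elt.text or "").strip()


# Call the handler directly, as a view would for a baz element under foo/bar.
elt = ET.fromstring("<baz> hello </baz>")
result = text_of_element(elt, "foo/bar/baz")
```

A function like this could be supplied through the elt_handler constructor parameter so that the view yields plain values instead of elements.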
nltk.corpus.reader.ycoe module¶
Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. The corpus is distributed by the Oxford Text Archive: http://www.ota.ahds.ac.uk/ It is not included with NLTK.
The YCOE corpus is divided into 100 files, each representing an Old English prose text. Tags used within each text comply with the YCOE standard: http://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm
-
class
nltk.corpus.reader.ycoe.
YCOECorpusReader
(root, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.
-
documents
(fileids=None)[source]¶ Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.
-
-
class
nltk.corpus.reader.ycoe.
YCOEParseCorpusReader
(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.bracket_parse.BracketParseCorpusReader
Specialized version of the standard bracket parse corpus reader that strips out (CODE ...) and (ID ...) nodes.
-
nltk.corpus.reader.ycoe.
documents
= {'coapollo.o3': 'Apollonius of Tyre', 'comart3.o23': 'Martyrology, III', 'coorosiu.o2': 'Orosius', 'cobyrhtf.o3': "Byrhtferth's Manual", 'coprefcath2.o3': "Ælfric's Preface to Catholic Homilies II", 'conicodA': 'Gospel of Nicodemus (A)', 'codocu2.o2': 'Documents 2 (O2)', 'cochronE.o34': 'Anglo-Saxon Chronicle E', 'coaelive.o3': "Ælfric's Lives of Saints", 'copreflives.o3': "Ælfric's Preface to Lives of Saints", 'covinceB': 'Saint Vincent (Bodley 343)', 'coeust': 'Saint Eustace and his companions', 'cochronC': 'Anglo-Saxon Chronicle C', 'coinspolX': "Wulfstan's Institute of Polity (X)", 'conicodC': 'Gospel of Nicodemus (C)', 'conicodD': 'Gospel of Nicodemus (D)', 'colacnu.o23': 'Lacnunga', 'cochronA.o23': 'Anglo-Saxon Chronicle A', 'comart2': 'Martyrology, II', 'coinspolD.o34': "Wulfstan's Institute of Polity (D)", 'coexodusP': 'Exodus (P)', 'colawwllad.o4': 'Laws, William I, Lad', 'coboeth.o2': "Boethius' Consolation of Philosophy", 'colwsigeXa.o34': "Ælfric's Letter to Wulfsige (Xa)", 'coleofri.o4': 'Leofric', 'comargaT': 'Saint Margaret (T)', 'codocu1.o1': 'Documents 1 (O1)', 'colwgeat': "Ælfric's Letter to Wulfgeat", 'coneot': 'Saint Neot', 'coherbar': 'Pseudo-Apuleius, Herbarium', 'coverhom': 'Vercelli Homilies', 'cowulf.o34': "Wulfstan's Homilies", 'cochdrul': 'Chrodegang of Metz, Rule', 'comart1': 'Martyrology, I', 'codocu3.o3': 'Documents 3 (O3)', 'coblick.o23': 'Blickling Homilies', 'codicts.o34': 'Dicts of Cato', 'coverhomL': 'Vercelli Homilies (L)', 'colsigewB': "Ælfric's Letter to Sigeweard (B)", 'cosolilo': "St. Augustine's Soliloquies",
'cosolsat1.o4': 'Solomon and Saturn I', 'colaece.o2': 'Leechdoms', 'colaw6atr.o3': 'Laws, Æthelred VI', 'coepigen.o3': "Ælfric's Epilogue to Genesis", 'colsigewZ.o34': "Ælfric's Letter to Sigeweard (Z)", 'codocu2.o12': 'Documents 2 (O1/O2)', 'cocanedgX': 'Canons of Edgar (X)', 'colwstan2.o3': "Ælfric's Letter to Wulfstan II", 'cobede.o2': "Bede's History of the English Church", 'cosevensl': 'Seven Sleepers', 'cogenesiC': 'Genesis (C)', 'cojames': 'Saint James', 'coprefcura.o2': 'Preface to the Cura Pastoralis', 'cootest.o3': 'Heptateuch', 'coalex.o23': "Alexander's Letter to Aristotle", 'coeuphr': 'Saint Euphrosyne', 'coalcuin': 'Alcuin De virtutibus et vitiis', 'cocanedgD': 'Canons of Edgar (D)', 'comargaC.o34': 'Saint Margaret (C)', 'cosolsat2': 'Solomon and Saturn II', 'coprefcath1.o3': "Ælfric's Preface to Catholic Homilies I", 'coprefgen.o3': "Ælfric's Preface to Genesis", 'cocathom2.o3': "Ælfric's Catholic Homilies II", 'colsigef.o3': "Ælfric's Letter to Sigefyrth", 'cogregdC.o24': "Gregory's Dialogues (C)", 'cochronD': 'Anglo-Saxon Chronicle D', 'covinsal': 'Vindicta Salvatoris', 'coquadru.o23': 'Pseudo-Apuleius, Medicina de quadrupedibus', 'colawger.o34': 'Laws, Gerefa', 'colawaf.o2': 'Laws, Alfred', 'codocu4.o24': 'Documents 4 (O2/O4)', 'colaw5atr.o3': 'Laws, Æthelred V', 'cowsgosp.o3': 'West-Saxon Gospels', 'cochad.o24': 'Saint Chad', 'cobenrul.o3': 'Benedictine Rule', 'coprefsolilo': "Preface to Augustine's Soliloquies", 'cocathom1.o3': "Ælfric's Catholic Homilies I", 'comarvel.o23': 'Marvels of the East', 'cogregdH.o23': "Gregory's Dialogues (H)", 'cochristoph': 'Saint Christopher', 'colawafint.o2': "Alfred's Introduction to Laws", 'coaelhom.o3': 'Ælfric, Supplemental Homilies', 'colaw2cn.o3': 'Laws, Cnut II', 'comary': 'Mary of Egypt', 'coeluc1': 'Honorius of Autun, Elucidarium 1', 'conicodE': 'Gospel of Nicodemus (E)', 'colwstan1.o3': "Ælfric's Letter to Wulfstan I",
'coverhomE': 'Vercelli Homilies (E)', 'cocura.o2': 'Cura Pastoralis', 'colawnorthu.o3': 'Northumbra Preosta Lagu', 'cocuraC': 'Cura Pastoralis (Cotton)', 'coeluc2': 'Honorius of Autun, Elucidarium 1', 'colwsigeT': "Ælfric's Letter to Wulfsige (T)", 'corood': 'History of the Holy Rood-Tree', 'coaugust': 'Augustine', 'coadrian.o34': 'Adrian and Ritheus', 'colawine.ox2': 'Laws, Ine', 'codocu3.o23': 'Documents 3 (O2/O3)', 'cotempo.o3': "Ælfric's De Temporibus Anni", 'colaw1cn.o3': 'Laws, Cnut I'}¶ A list of all documents and their titles in ycoe.
Module contents¶
NLTK corpus readers. The modules in this package provide functions that can be used to read corpus fileids in a variety of formats. These functions can be used to read both the corpus fileids that are distributed in the NLTK corpus package, and corpus fileids that are part of external corpora.
Corpus Reader Functions¶
Each corpus module defines one or more “corpus reader functions”, which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:
- If item is one of the unique identifiers listed in the corpus module’s items variable, then the corresponding document will be loaded from the NLTK corpus package.
- If item is a fileid, then that file will be read.
Additionally, corpus reader functions can be given lists of item names; in which case, they will return a concatenation of the corresponding documents.
Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:
- words(): list of str
- sents(): list of (list of str)
- paras(): list of (list of (list of str))
- tagged_words(): list of (str,str) tuple
- tagged_sents(): list of (list of (str,str))
- tagged_paras(): list of (list of (list of (str,str)))
- chunked_sents(): list of (Tree w/ (str,str) leaves)
- parsed_sents(): list of (Tree with str leaves)
- parsed_paras(): list of (list of (Tree with str leaves))
- xml(): A single xml ElementTree
- raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use
nltk.corpus.brown.words()
:
>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...
-
class
nltk.corpus.reader.
CorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
object
A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its
file identifier
, which is the relative path to the file from the root directory.A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as
words()
(for a list of words) andparsed_sents()
(for a list of parsed sentences). Called with no arguments, these methods will return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such asfileids
orcategories
, which can be used to select which portion of the corpus should be returned.-
abspath
(fileid)[source]¶ Return the absolute path for the given file.
Parameters: fileid (str) – The file identifier for the file whose path should be returned. Return type: PathPointer
-
abspaths
(fileids=None, include_encoding=False, include_fileid=False)[source]¶ Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.
Parameters:
- fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.
- include_encoding – If true, then return a list of (path_pointer, encoding) tuples.
Return type: list(PathPointer)
-
encoding
(file)[source]¶ Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.
-
ensure_loaded
()[source]¶ Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).
-
open
(file)[source]¶ Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.
Parameters: file – The file identifier of the file to read.
-
root
¶ The directory where this corpus is stored.
Type: PathPointer
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.
CategorizedCorpusReader
(kwargs)[source]¶ Bases:
object
A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.
Subclasses are expected to:
- Call __init__() to set up the mapping.
- Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.
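The two lookups that the mixin provides, categories for fileids and fileids for categories, can be pictured with a small stand-alone class. This is a sketch only: `CategoryIndex`, its constructor argument, and the sample mapping are hypothetical and do not reproduce the mixin's actual implementation or constructor options.

```python
class CategoryIndex:
    """Toy two-way lookup between file identifiers and categories."""

    def __init__(self, cat_map):
        # cat_map: {fileid: iterable of category names}
        self._file_to_cats = {f: set(cats) for f, cats in cat_map.items()}

    def categories(self, fileids=None):
        """Return the sorted categories for all files or for the given fileids."""
        if fileids is None:
            fileids = list(self._file_to_cats)
        elif isinstance(fileids, str):
            fileids = [fileids]
        found = set()
        for fileid in fileids:
            found |= self._file_to_cats[fileid]
        return sorted(found)

    def fileids(self, categories=None):
        """Return the sorted fileids, optionally restricted to some categories."""
        if categories is None:
            return sorted(self._file_to_cats)
        if isinstance(categories, str):
            categories = [categories]
        wanted = set(categories)
        return sorted(f for f, cats in self._file_to_cats.items() if cats & wanted)


index = CategoryIndex({
    "a.txt": ["news"],
    "b.txt": ["news", "fiction"],
    "c.txt": ["fiction"],
})
```

A view method that accepts a categories parameter would first translate it into fileids in this way and then read those files as usual.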
-
class
nltk.corpus.reader.
PlaintextCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the
CorpusView
class variable.-
CorpusView
¶ alias of
StreamBackedCorpusView
-
paras
(fileids=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
-
-
class
nltk.corpus.reader.
TaggedCorpusReader
(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator. I.e., words should have the form:
word1/tag1 word2/tag2 word3/tag3 ...
But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.
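The word/tag format described above can be illustrated without NLTK. The sketch below is a pure-Python approximation of the splitting step, assuming the tag follows the last occurrence of the separator; the function names and the sample sentence are hypothetical.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/tag' at the last separator and upper-case the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator present: keep the token and report no tag.
        return token, None
    return word, tag.upper()


def parse_tagged_sentence(line, sep="/"):
    """Split a line on whitespace and convert each token to a (word, tag) pair."""
    return [split_tagged_token(token, sep) for token in line.split()]


tagged = parse_tagged_sentence("the/dt dog/nn barked/vbd")
```

Splitting at the last separator keeps words that themselves contain the separator intact, for example a token such as 1/2/cd.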
-
paras
(fileids=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
-
sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_paras
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag)
tuples.Return type: list(list(list(tuple(str,str))))
-
tagged_sents
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of (word,tag)
tuples.Return type: list(list(tuple(str,str)))
-
-
class
nltk.corpus.reader.
CMUDictCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
-
dict
()[source]¶ Returns: the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.
-
-
class
nltk.corpus.reader.
ConllChunkCorpusReader
(root, fileids, chunk_types, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.conll.ConllCorpusReader
A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
-
class
nltk.corpus.reader.
WordListCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
List of words, one per line. Blank lines are ignored.
-
class
nltk.corpus.reader.
PPAttachmentCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
sentence_id verb noun1 preposition noun2 attachment
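The six-column layout above can be read into a small record type. This is a minimal sketch assuming whitespace-separated fields in the order listed; the `PPAttachmentRecord` name, its field names, and the sample line are hypothetical.

```python
from collections import namedtuple

# Field order follows: sentence_id verb noun1 preposition noun2 attachment
PPAttachmentRecord = namedtuple(
    "PPAttachmentRecord", "sent verb noun1 prep noun2 attachment"
)


def parse_ppattach_line(line):
    """Parse one whitespace-separated line into a PPAttachmentRecord."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return PPAttachmentRecord(*fields)


# Hypothetical input line in the six-column layout.
record = parse_ppattach_line("0 join board as director V")
```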
-
class
nltk.corpus.reader.
ChunkedCorpusReader
(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using
nltk.chunk.tagstr2tree
.-
chunked_paras
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag)
tuples (if the corpus has tags) or word strings (if the corpus has no tags).Return type: list(list(Tree))
-
chunked_sents
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag)
tuples (if the corpus has tags) or word strings (if the corpus has no tags).Return type: list(Tree)
-
chunked_words
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag)
tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over(word,tag)
tuples or word strings.Return type: list(tuple(str,str) and Tree)
-
paras
(fileids=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
-
sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_paras
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag)
tuples.Return type: list(list(list(tuple(str,str))))
-
tagged_sents
(fileids=None, tagset=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of (word,tag)
tuples.Return type: list(list(tuple(str,str)))
-
-
class
nltk.corpus.reader.
SinicaTreebankCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.SyntaxCorpusReader
Reader for the sinica treebank.
-
class
nltk.corpus.reader.
BracketParseCorpusReader
(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.SyntaxCorpusReader
Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.
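To make the bracketed format concrete, the sketch below parses the example string into nested (label, children) tuples and collects its leaves. It is an illustrative stand-alone parser that assumes well-formed input, not the reader's own implementation; `parse_bracketed` and `leaves` are hypothetical names.

```python
import re


def parse_bracketed(text):
    """Parse '(LABEL child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    return parse_node()


def leaves(tree):
    """Return the words at the leaves of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = parse_bracketed("(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))")
```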
-
class
nltk.corpus.reader.
IndianCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
List of words, one per line. Blank lines are ignored.
-
class
nltk.corpus.reader.
ToolboxCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶
-
class
nltk.corpus.reader.
TimitCorpusReader
(root, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:
- timitdic.txt: dictionary of standard transcriptions
- spkrinfo.txt: table of speaker information
In addition, the root directory should contain one subdirectory for each speaker, containing the following files for each utterance:
- <utterance-id>.txt: text content of utterances
- <utterance-id>.wrd: tokenized text content of utterances
- <utterance-id>.phn: phonetic transcription of utterances
- <utterance-id>.wav: utterance sound file
-
fileids
(filetype=None)[source]¶ Return a list of file identifiers for the files that make up this corpus.
Parameters: filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.
-
play
(utterance, start=0, end=None)[source]¶ Play the given audio sample.
Parameters: utterance – The utterance id of the sample to play
-
spkrutteranceids
(speaker)[source]¶ Returns: A list of all utterances associated with a given speaker.
-
transcription_dict
()[source]¶ Returns: A dictionary giving the ‘standard’ transcription for each word.
-
class
nltk.corpus.reader.
YCOECorpusReader
(root, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.
-
documents
(fileids=None)[source]¶ Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.
-
-
class
nltk.corpus.reader.
MacMorphoCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.tagged.TaggedCorpusReader
A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by
self.paras()
andself.tagged_paras()
contains a single sentence.
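The line-oriented format described above can be grouped into sentences with a short loop. This is a sketch under stated assumptions: one word_TAG token per line, a sentence ending at a token whose tag is '.', and hypothetical sample lines; the function name is not part of NLTK.

```python
def read_macmorpho_sentences(lines):
    """Group word_TAG lines into sentences that end at the '.' tag."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, found, tag = line.rpartition("_")
        if not found:
            # Assumption: a line without the separator is an untagged word.
            word, tag = line, ""
        current.append((word, tag.upper()))
        if tag == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample lines, one tagged word per line.
sample = ["Jersei_N", "atinge_V", "média_N", "._.", "Um_ART", "carro_N", "._."]
sentences = read_macmorpho_sentences(sample)
```

Because the corpus carries no paragraph information, wrapping each of these sentences in its own list would give the one-sentence paragraphs mentioned above.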
-
class
nltk.corpus.reader.
SyntaxCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:
__init__
, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files._read_block
, which reads a block from the input stream._word
, which takes a block and returns a list of list of words._tag
, which takes a block and returns a list of list of tagged words._parse
, which takes a block and returns a list of parsed sentences.
-
class
nltk.corpus.reader.
AlpinoCorpusReader
(root, encoding='ISO-8859-1', tagset=None)[source]¶ Bases:
nltk.corpus.reader.bracket_parse.BracketParseCorpusReader
Reader for the Alpino Dutch Treebank. This corpus has a lexical breakdown structure embedded, as read by _parse. Unfortunately, this puts punctuation and some other words out of the sentence order in the xml element tree, which is no good for tag_ and word_. _tag and _word are therefore overridden to pass a new, non-default parameter ‘ordered’ to the overridden _normalize function. The _parse function can then remain untouched.
-
class
nltk.corpus.reader.
RTECorpusReader
(root, fileids, wrap_etree=False)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for corpora in RTE challenges.
This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.
-
class
nltk.corpus.reader.
StringCategoryCorpusReader
(root, fileids, delimiter=' ', encoding='utf8')[source]¶
-
class
nltk.corpus.reader.
EuroparlCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.plaintext.PlaintextCorpusReader
Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from
PlaintextCorpusReader
except that:- Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.
- For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
- There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
- The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
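The splitting rules listed above (chapters at blank lines, sentences at line breaks, words at whitespace) can be sketched in a few lines of plain Python. The function name `read_chapters` and the sample text are hypothetical, and this is not the reader's own code.

```python
import re


def read_chapters(text):
    """Split text into chapters (blank lines), sentences (lines), words (spaces)."""
    chapters = []
    for block in re.split(r"\n\s*\n", text.strip()):
        sentences = [line.split() for line in block.splitlines() if line.strip()]
        if sentences:
            chapters.append(sentences)
    return chapters


# Hypothetical pre-tokenized text: two chapters separated by a blank line.
sample = (
    "Resumption of the session\n"
    "I declare resumed the session .\n"
    "\n"
    "Agenda\n"
    "The next item .\n"
)
chapters = read_chapters(sample)
```

The result has the same nesting as a list of paragraphs: a list of chapters, each a list of sentences, each a list of word strings.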
-
class
nltk.corpus.reader.
CategorizedBracketParseCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.bracket_parse.BracketParseCorpusReader
A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>
-
class
nltk.corpus.reader.
CategorizedTaggedCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.tagged.TaggedCorpusReader
A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
-
class
nltk.corpus.reader.
CategorizedPlaintextCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.plaintext.PlaintextCorpusReader
A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
-
class
nltk.corpus.reader.
PortugueseCategorizedPlaintextCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
-
class
nltk.corpus.reader.
PropbankCorpusReader
(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as
'turn'
or'turn_on'
, each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.-
instances
(baseform=None)[source]¶ Returns: a corpus view that acts as a list of PropBankInstance
objects, one for each verb in the corpus.
-
-
class
nltk.corpus.reader.
VerbnetCorpusReader
(root, fileids, wrap_etree=False)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
An NLTK interface to the VerbNet verb lexicon.
From the VerbNet site: “VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), Xtag (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998).”
For details about VerbNet see: http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
-
classids
(lemma=None, wordnetid=None, fileid=None, classid=None)[source]¶ Return a list of the verbnet class identifiers. If a file identifier is specified, then return only the verbnet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only verbnet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified verbnet class.
-
fileids
(vnclass_ids=None)[source]¶ Return a list of fileids that make up this corpus. If
vnclass_ids
is specified, then return the fileids that make up the specified verbnet class(es).
-
lemmas
(classid=None)[source]¶ Return a list of all verb lemmas that appear in any class, or in the
classid
if specified.
-
longid
(shortid)[source]¶ Given a short verbnet class identifier (eg ‘37.10’), map it to a long id (eg ‘confess-37.10’). If
shortid
is already a long id, then return it as-is.
-
pprint
(vnclass)[source]¶ Return a string containing a pretty-printed representation of the given verbnet class.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
pprint_description
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame description.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_frame
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_members
(vnclass, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet class’s member verbs.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
pprint_semantics
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame semantics.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_subclasses
(vnclass, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet class’s subclasses.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
pprint_syntax
(vnframe, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet frame syntax.
Parameters: vnframe – An ElementTree containing the xml contents of a verbnet frame.
-
pprint_themroles
(vnclass, indent='')[source]¶ Return a string containing a pretty-printed representation of the given verbnet class’s thematic roles.
Parameters: vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
-
shortid
(longid)[source]¶ Given a long verbnet class identifier (eg ‘confess-37.10’), map it to a short id (eg ‘37.10’). If
longid
is already a short id, then return it as-is.
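The two identifier forms differ only in the lemma prefix, so the mapping can be sketched with a regular expression. This is an illustrative stand-in, not the reader's own code: the pattern, the function names `to_short` and `to_long`, and the `known_ids` lookup list are all assumptions made for the example.

```python
import re

# Assumed shape of a long id: a lemma-like prefix, a hyphen, then a
# numeric part such as "37.10". The pattern is illustrative only.
LONG_ID = re.compile(r"^(?P<name>[A-Za-z_]+)-(?P<short>\d[\d.\-]*)$")


def to_short(classid):
    """Strip the lemma prefix; ids that are already short pass through."""
    match = LONG_ID.match(classid)
    return match.group("short") if match else classid


def to_long(classid, known_ids):
    """Find the long id whose short form equals classid.

    known_ids is a hypothetical collection of long ids; ids that are
    already long are returned unchanged.
    """
    if LONG_ID.match(classid):
        return classid
    for candidate in known_ids:
        if to_short(candidate) == classid:
            return candidate
    raise KeyError(classid)
```

Mapping short to long needs a lookup because the short form does not carry the lemma.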
-
vnclass
(fileid_or_classid)[source]¶ Return an ElementTree containing the xml for the specified verbnet class.
Parameters: fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as 'put-9.1.xml'
), or a verbnet class identifier (such as'put-9.1'
) or a short verbnet class identifier (such as'9.1'
).
-
-
class
nltk.corpus.reader.
BNCCorpusReader
(root, fileids, lazy=True)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for the XML version of the British National Corpus.
For access to the complete XML data structure, use the
xml()
method. For access to simple word lists and tagged word lists, use words()
, sents()
, tagged_words()
, and tagged_sents()
. You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554
If you extracted the archive to a directory called BNC, then you can instantiate the reader as:
BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
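To see which files such a `fileids` pattern would pick up, the regular expression can be tried against a few relative paths. The paths below are hypothetical, and the sketch assumes a path is selected only when the whole string matches the pattern.

```python
import re

# The fileids pattern from the instantiation example.
pattern = r'[A-K]/\w*/\w*\.xml'

# Hypothetical paths, relative to the corpus root.
candidates = ['A/A0/A00.xml', 'K/KS/KSW.xml', 'L/L0/L00.xml', 'A/A0/notes.txt']

# Assumption: the pattern must match the entire relative path.
selected = [path for path in candidates if re.fullmatch(pattern, path)]
print(selected)
```

Only paths under top-level directories A through K that end in `.xml` are kept.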
-
sents
(fileids=None, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type: Parameters: - strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
tagged_sents
(fileids=None, c5=False, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of
(word,tag)
tuples.Return type: Parameters: - c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
tagged_words
(fileids=None, c5=False, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples
(word,tag)
.Return type: Parameters: - c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
words
(fileids=None, strip_space=True, stem=False)[source]¶ Returns: the given file(s) as a list of words and punctuation symbols.
Return type: Parameters: - strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
-
class
nltk.corpus.reader.
ConllCorpusReader
(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.Tree'>, tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the
ConllCorpusReader
constructor therefore takes an argument, columntypes
, which is used to specify the columns that are used by a given corpus.
- @todo: Add support for reading from corpora where different parallel files contain different columns.
- @todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (eg words() and parsed_sents() could both share the same grid corpus view object).
- @todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (eg parsed_documents()).
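The grid layout described above can be illustrated without the reader itself. The following is a minimal sketch, not the reader's implementation: `read_conll_grid` is a hypothetical helper, and the sample text is made up for the example.

```python
import re


def read_conll_grid(text, columntypes):
    """Parse CoNLL-style text into a list of sentences.

    Each sentence is a list of dicts keyed by column type; columns
    labelled 'ignore' are dropped.
    """
    sentences = []
    # Sentences are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = []
        for line in block.splitlines():
            fields = line.split()
            if not fields:
                continue
            if len(fields) != len(columntypes):
                raise ValueError("expected %d columns: %r" % (len(columntypes), line))
            rows.append({name: value
                         for name, value in zip(columntypes, fields)
                         if name != "ignore"})
        if rows:
            sentences.append(rows)
    return sentences


# Made-up sample: one word per line, columns are word, POS tag, chunk tag.
sample = """He PRP B-NP
reckons VBZ B-VP

It PRP B-NP
fell VBD B-VP
"""
grid = read_conll_grid(sample, ("words", "pos", "chunk"))
words = [[row["words"] for row in sent] for sent in grid]
print(words)
```

Because the column order is passed in rather than fixed, the same function handles corpora that arrange their annotation columns differently.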
-
CHUNK
= 'chunk'¶
-
COLUMN_TYPES
= ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')¶
-
IGNORE
= 'ignore'¶
-
NE
= 'ne'¶
-
POS
= 'pos'¶
-
SRL
= 'srl'¶
-
TREE
= 'tree'¶
-
WORDS
= 'words'¶
-
iob_sents
(fileids=None, tagset=None)[source]¶ Returns: a list of lists of word/tag/IOB tuples Return type: list(list) Parameters: fileids (None or str or list) – the list of fileids that make up this corpus
-
class
nltk.corpus.reader.
XMLCorpusReader
(root, fileids, wrap_etree=False)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for corpora whose documents are xml files.
Note that the
XMLCorpusReader
constructor does not take an encoding
argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.
-
words
(fileid=None)[source]¶ Returns all of the words and punctuation symbols in the specified file that were in text nodes – ie, tags are ignored. Like the xml() method, fileid can only specify one file.
Returns: the given file’s text nodes as a list of words and punctuation symbols Return type: list(str)
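Collecting only the text nodes of a document can be sketched with the standard library. This is an approximation for illustration: `xml_words` is a hypothetical helper, and the tokenization pattern (runs of word characters, or runs of punctuation) is an assumption rather than the reader's documented behaviour.

```python
import re
import xml.etree.ElementTree as ET


def xml_words(xml_string):
    """Return words and punctuation found in text nodes; tags are ignored."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of every node in document order.
    text = " ".join(root.itertext())
    # Assumed tokenization: word-character runs or punctuation runs.
    return re.findall(r"\w+|[^\w\s]+", text)


doc = "<doc><p>Hello, <b>world</b>!</p><p>Bye.</p></doc>"
tokens = xml_words(doc)
print(tokens)
```

Joining text nodes with a space keeps words from adjacent elements apart, at the cost of splitting a word that an inline tag happens to interrupt.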
-
-
class
nltk.corpus.reader.
WordNetCorpusReader
(root, omw_reader)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader used to access wordnet or its variants.
-
ADJ
= 'a'¶
-
ADJ_SAT
= 's'¶
-
ADV
= 'r'¶
-
MORPHOLOGICAL_SUBSTITUTIONS
= {'s': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}¶
-
NOUN
= 'n'¶
-
VERB
= 'v'¶
-
all_lemma_names
(pos=None, lang='eng')[source]¶ Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.
-
all_synsets
(pos=None)[source]¶ Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.
-
citation
(lang='omw')[source]¶ Return the contents of the citation.bib file (for omw). Use lang=lang to get the citation for an individual language.
-
custom_lemmas
(tab_file, lang)[source]¶ Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.
See the “Tab files” section at http://compling.hss.ntu.edu.sg/omw/ for documentation on the Multilingual WordNet tab file format.
Parameters: - tab_file – Tab file as a file or file-like object
- lang (str) – ISO 639-3 code of the language of the tab file
-
ic
(corpus, weight_senses_equally=False, smoothing=1.0)[source]¶ Creates an information content lookup dictionary from a corpus.
Parameters: - corpus (CorpusReader) – The corpus from which we create an information content dictionary.
- weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
- smoothing (float) – How much do we smooth synset counts (default is 1.0)
Returns: An information content dictionary
-
jcn_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
Parameters: Returns: A float score denoting the similarity of the two
Synset
objects.
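The equation can be written out directly. This is a sketch of the formula only: it takes precomputed information content values rather than synsets, and returning infinity when the denominator is zero is an assumption made here to avoid a division error.

```python
def jcn(ic_s1, ic_s2, ic_lcs):
    """Jiang-Conrath score: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."""
    denominator = ic_s1 + ic_s2 - 2 * ic_lcs
    if denominator == 0:
        # Assumption: treat identical information content as maximal similarity.
        return float("inf")
    return 1.0 / denominator


print(jcn(5.0, 7.0, 4.0))
```

The score grows as the information content of the least common subsumer approaches that of the two senses.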
-
lch_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.
Parameters: - other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
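The relationship -log(p/2d) is easy to check numerically. The sketch below takes the path length and taxonomy depth as plain numbers. How the reader counts `p` (nodes or edges) is not stated here, so the function simply assumes `p` is positive.

```python
import math


def lch(path_length, taxonomy_depth):
    """Leacock-Chodorow score: -log(p / (2 * d)), assuming p > 0."""
    return -math.log(path_length / (2.0 * taxonomy_depth))


# A shorter path gives a higher score for the same taxonomy depth.
print(lch(1, 10), lch(4, 10))
```

For a fixed shortest path, a deeper taxonomy yields a larger score, which is why the maximum score depends on the taxonomy depth.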
-
lemmas
(lemma, pos=None, lang='eng')[source]¶ Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.
-
license
(lang='eng')[source]¶ Return the contents of LICENSE (for omw). Use lang=lang to get the license for an individual language.
-
lin_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
Parameters: Returns: A float score denoting the similarity of the two
Synset
objects, in the range 0 to 1.
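As a sketch of the equation, with information content values passed in directly rather than looked up from synsets (returning 0.0 for a zero denominator is an assumption of this example):

```python
def lin(ic_s1, ic_s2, ic_lcs):
    """Lin score: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    total = ic_s1 + ic_s2
    if total == 0:
        # Assumption: no information content means no measurable similarity.
        return 0.0
    return 2.0 * ic_lcs / total


print(lin(5.0, 7.0, 4.0))
```

The least common subsumer is never more informative than either sense, so the ratio stays at or below 1 and equals 1 only when all three values coincide.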
-
morphy
(form, pos=None, check_exceptions=True)[source]¶ Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.
>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
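The lookup strategy can be sketched with the noun rules from MORPHOLOGICAL_SUBSTITUTIONS. This is a simplified illustration: `LEXICON` and `EXCEPTIONS` are tiny hypothetical stand-ins for WordNet's data, and the function applies a single pass of rules instead of stripping affixes recursively.

```python
# Noun rules copied from MORPHOLOGICAL_SUBSTITUTIONS['n'].
NOUN_RULES = [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
              ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
              ('men', 'man'), ('ies', 'y')]

# Hypothetical stand-ins for WordNet's lemma list and exception list.
LEXICON = {'dog', 'church', 'aardwolf', 'abacus', 'book'}
EXCEPTIONS = {'abaci': 'abacus'}


def base_form(form):
    """Return a base form for a noun, or None if nothing in LEXICON fits."""
    # Irregular forms are resolved from the exception list first.
    if form in EXCEPTIONS:
        return EXCEPTIONS[form]
    if form in LEXICON:
        return form
    # Try each suffix substitution and keep the first known result.
    for old, new in NOUN_RULES:
        if form.endswith(old):
            candidate = form[:len(form) - len(old)] + new
            if candidate in LEXICON:
                return candidate
    return None


for word in ('dogs', 'churches', 'aardwolves', 'abaci', 'hardrock'):
    print(word, base_form(word))
```

Checking every candidate against the lexicon is what stops a rule such as `('s', '')` from turning 'churches' into the non-word 'churche'.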
-
path_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.
Parameters: - other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
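The behaviour described above (1 for identity, None when no path exists) can be reproduced on a toy taxonomy. The sketch assumes the usual definition of the score as 1 / (shortest path length + 1) and a hypothetical `PARENT` map in which every node has at most one hypernym. Real WordNet synsets can have several.

```python
# Toy is-a taxonomy: child -> parent. 'run' sits in a separate taxonomy.
PARENT = {'entity': None, 'animal': 'entity', 'dog': 'animal',
          'cat': 'animal', 'run': None}


def ancestor_distances(node, parent):
    """Map node and each of its ancestors to the distance in edges."""
    distances = {}
    steps = 0
    while node is not None:
        distances[node] = steps
        node = parent.get(node)
        steps += 1
    return distances


def path_similarity(a, b, parent):
    """Return 1 / (shortest path + 1), or None if a and b are unconnected."""
    dist_a = ancestor_distances(a, parent)
    dist_b = ancestor_distances(b, parent)
    shared = set(dist_a) & set(dist_b)
    if not shared:
        return None
    shortest = min(dist_a[n] + dist_b[n] for n in shared)
    return 1.0 / (shortest + 1)


print(path_similarity('dog', 'cat', PARENT))
```

Adding a fake root that every taxonomy hangs from, as `simulate_root` does, would make the `None` case disappear, because every pair of nodes would then share at least that root.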
-
readme
(lang='omw')[source]¶ Return the contents of README (for omw). Use lang=lang to get the readme for an individual language.
-
res_similarity
(synset1, synset2, ic, verbose=False)[source]¶ Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
Parameters: Returns: A float score denoting the similarity of the two
Synset
objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
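The zero score for a root LCS follows from how information content is usually defined: IC(c) = -log P(c), where P(c) is the probability of encountering concept c or any of its descendants. That definition is not spelled out in this section; the sketch below assumes it, and the counts are made up.

```python
import math


def information_content(count, total):
    """IC = -log(count / total), written as log(total / count)."""
    return math.log(total / count)


# Made-up counts: the root subsumes every occurrence in the corpus.
total = 1000
print(information_content(total, total))  # root concept
print(information_content(10, total))     # a more specific concept
```

Every occurrence counts towards the root, so its probability is 1 and its information content is 0. Rarer, more specific concepts score higher.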
-
synsets
(lemma, pos=None, lang='eng', check_exceptions=True)[source]¶ Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
-
wup_similarity
(synset1, synset2, verbose=False, simulate_root=True)[source]¶ Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although those for nouns still do not always agree.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
Parameters: - other (Synset) – The Synset that this Synset is being compared to.
- simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns: A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
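A toy taxonomy shows how the depths combine. The sketch assumes the standard Wu-Palmer formulation, 2 * depth(lcs) / (depth(s1) + depth(s2)), with the root at depth 1. The `PARENT` map is hypothetical and gives each node a single hypernym, so the multiple-path cases discussed above do not arise.

```python
# Toy is-a taxonomy: child -> parent.
PARENT = {'entity': None, 'animal': 'entity', 'dog': 'animal',
          'cat': 'animal', 'table': 'entity', 'run': None}


def lineage(node, parent):
    """Return the list of nodes from node up to its root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain


def wup(a, b, parent):
    """Return 2 * depth(lcs) / (depth(a) + depth(b)), or None if unconnected."""
    line_a = lineage(a, parent)
    line_b = lineage(b, parent)
    ancestors_b = set(line_b)
    # Walking up from a, the first shared node is the deepest common ancestor.
    lcs = next((n for n in line_a if n in ancestors_b), None)
    if lcs is None:
        return None
    # Depth counts nodes on the path to the root, so the root has depth 1.
    depth_lcs = len(lineage(lcs, parent))
    return 2.0 * depth_lcs / (len(line_a) + len(line_b))


print(wup('dog', 'cat', PARENT), wup('dog', 'table', PARENT))
```

Senses that meet only at the root still get a positive score, which is why the result is normally greater than zero whenever the senses are connected.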
-
-
class
nltk.corpus.reader.
WordNetICCorpusReader
(root, fileids)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader for the WordNet information content corpus.
-
ic
(icfile)[source]¶ Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.
Parameters: icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”) Returns: An information content dictionary
-
-
class
nltk.corpus.reader.
DependencyCorpusReader
(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block>)[source]¶
-
class
nltk.corpus.reader.
NombankCorpusReader
(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as
'turn'
or 'turn_on'
, each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.-
instances
(baseform=None)[source]¶ Returns: a corpus view that acts as a list of NombankInstance
objects, one for each noun in the corpus.
-
lines
()[source]¶ Returns: a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.
-
-
class
nltk.corpus.reader.
IPIPANCorpusReader
(root, fileids)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Corpus reader designed to work with corpus created by IPI PAN. See http://korpus.pl/en/ for more details about IPI PAN corpus.
The corpus includes information about text domain, channel and categories. You can access possible values using
domains()
, channels()
and categories()
. You can also use this metadata to filter files, e.g.: fileids(channel='prasa')
, fileids(categories='publicystyczny')
. The reader supports methods: words, sents, paras and their tagged versions. You can get part of speech instead of full tag by giving “simplify_tags=True” parameter, e.g.:
tagged_sents(simplify_tags=True)
. You can also get all disambiguated tags by specifying parameter “one_tag=False”, e.g.:
tagged_paras(one_tag=False)
. You can get all tags that were assigned by a morphological analyzer by specifying parameter “disamb_only=False”, e.g.
tagged_words(disamb_only=False)
. The IPIPAN Corpus contains tags indicating if there is a space between two tokens. To add special “no space” markers, you should specify parameter “append_no_space=True”, e.g.
tagged_words(append_no_space=True)
. As a result, wherever there should be no space between two tokens, a new pair (‘’, ‘no-space’) will be inserted (for tagged data) and just ‘’ for methods without tags. The corpus reader can also try to append spaces between words. To enable this option, specify parameter “append_space=True”, e.g.
words(append_space=True)
. As a result either ‘ ‘ or (‘ ‘, ‘space’) will be inserted between tokens. By default, XML entities like &quot; and &amp; are replaced by the corresponding characters. You can turn off this feature by specifying parameter “replace_xmlentities=False”, e.g.
words(replace_xmlentities=False)
.
-
class
nltk.corpus.reader.
Pl196xCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.xmldocs.XMLCorpusReader
-
head_len
= 2770¶
-
textids
(fileids=None, categories=None)[source]¶ In the pl196x corpus each category is stored in single file and thus both methods provide identical functionality. In order to accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks—giving much more control to the user.
-
-
class
nltk.corpus.reader.
TEICorpusView
(corpus_file, tagged, group_by_sent, group_by_para, tagset=None, head_len=0, textids=None)[source]¶
-
class
nltk.corpus.reader.
KNBCorpusReader
(root, fileids, encoding='utf8', morphs2str=<function <lambda>>)[source]¶ Bases:
nltk.corpus.reader.api.SyntaxCorpusReader
This class implements:
- __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
- _read_block, which reads a block from the input stream.
- _word, which takes a block and returns a list of lists of words.
- _tag, which takes a block and returns a list of lists of tagged words.
- _parse, which takes a block and returns a list of parsed sentences.
The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others ...)
>>> from nltk.corpus.util import LazyCorpusLoader
>>> knbc = LazyCorpusLoader(
...     'knbc/corpus1',
...     KNBCorpusReader,
...     r'.*/KN.*',
...     encoding='euc-jp',
... )
>>> len(knbc.sents()[0])
9
-
class
nltk.corpus.reader.
ChasenCorpusReader
(root, fileids, encoding='utf8', sent_splitter=None)[source]¶
-
class
nltk.corpus.reader.
CHILDESCorpusReader
(root, fileids, lazy=True)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at
http://childes.psy.cmu.edu/
. The XML version of CHILDES is located at http://childes.psy.cmu.edu/data-xml/
. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/
). For access to the file text use the usual nltk functions,
words()
, sents()
, tagged_words()
and tagged_sents()
.
-
MLU
(fileids=None, speaker='CHI')[source]¶ Returns: the mean length of utterance (MLU) of the given speaker in the given file(s), as a floating-point number Return type: list(float)
-
age
(fileids=None, speaker='CHI', month=False)[source]¶ Returns: the age of the given speaker in the given file(s), as a string or int Return type: list or int Parameters: month – If true, return the age in months instead of year-month-date
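Converting a year-month-date age to months can be sketched as follows. The example rests on assumptions: that ages are stored as duration-like strings such as 'P2Y6M14D', and that more than 15 days rounds up to the next month. `age_in_months` is a hypothetical helper, not part of the reader.

```python
import re


def age_in_months(age):
    """Convert an age string such as 'P2Y6M14D' to whole months."""
    match = re.fullmatch(r"P(\d+)Y(?:(\d+)M)?(?:(\d+)D)?", age)
    if match is None:
        raise ValueError("unrecognised age string: %r" % age)
    years, months, days = (int(part) if part else 0 for part in match.groups())
    total = years * 12 + months
    # Assumed rounding rule: more than half a month counts as a full month.
    if days > 15:
        total += 1
    return total


print(age_in_months('P2Y6M14D'))
```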
-
childes_url_base
= 'http://childes.psy.cmu.edu/browser/index.php?url='¶
-
corpus
(fileids=None)[source]¶ Returns: the given file(s) as a dict of (corpus_property_key, value)
Return type: list(dict)
-
participants
(fileids=None)[source]¶ Returns: the given file(s) as a dict of (participant_property_key, value)
Return type: list(dict)
-
sents
(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type: Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of
(str,pos,relation_list)
. If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
tagged_sents
(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of
(word,tag)
tuples.Return type: Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of
(str,pos,relation_list)
. If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
tagged_words
(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples
(word,tag)
.Return type: Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of (stem, index, dependent_index)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
webview_file
(fileid, urlbase=None)[source]¶ Map a corpus file to its web version on the CHILDES website, and open it in a web browser.
The complete URL to be used is: childes.childes_url_base + urlbase + fileid.replace(‘.xml’, ‘.cha’)
If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???
The function first looks (as a special case) if “Eng-USA” is on the path consisting of <corpus root>+fileid; then if “childes”, possibly followed by “data-xml”, appears. If neither one is found, we use the unmodified fileid and hope for the best. If this is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase=’Eng-USA/Cornell’.
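The URL construction can be sketched as plain string handling. `web_url` is a hypothetical helper and 'sample.xml' a made-up file name. Joining `urlbase` and the fileid with a single '/' is an assumption, made so that a `urlbase` such as 'Eng-USA/Cornell' works with or without a trailing slash.

```python
# Value of the childes_url_base attribute documented above.
CHILDES_URL_BASE = 'http://childes.psy.cmu.edu/browser/index.php?url='


def web_url(fileid, urlbase):
    """Build the browser URL for a corpus file under an explicit urlbase."""
    # Assumption: exactly one '/' separates urlbase from the file name.
    path = urlbase.rstrip('/') + '/' + fileid.replace('.xml', '.cha')
    return CHILDES_URL_BASE + path


print(web_url('sample.xml', 'Eng-USA/Cornell'))
```

Passing `urlbase` explicitly bypasses the path guessing entirely, which is the safer choice when the local corpus does not mirror the folder hierarchy of the website.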
-
words
(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]¶ Returns: the given file(s) as a list of words
Return type: Parameters: - speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
- stem – If true, then use word stems instead of word strings.
- relation – If true, then return tuples of (stem, index, dependent_index)
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
-
-
class
nltk.corpus.reader.
AlignedCorpusReader
(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
-
aligned_sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of AlignedSent objects. Return type: list(AlignedSent)
-
-
class
nltk.corpus.reader.
TimitTaggedCorpusReader
(*args, **kwargs)[source]¶ Bases:
nltk.corpus.reader.tagged.TaggedCorpusReader
A corpus reader for tagged sentences that are included in the TIMIT corpus.
-
class
nltk.corpus.reader.
LinThesaurusCorpusReader
(root, badscore=0.0)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.
-
scored_synonyms
(ngram, fileid=None)[source]¶ Returns a list of scored synonyms (tuples of synonyms and scores) for the current ngram
Parameters: - ngram (C{string}) – ngram to lookup
- fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns: If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.
-
similarity
(ngram1, ngram2, fileid=None)[source]¶ Returns the similarity score for two ngrams.
Parameters: - ngram1 (C{string}) – first ngram to compare
- ngram2 (C{string}) – second ngram to compare
- fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns: If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.
-
synonyms
(ngram, fileid=None)[source]¶ Returns a list of synonyms for the current ngram.
Parameters: - ngram (C{string}) – ngram to lookup
- fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.
Returns: If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.
-
-
class
nltk.corpus.reader.
SemcorCorpusReader
(root, fileids, wordnet, lazy=True)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the
xml()
method. For access to simple word lists and tagged word lists, use words()
, sents()
, tagged_words()
, and tagged_sents()
.
-
chunk_sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of chunks. Return type: list(list(list(str)))
-
chunks
(fileids=None)[source]¶ Returns: the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit. Return type: list(list(str))
-
sents
(fileids=None)[source]¶ Returns: the given file(s) as a list of sentences, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_chunks
(fileids=None, tag='pos')[source]¶ Returns: the given file(s) as a list of tagged chunks, represented in tree form. Return type: list(Tree) Parameters: tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
-
tagged_sents
(fileids=None, tag='pos')[source]¶ Returns: the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form). Return type: list(list(Tree)) Parameters: tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
-
-
class
nltk.corpus.reader.
FramenetCorpusReader
(root, fileids)[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
A corpus reader for the Framenet Corpus.
>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
-
annotations
(luNamePattern=None, exemplars=True, full_text=True)[source]¶ Frame annotation sets matching the specified criteria.
-
doc
(fn_docid)[source]¶ Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the docs_metadata() function.
The dict that is returned from this function will contain the following keys:
‘_type’ : ‘fulltextannotation’
- ‘sentence’ : a list of sentences in the document
- Each item in the list is a dict containing the following keys:
‘ID’ : the ID number of the sentence
‘_type’ : ‘sentence’
‘text’ : the text of the sentence
‘paragNo’ : the paragraph number
‘sentNo’ : the sentence number
‘docID’ : the document ID number
‘corpID’ : the corpus ID number
‘aPos’ : the annotation position
- ‘annotationSet’ : a list of annotation layers for the sentence
- Each item in the list is a dict containing the following keys:
‘ID’ : the ID number of the annotation set
‘_type’ : ‘annotationset’
‘status’ : either ‘MANUAL’ or ‘UNANN’
‘luName’ : (only if status is ‘MANUAL’)
‘luID’ : (only if status is ‘MANUAL’)
‘frameID’ : (only if status is ‘MANUAL’)
‘frameName’: (only if status is ‘MANUAL’)
- ‘layer’ : a list of labels for the layer
Each item in the layer is a dict containing the following keys:
- ‘_type’: ‘layer’
- ‘rank’
- ‘name’
- ‘label’ : a list of labels in the layer
- Each item is a dict containing the following keys:
- ‘start’
- ‘end’
- ‘name’
- ‘feID’ (optional)
Parameters: fn_docid (int) – The Framenet id number of the document Returns: Information about the annotated document Return type: dict
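The nesting described above (sentence, then annotationSet, then layer, then label) can be walked with a few loops. The following is a minimal sketch, not part of NLTK: iter_labels and sample_doc are hypothetical names, and the sample dict is a hand-written stand-in that only mirrors the documented keys.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

Label = Tuple[int, str, str, Optional[int], Optional[int]]


def iter_labels(doc: Dict[str, Any]) -> Iterator[Label]:
    """Yield (sentence ID, layer name, label name, start, end) for every label."""
    for sent in doc.get("sentence", []):
        for aset in sent.get("annotationSet", []):
            for layer in aset.get("layer", []):
                for label in layer.get("label", []):
                    # 'start' and 'end' are read with .get() in case a label
                    # carries no character offsets (an assumption of this sketch).
                    yield (sent["ID"], layer["name"], label["name"],
                           label.get("start"), label.get("end"))


# Hand-written stand-in for the dict returned by doc(); values are illustrative.
sample_doc = {
    "_type": "fulltextannotation",
    "sentence": [
        {
            "ID": 1,
            "_type": "sentence",
            "text": "Michelle baked the potatoes .",
            "annotationSet": [
                {
                    "ID": 10,
                    "_type": "annotationset",
                    "status": "MANUAL",
                    "layer": [
                        {
                            "_type": "layer",
                            "rank": 1,
                            "name": "FE",
                            "label": [{"start": 0, "end": 7, "name": "Cook"}],
                        }
                    ],
                }
            ],
        }
    ],
}

labels = list(iter_labels(sample_doc))
print(labels)
```

With a real corpus installed, the same function could presumably be applied to the value returned by doc(), since it only relies on the keys listed above.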
-
docs
(name=None)[source]¶ Return a list of the annotated full-text documents in FrameNet, optionally filtered by a regex to be matched against the document name.
-
docs_metadata
(name=None)[source]¶ Return an index of the annotated documents in Framenet.
Details for a specific annotated document can be obtained by passing the value of the ‘ID’ field to this class’s doc() function.
>>> from nltk.corpus import framenet as fn
>>> len(fn.docs()) in (78, 107) # FN 1.5 and 1.7, resp.
True
>>> set([x.corpname for x in fn.docs_metadata()])>=set(['ANC', 'KBEval', 'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank'])
True
Parameters: name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”. Returns: A list of selected (or all) annotated documents Return type: list of dicts, where each dict object contains the following keys: - ‘name’
- ‘ID’
- ‘corpid’
- ‘corpname’
- ‘description’
- ‘filename’
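The file-name convention described for the name parameter (corpus name, two underscores, document name) can be illustrated without the corpus installed. This is a small sketch; split_doc_filename is a hypothetical helper, not an NLTK function.

```python
import re
from typing import Tuple


def split_doc_filename(filename: str) -> Tuple[str, str]:
    """Split 'corpus__document' into (corpus name, document name)."""
    corpus, sep, document = filename.partition("__")
    if not sep:
        raise ValueError("expected '<corpus>__<document>', got %r" % filename)
    return corpus, document


filename = "LUCorpus-v0.3__20000410_nyt-NEW.xml"
corpus_name, doc_name = split_doc_filename(filename)
print(corpus_name, doc_name)

# A pattern like this one is the kind of regular expression that could be
# passed as 'name' to select documents from a single corpus.
matches_corpus = re.search(r"^LUCorpus", filename) is not None
```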
-
exemplars
(luNamePattern=None, frame=None, fe=None, fe2=None)[source]¶ Lexicographic exemplar sentences, optionally filtered by LU name and/or 1-2 FEs that are realized overtly. ‘frame’ may be a name pattern, frame ID, or frame instance. ‘fe’ may be a name pattern or FE instance; if specified, ‘fe2’ may also be specified to retrieve sentences with both overt FEs (in either order).
-
fe_relations
()[source]¶ Obtain a list of frame element relations.
>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels) in (10020, 12393) # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(ferels[0], breakLines=True)
{'ID': 14642,
'_type': 'ferelation',
'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>,
'subFE': <fe ID=11370 name=Degree>,
'subFEName': 'Degree',
'subFrame': <frame ID=1904 name=Lively_place>,
'subID': 11370,
'supID': 2271,
'superFE': <fe ID=2271 name=Degree>,
'superFEName': 'Degree',
'superFrame': <frame ID=262 name=Abounding_with>,
'type': <framerelationtype ID=1 name=Inheritance>}
Returns: A list of all of the frame element relations in framenet Return type: list(dict)
-
fes
(name=None, frame=None)[source]¶ Lists frame element objects. If ‘name’ is provided, this is treated as a case-insensitive regular expression to filter by frame element name. (Case-insensitivity is because casing of frame element names is not always consistent across frames.) Specify ‘frame’ to filter by a frame name pattern, ID, or object.
>>> from nltk.corpus import framenet as fn
>>> fn.fes('Noise_maker')
[<fe ID=6043 name=Noise_maker>]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound')])
[('Cause_to_make_noise', 'Sound_maker'), ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source'), ('Sound_movement', 'Location_of_sound_source'),
 ('Sound_movement', 'Sound'), ('Sound_movement', 'Sound_source'),
 ('Sounds', 'Component_sound'), ('Sounds', 'Location_of_sound_source'),
 ('Sounds', 'Sound_source'), ('Vocalizations', 'Location_of_sound_source'),
 ('Vocalizations', 'Sound_source')]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound',r'(?i)make_noise')])
[('Cause_to_make_noise', 'Sound_maker'),
 ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source')]
>>> sorted(set(fe.name for fe in fn.fes('^sound')))
['Sound', 'Sound_maker', 'Sound_source']
>>> len(fn.fes('^sound$'))
2
Parameters: name (str) – A regular expression pattern used to match against frame element names. If ‘name’ is None, then a list of all frame elements will be returned. Returns: A list of matching frame elements Return type: list(AttrDict)
-
frame
(fn_fid_or_fname, ignorekeys=[])[source]¶ Get the details for the specified Frame using the frame’s name or id number.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation')
frame (1494): Imposing_obligation...
The dict that is returned from this function will contain the following information about the Frame:
‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)
‘definition’ : textual definition of the Frame
‘ID’ : the internal ID number of the Frame
- ‘semTypes’ : a list of semantic types for this frame
- Each item in the list is a dict containing the following keys:
- ‘name’ : can be used with the semtype() function
- ‘ID’ : can be used with the semtype() function
- ‘lexUnit’ : a dict containing all of the LUs for this frame.
The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)
- ‘FE’ : a dict containing the Frame Elements that are part of this frame
The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys
- ‘definition’ : The definition of the FE
- ‘name’ : The name of the FE e.g. ‘Body_system’
- ‘ID’ : The id number
- ‘_type’ : ‘fe’
- ‘abbrev’ : Abbreviation e.g. ‘bod’
- ‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”
- ‘semType’ : if not None, a dict with the following two keys:
- ‘name’ : name of the semantic type. can be used with
- the semtype() function
- ‘ID’ : id number of the semantic type. can be used with
- the semtype() function
- ‘requiresFE’ : if not None, a dict with the following two keys:
- ‘name’ : the name of another FE in this frame
- ‘ID’ : the id of the other FE in this frame
- ‘excludesFE’ : if not None, a dict with the following two keys:
- ‘name’ : the name of another FE in this frame
- ‘ID’ : the id of the other FE in this frame
‘frameRelation’ : a list of objects describing frame relations
- ‘FEcoreSets’ : a list of Frame Element core sets for this frame
- Each item in the list is a list of FE objects
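Parameters: - fn_fid_or_fname (int or str) – The Framenet name or id number of the frame
- ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns: Information about a frame
Return type: dict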
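As an illustration of the ‘FE’ structure described above, the sketch below groups frame element names by their ‘coreType’. It is only a sketch: fes_by_core_type is a hypothetical helper, and sample_frame is a hand-written stand-in whose frame and FE names are invented for illustration rather than copied from FrameNet.

```python
from typing import Any, Dict, List


def fes_by_core_type(frame: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group the names of a frame's FEs by their 'coreType' value."""
    grouped: Dict[str, List[str]] = {}
    for fe_name, fe in frame["FE"].items():
        grouped.setdefault(fe["coreType"], []).append(fe_name)
    # Sort each group so the result does not depend on dict ordering.
    return {core_type: sorted(names) for core_type, names in grouped.items()}


# Hypothetical frame dict that mirrors the documented keys only.
sample_frame = {
    "name": "Example_frame",
    "ID": 0,
    "FE": {
        "Agent_like": {"name": "Agent_like", "_type": "fe", "coreType": "Core"},
        "Theme_like": {"name": "Theme_like", "_type": "fe", "coreType": "Core"},
        "Place_like": {"name": "Place_like", "_type": "fe", "coreType": "Peripheral"},
    },
}

grouped_fes = fes_by_core_type(sample_frame)
print(grouped_fes)
```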
-
frame_by_id
(fn_fid, ignorekeys=[])[source]¶ Get the details for the specified Frame using the frame’s id number.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters: - fn_fid (int) – The Framenet id number of the frame
- ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns: Information about a frame
Return type: dict
Also see the frame() function for details about what is contained in the dict that is returned.
-
frame_by_name
(fn_fname, ignorekeys=[], check_cache=True)[source]¶ Get the details for the specified Frame using the frame’s name.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters: - fn_fname (str) – The name of the frame
- ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns: Information about a frame
Return type: dict
Also see the frame() function for details about what is contained in the dict that is returned.
-
frame_ids_and_names
(name=None)[source]¶ Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.
-
frame_relation_types
()[source]¶ Obtain a list of frame relation types.
>>> from nltk.corpus import framenet as fn
>>> frts = list(fn.frame_relation_types())
>>> isinstance(frts, list)
True
>>> len(frts) in (9, 10) # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1,
'_type': 'framerelationtype',
'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...],
'name': 'Inheritance',
'subFrameName': 'Child',
'superFrameName': 'Parent'}
Returns: A list of all of the frame relation types in framenet Return type: list(dict)
-
frame_relations
(frame=None, frame2=None, type=None)[source]¶ Parameters: - frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned
- frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction
- type – (optional) frame relation type (name or object); show only relations of this type
Returns: A list of all of the frame relations in framenet
Return type: list(dict)
>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels) in (1676, 2070) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(274), breakLines=True)
[<Parent=Avoiding -- Inheritance -> Child=Dodging>,
 <Parent=Avoiding -- Inheritance -> Child=Evading>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True)
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
-
frames
(name=None)[source]¶ Obtain a list of frames, optionally restricted to those whose name matches a pattern.
>>> from nltk.corpus import framenet as fn
>>> len(fn.frames()) in (1019, 1221) # FN 1.5 and 1.7, resp.
True
>>> x = PrettyList(fn.frames(r'(?i)crim'), maxReprSize=0, breakLines=True)
>>> x.sort(key=lambda f: f.ID)
>>> x
[<frame ID=200 name=Criminal_process>,
 <frame ID=500 name=Criminal_investigation>,
 <frame ID=692 name=Crime_scenario>,
 <frame ID=700 name=Committing_crime>]
A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):
A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.
We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).
FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:
- Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
- Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
- Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.
- Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.
Parameters: name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned. Returns: A list of matching Frames (or all Frames). Return type: list(AttrDict)
-
frames_by_lemma
(pat)[source]¶ Returns a list of all frames that contain LUs in which the name attribute of the LU matches the given regular expression pat. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).
Note: if you are going to be doing a lot of this type of searching, you’d want to build an index that maps from lemmas to frames because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db.
>>> from nltk.corpus import framenet as fn
>>> fn.frames_by_lemma(r'(?i)a little')
[<frame ID=189 name=Quanti...>, <frame ID=2001 name=Degree>]
Returns: A list of frame objects. Return type: list(AttrDict)
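The note above recommends building an index from lemmas to frames when many lookups are needed. One possible shape for such an index is sketched below. build_lemma_index is a hypothetical helper, and sample_frames is a hand-written stand-in using the “bake.v” and “boil.v” examples mentioned elsewhere on this page; only the ‘name’ and ‘lexUnit’ keys are assumed.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def build_lemma_index(frames: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each lower-cased lemma to the names of the frames whose LUs use it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for frame in frames:
        for lu_name in frame["lexUnit"]:
            # LU names have the form "lemma.POS"; split on the last dot only,
            # so multi-lexeme lemmas such as "a little" stay intact.
            lemma, _, _pos = lu_name.rpartition(".")
            index[lemma.lower()].add(frame["name"])
    return index


# Stand-in data; a real frame dict carries many more keys.
sample_frames = [
    {"name": "Apply_heat", "lexUnit": {"bake.v": {}, "boil.v": {}}},
    {"name": "Cooking_creation", "lexUnit": {"bake.v": {}}},
    {"name": "Absorb_heat", "lexUnit": {"bake.v": {}}},
]

lemma_index = build_lemma_index(sample_frames)
print(sorted(lemma_index["bake"]))
```

With the corpus installed, the index could presumably be built once from fn.frames() and reused, which trades a slow first pass for fast repeated lookups by exact lemma. Unlike frames_by_lemma(), this sketch does not support regular-expression matching.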
-
ft_sents
(docNamePattern=None)[source]¶ Full-text annotation sentences, optionally filtered by document name.
-
lu
(fn_luid, ignorekeys=[], luName=None, frameID=None, frameName=None)[source]¶ Access a lexical unit by its ID. luName, frameID, and frameName are used only in the event that the LU does not have a file in the database (which is the case for LUs with “Problem” status); in this case, a placeholder LU is created which just contains its name, ID, and frame.
Usage examples:
>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> pprint(list(map(PrettyDict, fn.lu(256).lexemes)))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]
>>> fn.lu(227).exemplars[23]
exemplar sentence (352962):
[sentNo] 0
[aPos] 59699508
[LU] (227) guess.v in Coming_to_believe
[frame] (23) Coming_to_believe
[annotationSet] 2 annotation sets
[POS] 18 tags
[POS_tagset] BNC
[GF] 3 relations
[PT] 3 phrases
[Other] 1 entry
[text] + [Target] + [FE]
When he was inside the house , Culley noticed the characteristic
------------------
Content
he would n't have guessed at .
-- ******* --
Co C1 [Evidence:INI]
(Co=Cognizer, C1=Content)
The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:
‘name’ : the name of the LU (e.g. ‘merger.n’)
‘definition’ : textual definition of the LU
‘ID’ : the internal ID number of the LU
‘_type’ : ‘lu’
‘status’ : e.g. ‘Created’
‘frame’ : Frame that this LU belongs to
‘POS’ : the part of speech of this LU (e.g. ‘N’)
‘totalAnnotated’ : total number of examples annotated with this LU
‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)
- ‘sentenceCount’ : a dict with the following two keys:
- ‘annotated’: number of sentences annotated with this LU
- ‘total’ : total number of sentences with this LU
- ‘lexemes’ : a list of dicts describing the lemma of this LU.
Each dict in the list contains these keys:
- ‘POS’ : part of speech e.g. ‘N’
- ‘name’ : either single-lexeme e.g. ‘merger’ or multi-lexeme e.g. ‘a little’
- ‘order’ : the order of the lexeme in the lemma (starting from 1)
- ‘headword’ : a boolean (‘true’ or ‘false’)
- ‘breakBefore’ : Can this lexeme be separated from the previous lexeme?
Consider: “take over.v” as in:
Germany took over the Netherlands in 2 days.
Germany took the Netherlands over in 2 days.
In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:
Mary takes after her grandmother.
*Mary takes her grandmother after.
In this case, ‘breakBefore’ would be “false” for the lexeme “after”
‘lemmaID’ : Can be used to connect lemmas in different LUs
‘semTypes’ : a list of semantic type objects for this LU
- ‘subCorpus’ : a list of subcorpora
- Each item in the list is a dict containing the following keys:
- ‘name’ :
- ‘sentence’ : a list of sentences in the subcorpus
- each item in the list is a dict with the following keys:
- ‘ID’:
- ‘sentNo’:
- ‘text’: the text of the sentence
- ‘aPos’:
- ‘annotationSet’: a list of annotation sets
- each item in the list is a dict with the following keys:
- ‘ID’:
- ‘status’:
- ‘layer’: a list of layers
- each layer is a dict containing the following keys:
- ‘name’: layer name (e.g. ‘BNC’)
- ‘rank’:
- ‘label’: a list of labels for the layer
- each label is a dict containing the following keys:
- ‘start’: start pos of label in sentence ‘text’ (0-based)
- ‘end’: end pos of label in sentence ‘text’ (0-based)
- ‘name’: name of label (e.g. ‘NN1’)
Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.
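Parameters: - fn_luid (int) – The id number of the lexical unit
- ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns: All information about the lexical unit
Return type: dict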
-
lu_basic
(fn_luid)[source]¶ Returns basic information about the LU whose id is fn_luid. This is basically just a wrapper around the lu() function with “subCorpus” info excluded.
>>> from nltk.corpus import framenet as fn
>>> lu = PrettyDict(fn.lu_basic(256), breakLines=True)
>>> # ellipses account for differences between FN 1.5 and 1.7
>>> lu
{'ID': 256,
'POS': 'V',
'URL': u'https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu256.xml',
'_type': 'lu',
'cBy': ...,
'cDate': '02/08/2001 01:27:50 PST Thu',
'definition': 'COD: be aware of beforehand; predict.',
'definitionMarkup': 'COD: be aware of beforehand; predict.',
'frame': <frame ID=26 name=Expectation>,
'lemmaID': 15082,
'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}],
'name': 'foresee.v',
'semTypes': [],
'sentenceCount': {'annotated': ..., 'total': ...},
'status': 'FN1_Sent'}
Parameters: fn_luid (int) – The id number of the desired LU Returns: Basic information about the lexical unit Return type: dict
-
lu_ids_and_names
(name=None)[source]¶ Uses the LU index, which is much faster than looking up each LU definition if only the names and IDs are needed.
-
lus
(name=None, frame=None)[source]¶ Obtain details for lexical units. Optionally restrict by lexical unit name pattern, and/or to a certain frame or frames whose name matches a pattern.
>>> from nltk.corpus import framenet as fn
>>> len(fn.lus()) in (11829, 13572) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.lus(r'(?i)a little'), maxReprSize=0, breakLines=True)
[<lu ID=14744 name=a little bit.adv>,
 <lu ID=14733 name=a little.n>,
 <lu ID=14743 name=a little.adv>]
>>> fn.lus(r'interest', r'(?i)stimulus')
[<lu ID=14920 name=interesting.a>, <lu ID=14894 name=interested.a>]
A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):
A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.
We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:
- Apply_heat: “Michelle baked the potatoes for 45 minutes.”
- Cooking_creation: “Michelle baked her mother a cake for her birthday.”
- Absorb_heat: “The potatoes have to bake for more than 30 minutes.”
These constitute three different LUs, with different definitions.
Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.
Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.
Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.
In the simplest case, frame-evoking words are verbs such as “fried” in:
“Matilde fried the catfish in a heavy iron skillet.”
Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:
“...the reduction of debt levels to $665 million from $2.6 billion.”
Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:
“They were asleep for hours.”
Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.
Parameters: name (str) – A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows the dot. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned.
The valid POSes are:
v - verb
n - noun
a - adjective
adv - adverb
prep - preposition
num - numbers
intj - interjection
art - article
c - conjunction
scon - subordinating conjunction
Returns: A list of selected (or all) lexical units Return type: list of LU objects (dicts)
See the lu() function for info about the specifics of LU objects.
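The “lemma.POS” naming scheme can be illustrated with a small helper that splits an LU name on its last dot. This is a sketch only: split_lu_name and LISTED_POS are hypothetical names, and the POS set simply copies the list above, so a given FrameNet release may contain tags that are not in it.

```python
from typing import Tuple

# POS abbreviations copied from the list above (possibly incomplete).
LISTED_POS = {"v", "n", "a", "adv", "prep", "num", "intj", "art", "c", "scon"}


def split_lu_name(lu_name: str) -> Tuple[str, str]:
    """Split an LU name such as 'a little.adv' into (lemma, POS)."""
    lemma, dot, pos = lu_name.rpartition(".")
    if not dot or not lemma or not pos:
        raise ValueError("expected 'lemma.POS', got %r" % lu_name)
    return lemma, pos


single = split_lu_name("run.v")
multi = split_lu_name("a little.adv")
print(single, multi)
print(multi[1] in LISTED_POS)
```

Splitting on the last dot rather than the first keeps multi-lexeme lemmas and any lemma containing a dot intact.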
-
propagate_semtypes
()[source]¶ Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)
>>> from nltk.corpus import framenet as fn
>>> x = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> fn.propagate_semtypes()
>>> y = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> y-x > 1000
True
-
semtype
(key)[source]¶
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
Parameters: key (string or int) – The name, abbreviation, or id number of the semantic type Returns: Information about a semantic type Return type: dict
-
semtypes
()[source]¶ Obtain a list of semantic types.
>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes) in (73, 109) # FN 1.5 and 1.7, resp.
True
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'definitionMarkup', 'name', 'rootType', 'subTypes', 'superType']
Returns: A list of all of the semantic types in framenet Return type: list(dict)
-
warnings
(v)[source]¶ Enable or disable warnings of data integrity issues as they are encountered. If v is truthy, warnings will be enabled.
(This is a function rather than just an attribute/property to ensure that if enabling warnings is the first action taken, the corpus reader is instantiated first.)
-
-
class
nltk.corpus.reader.
UdhrCorpusReader
(root='udhr')[source]¶ Bases:
nltk.corpus.reader.plaintext.PlaintextCorpusReader
-
ENCODINGS
= [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]¶
-
SKIP
= {'Burmese_Myanmar-UTF8', 'Amharic-Afenegus6..60375', 'Vietnamese-VIQR', 'Esperanto-T61', 'Burmese_Myanmar-WinResearcher', 'Armenian-DallakHelv', 'Magahi-UTF8', 'Magahi-Agra', 'Hungarian_Magyar-Unicode', 'Gujarati-UTF8', 'Bhojpuri-Agra', 'Russian_Russky-UTF8~', 'Tigrinya_Tigrigna-VG2Main', 'Chinese_Mandarin-UTF8', 'Marathi-UTF8', 'Chinese_Mandarin-HZ', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Vietnamese-TCVN', 'Japanese_Nihongo-JIS', 'Navaho_Dine-Navajo-Navaho-font', 'Lao-UTF8', 'Czech-Latin2-err', 'Vietnamese-VPS', 'Tamil-UTF8', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117'}¶
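ENCODINGS is a list of (regular expression, encoding) pairs, and SKIP names the files left out of the corpus. The sketch below shows one plausible way such a list resolves to a per-file encoding, taking the first pattern that matches from the start of the fileid. resolve_encoding is a hypothetical helper, the two lists are abbreviated copies of the values above, and the example fileids other than those quoted above are made up.

```python
import re
from typing import List, Optional, Tuple

# Abbreviated copies of the class attributes shown above.
ENCODINGS: List[Tuple[str, str]] = [
    (".*-Latin1$", "latin-1"),
    (".*-Hebrew$", "hebrew"),
    ("Czech_Cesky-UTF8", "cp1250"),
    (".*-Cyrillic$", "cyrillic"),
    (".*-UTF8$", "utf-8"),
    ("Abkhaz\\-Cyrillic\\+Abkh", "cp1251"),
]
SKIP = {"Lao-UTF8", "Tamil-UTF8"}


def resolve_encoding(fileid: str, encodings: List[Tuple[str, str]]) -> Optional[str]:
    """Return the encoding of the first pattern that matches, else None."""
    for pattern, encoding in encodings:
        if re.match(pattern, fileid):
            return encoding
    return None


fileids = ["English-Latin1", "Czech_Cesky-UTF8", "Abkhaz-Cyrillic+Abkh", "Lao-UTF8"]
resolved = {f: resolve_encoding(f, ENCODINGS) for f in fileids if f not in SKIP}
print(resolved)
```

Order matters under this reading: 'Czech_Cesky-UTF8' appears before the generic '.*-UTF8$' entry, so the more specific pattern wins.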
-
-
class
nltk.corpus.reader.
BNCCorpusReader
(root, fileids, lazy=True)[source] Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
Corpus reader for the XML version of the British National Corpus.
For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().
You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554
If you extracted the archive to a directory called BNC, then you can instantiate the reader as:
BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
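To see which relative paths a fileids pattern like the one above selects, the regular expression can be tried against a few candidate paths. This is a sketch under assumptions: the candidate paths are made up for illustration, and re.fullmatch is used here for clarity, which may differ in detail from how the reader itself applies the pattern.

```python
import re

# The fileids pattern from the example above.
pattern = re.compile(r'[A-K]/\w*/\w*\.xml')

# Hypothetical paths relative to the corpus root.
candidates = [
    "A/A0/A00.xml",
    "K/KS/KSW.xml",
    "A/A0/A00.txt",
    "README",
    "L/L0/L00.xml",
]

selected = [path for path in candidates if pattern.fullmatch(path)]
print(selected)
```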
-
sents
(fileids=None, strip_space=True, stem=False)[source] Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type: list(list(str))
Parameters: - strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
tagged_sents
(fileids=None, c5=False, strip_space=True, stem=False)[source] Returns: the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type: list(list(tuple(str,str)))
Parameters: - c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
tagged_words
(fileids=None, c5=False, strip_space=True, stem=False)[source] Returns: the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type: list(tuple(str,str))
Parameters: - c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
- strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
words
(fileids=None, strip_space=True, stem=False)[source] Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
Parameters: - strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
- stem – If true, then use word stems instead of word strings.
-
-
class
nltk.corpus.reader.
SentiWordNetCorpusReader
(root, fileids, encoding='utf-8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.
SentiSynset
(pos_score, neg_score, synset)[source]¶ Bases:
object
-
unicode_repr
()¶
-
-
class
nltk.corpus.reader.
TwitterCorpusReader
(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.
Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.
Construct a new Tweet corpus reader for a set of documents located at the given root directory.
If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:
from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids='.*\.json')
However, the recommended approach is to set the relevant directory as the value of the environment variable TWITTER, and then invoke the reader as follows:
import os
root = os.environ['TWITTER']
reader = TwitterCorpusReader(root, '.*\.json')
If you want to work directly with the raw Tweets, the json library can be used:
import json
for tweet in reader.docs():
    print(json.dumps(tweet, indent=1, sort_keys=True))
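For readers unfamiliar with line-delimited JSON, the sketch below parses such a stream with the standard library alone, one JSON object per non-empty line. It is not the reader's implementation: iter_json_lines is a hypothetical helper and the two sample Tweets are invented.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_json_lines(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one deserialised object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Made-up sample standing in for a .json file of Tweets.
sample = io.StringIO(
    '{"id": 1, "text": "hello world"}\n'
    '\n'
    '{"id": 2, "text": "second tweet"}\n'
)

tweets = list(iter_json_lines(sample))
print([tweet["text"] for tweet in tweets])
```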
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
docs
(fileids=None)[source]¶ Returns the full Tweet objects, as specified by Twitter documentation on Tweets.
Returns: the given file(s) as a list of dictionaries deserialised from JSON. Return type: list(dict)
-
-
class
nltk.corpus.reader.
NKJPCorpusReader
(root, fileids='.*')[source]¶ Bases:
nltk.corpus.reader.xmldocs.XMLCorpusReader
-
HEADER_MODE
= 2¶
-
RAW_MODE
= 3¶
-
SENTS_MODE
= 1¶
-
WORDS_MODE
= 0¶
-
-
class
nltk.corpus.reader.
CrubadanCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
A corpus reader used to access An Crubadan language n-gram files.
-
class
nltk.corpus.reader.
MTECorpusReader
(root=None, fileids=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.tagged.TaggedCorpusReader
Reader for corpora following the TEI-p5 XML scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset.
-
lemma_paras
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of tuples of the word and the corresponding lemma (word, lemma) Return type: list(list(list(tuple(str, str))))
-
lemma_sents
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of tuples of the word and the corresponding lemma (word, lemma) Return type: list(list(tuple(str, str)))
-
lemma_words
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of words and punctuation symbols with their corresponding lemmas, encoded as (word, lemma) tuples. Return type: list(tuple(str, str))
-
paras
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. Return type: list(list(list(str)))
-
raw
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a single string. Return type: str
-
readme
()[source]¶ Prints some information about this corpus. Returns: the content of the attached README file. Return type: str
-
sents
(fileids=None)[source]¶ Parameters: fileids – A list specifying the fileids that should be used. Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. Return type: list(list(str))
-
tagged_paras
(fileids=None, tagset='msd', tags='')[source]¶ Parameters: - fileids – A list specifying the fileids that should be used.
- tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default.
- tags – An MSD tag used as a filter: parts of the corpus whose tags are not equal to or more precise than the given tag are filtered out.
Returns: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word, tag) tuples.
Return type: list(list(list(tuple(str, str))))
-
tagged_sents
(fileids=None, tagset='msd', tags='')[source]¶ Parameters: - fileids – A list specifying the fileids that should be used.
- tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default.
- tags – An MSD tag used as a filter: parts of the corpus whose tags are not equal to or more precise than the given tag are filtered out.
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of (word, tag) tuples.
Return type: list(list(tuple(str, str)))
-
tagged_words
(fileids=None, tagset='msd', tags='')[source]¶ Parameters: - fileids – A list specifying the fileids that should be used.
- tagset – The tagset that should be used in the returned object, either “universal” or “msd”; “msd” is the default.
- tags – An MSD tag used as a filter: parts of the corpus whose tags are not equal to or more precise than the given tag are filtered out.
Returns: the given file(s) as a list of tagged words and punctuation symbols, encoded as (word, tag) tuples.
Return type: list(tuple(str, str))
-
-
class
nltk.corpus.reader.
ReviewsCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for the Customer Review Data dataset by Hu, Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization.
>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0]
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am', 'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features()
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'), ('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'), ('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'), ('option', '+1')]
We can also reach the same information directly from the stream:
>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]
We can compute stats for specific product features:
>>> from __future__ import division
>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> # The __future__ import above gives true division under Python 2.7 as well
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
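The same statistic can be computed for every feature in a single pass. The following is a minimal sketch that runs without the corpus: the `features` list is hypothetical sample data in the same (feature, score) shape that features() returns, and `feature_stats` is an illustrative helper, not part of NLTK.

```python
from collections import defaultdict

# Hypothetical sample data shaped like the output of features():
# a list of (feature, opinion strength) tuples with signed string scores.
features = [
    ("picture", "+2"),
    ("use", "+2"),
    ("picture", "+1"),
    ("picture", "-1"),
    ("use", "+1"),
]


def feature_stats(pairs):
    """Return {feature: (mentions, total score, mean score)}."""
    scores = defaultdict(list)
    for feat, score in pairs:
        # int() accepts an explicit sign, so "+2" and "-1" parse directly.
        scores[feat].append(int(score))
    return {
        feat: (len(vals), sum(vals), sum(vals) / len(vals))
        for feat, vals in scores.items()
    }


stats = feature_stats(features)
print(stats["use"])
```

With the real corpus, the hypothetical list could be replaced by the result of product_reviews_1.features('Canon_G3.txt').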
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
features
(fileids=None)[source]¶ Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.
Parameters: fileids – a list or regexp specifying the ids of the files whose features have to be returned. Returns: all features for the item(s) in the given file(s). Return type: list(tuple)
-
raw
(fileids=None)[source]¶ Parameters: fileids – a list or regexp specifying the fileids of the files that have to be returned as a raw string. Returns: the given file(s) as a single string. Return type: str
-
reviews
(fileids=None)[source]¶ Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.
Parameters: fileids – a list or regexp specifying the ids of the files whose reviews have to be returned. Returns: the given file(s) as a list of reviews.
-
sents
(fileids=None)[source]¶ Return all sentences in the corpus or in the specified files.
Parameters: fileids – a list or regexp specifying the ids of the files whose sentences have to be returned. Returns: the given file(s) as a list of sentences, each encoded as a list of word strings. Return type: list(list(str))
-
words
(fileids=None)[source]¶ Return all words and punctuation symbols in the corpus or in the specified files.
Parameters: fileids – a list or regexp specifying the ids of the files whose words have to be returned. Returns: the given file(s) as a list of words and punctuation symbols. Return type: list(str)
-
-
class
nltk.corpus.reader.
OpinionLexiconCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
Reader for the Liu and Hu opinion lexicon. Blank lines and readme are ignored.
>>> from nltk.corpus import opinion_lexicon
>>> opinion_lexicon.words()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]
The OpinionLexiconCorpusReader provides shortcuts to retrieve positive/negative words:
>>> opinion_lexicon.negative()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]
Note that the words returned by the words() method are sorted by file id, not alphabetically:
>>> opinion_lexicon.words()[0:10]
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
>>> sorted(opinion_lexicon.words())[0:10]
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort']
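The difference between the two orderings can be reproduced without the corpus. This is only a sketch: the file names and word lists below are hypothetical miniatures, chosen so that each file is sorted on its own while the concatenation is not.

```python
# Hypothetical miniature lexicon: each file is alphabetically sorted on its own.
files = {
    "negative-words.txt": ["2-faced", "abnormal", "abort"],
    "positive-words.txt": ["a+", "abound", "accessible"],
}

# Words grouped by file id: all words of the first file, then the second.
by_file = [word for fileid in sorted(files) for word in files[fileid]]

# A global alphabetical sort interleaves words from both files.
alphabetical = sorted(by_file)

print(by_file)
print(alphabetical)
```

In this toy data 'a+' moves from fourth to second position once the combined list is sorted, which mirrors the behaviour shown in the doctest above.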
-
CorpusView
¶ alias of
IgnoreReadmeCorpusView
-
negative
()[source]¶ Return all negative words in alphabetical order.
Returns: a list of negative words. Return type: list(str)
-
positive
()[source]¶ Return all positive words in alphabetical order.
Returns: a list of positive words. Return type: list(str)
-
words
(fileids=None)[source]¶ Return all words in the opinion lexicon. Note that these words are not sorted in alphabetical order.
Parameters: fileids – a list or regexp specifying the ids of the files whose words have to be returned. Returns: the given file(s) as a list of words and punctuation symbols. Return type: list(str)
-
-
class
nltk.corpus.reader.
ProsConsCorpusReader
(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), encoding='utf8', **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.api.CorpusReader
Reader for the Pros and Cons sentence dataset.
>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons')
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy', 'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'], ...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
sents
(fileids=None, categories=None)[source]¶ Return all sentences in the corpus or in the specified files/categories.
Parameters: - fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- categories – a list specifying the categories whose sentences have to be returned.
Returns: the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
Return type: list(list(str))
-
words
(fileids=None, categories=None)[source]¶ Return all words and punctuation symbols in the corpus or in the specified files/categories.
Parameters: - fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- categories – a list specifying the categories whose words have to be returned.
Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
-
-
class
nltk.corpus.reader.
CategorizedSentencesCorpusReader
(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=None, encoding='utf8', **kwargs)[source]¶ Bases:
nltk.corpus.reader.api.CategorizedCorpusReader
,nltk.corpus.reader.api.CorpusReader
A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows.
Examples using the Subjectivity Dataset:
>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23]
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits', 'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]
Examples using the Sentence Polarity Dataset:
>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents()
[['simplistic', ',', 'silly', 'and', 'tedious', '.'], ["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
raw
(fileids=None, categories=None)[source]¶ Parameters: - fileids – a list or regexp specifying the fileids that have to be returned as a raw string.
- categories – a list specifying the categories whose files have to be returned as a raw string.
Returns: the given file(s) as a single string.
Return type: str
-
sents
(fileids=None, categories=None)[source]¶ Return all sentences in the corpus or in the specified file(s).
Parameters: - fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.
- categories – a list specifying the categories whose sentences have to be returned.
Returns: the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.
Return type: list(list(str))
-
words
(fileids=None, categories=None)[source]¶ Return all words and punctuation symbols in the corpus or in the specified file(s).
Parameters: - fileids – a list or regexp specifying the ids of the files whose words have to be returned.
- categories – a list specifying the categories whose words have to be returned.
Returns: the given file(s) as a list of words and punctuation symbols.
Return type: list(str)
-
-
class
nltk.corpus.reader.
ComparativeSentencesCorpusReader
(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=None, encoding='utf8')[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).
>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly', 'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve", 'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853
-
CorpusView
¶ alias of
StreamBackedCorpusView
-
comparisons
(fileids=None)[source]¶ Return all comparisons in the corpus.
Parameters: fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned. Returns: the given file(s) as a list of Comparison objects. Return type: list(Comparison)
-
keywords
(fileids=None)[source]¶ Return a set of all keywords used in the corpus.
Parameters: fileids – a list or regexp specifying the ids of the files whose keywords have to be returned. Returns: the set of keywords and comparative phrases used in the corpus. Return type: set(str)
-
keywords_readme
()[source]¶ Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).
-
raw
(fileids=None)[source]¶ Parameters: fileids – a list or regexp specifying the fileids that have to be returned as a raw string. Returns: the given file(s) as a single string. Return type: str
-
sents
(fileids=None)[source]¶ Return all sentences in the corpus.
Parameters: fileids – a list or regexp specifying the ids of the files whose sentences have to be returned. Returns: all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified). Return type: list(list(str)) or list(str)
-
-
class
nltk.corpus.reader.
PanLexLiteCorpusReader
(root)[source]¶ Bases:
nltk.corpus.reader.api.CorpusReader
-
MEANING_Q
= '\n SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n ORDER BY dnx2.uq DESC\n '¶
-
TRANSLATION_Q
= '\n SELECT s.tt, sum(s.uq) AS trq FROM (\n SELECT ex2.tt, max(dnx.uq) AS uq\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n GROUP BY ex2.tt, dnx.ui\n ) s\n GROUP BY s.tt\n ORDER BY trq DESC, s.tt\n '¶
-
language_varieties
(lc=None)[source]¶ Return a list of PanLex language varieties.
Parameters: lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned. Returns: the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name. Return type: list(tuple)
-
meanings
(expr_uid, expr_tt)[source]¶ Return a list of meanings for an expression.
Parameters: - expr_uid – the expression’s language variety, as a seven-character uniform identifier.
- expr_tt – the expression’s text.
Returns: a list of Meaning objects.
Return type: list(Meaning)
-
translations
(from_uid, from_tt, to_uid)[source]¶ Return a list of translations for an expression into a single language variety.
Parameters: - from_uid – the source expression’s language variety, as a seven-character uniform identifier.
- from_tt – the source expression’s text.
- to_uid – the target language variety, as a seven-character uniform identifier.
Returns: a list of translation tuples. The first element is the expression text and the second element is the translation quality.
Return type: list(tuple)
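To see how the translation quality in each tuple is derived, TRANSLATION_Q can be run against a toy in-memory SQLite database. This is a sketch under assumptions: the table layout below is inferred only from the column names used in the query, the rows are invented, and integer language-variety ids are passed directly as the query parameters.

```python
import sqlite3

# TRANSLATION_Q as listed for PanLexLiteCorpusReader (whitespace normalised).
TRANSLATION_Q = """
    SELECT s.tt, sum(s.uq) AS trq FROM (
        SELECT ex2.tt, max(dnx.uq) AS uq
        FROM dnx
        JOIN ex ON (ex.ex = dnx.ex)
        JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
        JOIN ex ex2 ON (ex2.ex = dnx2.ex)
        WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?
        GROUP BY ex2.tt, dnx.ui
    ) s
    GROUP BY s.tt
    ORDER BY trq DESC, s.tt
"""

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns the queries reference.
conn.executescript("""
    CREATE TABLE ex (ex INTEGER PRIMARY KEY, lv INTEGER, tt TEXT);
    CREATE TABLE dnx (mn INTEGER, ex INTEGER, uq INTEGER, ap INTEGER, ui INTEGER);
""")
# Invented expressions: 'dog' in variety 1, two candidates in variety 2.
conn.executemany(
    "INSERT INTO ex VALUES (?, ?, ?)",
    [(1, 1, "dog"), (2, 2, "chien"), (3, 2, "clebs")],
)
# Invented rows (mn, ex, uq, ap, ui): expressions sharing an mn share a meaning.
conn.executemany(
    "INSERT INTO dnx VALUES (?, ?, ?, ?, ?)",
    [
        (10, 1, 5, 0, 100), (10, 2, 5, 0, 100),
        (11, 1, 7, 0, 200), (11, 2, 7, 0, 200),
        (12, 1, 3, 0, 100), (12, 3, 3, 0, 100),
    ],
)

# Parameter order follows the WHERE clause: source lv, source text, target lv.
translations = conn.execute(TRANSLATION_Q, (1, "dog", 2)).fetchall()
print(translations)
```

The inner query keeps the best uq per (translation, ui) pair and the outer query sums those maxima, so a translation attested under several ui values ranks above one attested under a single ui.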
-
-
class
nltk.corpus.reader.
NonbreakingPrefixesCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
This class reads the nonbreaking prefix text files from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses word tokenizer.
-
available_langs
= {'lv': 'lv', 'de': 'de', 'tamil': 'ta', 'spanish': 'es', 'hu': 'hu', 'latvian': 'lv', 'slovenian': 'sl', 'sv': 'sv', 'finnish': 'fi', 'ca': 'ca', 'icelandic': 'is', 'french': 'fr', 'greek': 'el', 'english': 'en', 'fr': 'fr', 'slovak': 'sk', 'is': 'is', 'pt': 'pt', 'czech': 'cs', 'pl': 'pl', 'en': 'en', 'polish': 'pl', 'sl': 'sl', 'italian': 'it', 'catalan': 'ca', 'ro': 'ro', 'fi': 'fi', 'portuguese': 'pt', 'el': 'el', 'cs': 'cs', 'sk': 'sk', 'hungarian': 'hu', 'russian': 'ru', 'swedish': 'sv', 'ru': 'ru', 'romanian': 'ro', 'ta': 'ta', 'dutch': 'nl', 'es': 'es', 'nl': 'nl', 'german': 'de', 'it': 'it'}¶
-
words
(lang=None, fileids=None, ignore_lines_startswith='#')[source]¶ Return a list of nonbreaking prefixes for the specified language(s).
>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')[:10] == [u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J']
True
>>> nbp.words('ta')[:5] == [u'அ', u'ஆ', u'இ', u'ஈ', u'உ']
True
Returns: a list of words for the specified language(s).
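As a rough illustration of why a tokenizer needs such a list, the sketch below splits a trailing period off a token unless the token is a known nonbreaking prefix. It is a deliberate simplification and not the Moses algorithm; `NONBREAKING_PREFIXES` is a hypothetical stand-in for the list returned by words('en'), and `split_final_periods` is an illustrative helper.

```python
# Hypothetical, tiny stand-in for the English prefix list; the real list is longer.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "A", "B"}


def split_final_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Detach a trailing '.' from each token unless the token is a known prefix."""
    tokens = []
    for token in text.split():
        stem = token[:-1]
        if token.endswith(".") and stem and stem not in prefixes:
            # Ordinary word followed by a period: emit the period separately.
            tokens.extend([stem, "."])
        else:
            # Nonbreaking prefix (or no trailing period): keep the token whole.
            tokens.append(token)
    return tokens


tokens = split_final_periods("Dr. Smith arrived.")
print(tokens)
```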
-
-
class
nltk.corpus.reader.
UnicharsCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
This class is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm
-
available_categories
= ['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'Open_Punctuation']¶
-
chars
(category=None, fileids=None)[source]¶ Return a list of characters from the Perl Unicode Properties. They are very useful when porting Perl tokenizers to Python.
>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')[:5] == [u'(', u'[', u'{', u'༺', u'༼']
True
>>> pup.chars('Currency_Symbol')[:5] == [u'$', u'¢', u'£', u'¤', u'¥']
True
>>> pup.available_categories
['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'Open_Punctuation']
Returns: a list of characters in the given Unicode character category.
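When porting a Perl rule such as a `\p{Open_Punctuation}` match, a character list like this is typically turned into a regular expression character class. The sketch below shows that step using only the standard library: as an assumption, it builds the list from Unicode general category "Ps" with unicodedata rather than reading the perluniprops files, and `pad` is an illustrative helper.

```python
import re
import sys
import unicodedata

# Stand-in for the Open_Punctuation character list (assumption: general
# category "Ps" approximates the Perl property for illustration purposes).
open_punct = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Ps"
]

# Escape every character so that '[' and similar are literal inside the class.
char_class = "[" + "".join(re.escape(c) for c in open_punct) + "]"
pad_open_punct = re.compile("(" + char_class + ")")


def pad(text):
    """Surround each opening punctuation character with spaces."""
    return pad_open_punct.sub(r" \1 ", text)


print(pad("f(x)"))
```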
-
-
class
nltk.corpus.reader.
MWAPPDBCorpusReader
(root, fileids, encoding='utf8', tagset=None)[source]¶ Bases:
nltk.corpus.reader.wordlist.WordListCorpusReader
This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015).
The original source of the full PPDB corpus can be found at http://www.cis.upenn.edu/~ccb/ppdb/
Returns: a list of tuples of similar lexical terms.
-
entries
(fileids='ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs')[source]¶ Returns: a tuple of synonym word pairs.
-
mwa_ppdb_xxxl_file
= 'ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs'¶
-