nltk.corpus package
Subpackages
- nltk.corpus.reader package
- Submodules
- nltk.corpus.reader.aligned module
- nltk.corpus.reader.api module
- nltk.corpus.reader.bnc module
- nltk.corpus.reader.bracket_parse module
- nltk.corpus.reader.categorized_sents module
- nltk.corpus.reader.chasen module
- nltk.corpus.reader.childes module
- nltk.corpus.reader.chunked module
- nltk.corpus.reader.cmudict module
- nltk.corpus.reader.comparative_sents module
- nltk.corpus.reader.conll module
- nltk.corpus.reader.crubadan module
- nltk.corpus.reader.dependency module
- nltk.corpus.reader.framenet module
- nltk.corpus.reader.ieer module
- nltk.corpus.reader.indian module
- nltk.corpus.reader.ipipan module
- nltk.corpus.reader.knbc module
- nltk.corpus.reader.lin module
- nltk.corpus.reader.mte module
- nltk.corpus.reader.nkjp module
- nltk.corpus.reader.nombank module
- nltk.corpus.reader.nps_chat module
- nltk.corpus.reader.opinion_lexicon module
- nltk.corpus.reader.panlex_lite module
- nltk.corpus.reader.pl196x module
- nltk.corpus.reader.plaintext module
- nltk.corpus.reader.ppattach module
- nltk.corpus.reader.propbank module
- nltk.corpus.reader.pros_cons module
- nltk.corpus.reader.reviews module
- nltk.corpus.reader.rte module
- nltk.corpus.reader.semcor module
- nltk.corpus.reader.senseval module
- nltk.corpus.reader.sentiwordnet module
- nltk.corpus.reader.sinica_treebank module
- nltk.corpus.reader.string_category module
- nltk.corpus.reader.switchboard module
- nltk.corpus.reader.tagged module
- nltk.corpus.reader.timit module
- nltk.corpus.reader.toolbox module
- nltk.corpus.reader.twitter module
- nltk.corpus.reader.udhr module
- nltk.corpus.reader.util module
- nltk.corpus.reader.verbnet module
- nltk.corpus.reader.wordlist module
- nltk.corpus.reader.wordnet module
- nltk.corpus.reader.xmldocs module
- nltk.corpus.reader.ycoe module
- Module contents
Submodules
nltk.corpus.europarl_raw module
nltk.corpus.util module
class nltk.corpus.util.LazyCorpusLoader(name, reader_cls, *args, **kwargs)
Bases: object
To see the API documentation for this lazily loaded corpus, first run corpus.ensure_loaded(), and then run help(this_corpus).
LazyCorpusLoader is a proxy object which is used to stand in for a corpus object before the corpus is loaded. This allows NLTK to create an object for each corpus, but defer the costs associated with loading those corpora until the first time that they’re actually accessed.
The first time this object is accessed in any way, it loads the corresponding corpus and transforms itself into that corpus (by modifying its own __class__ and __dict__ attributes).
If the corpus cannot be found, accessing this object raises an exception that displays installation instructions for the NLTK data package. Once the user has properly installed the data package (or modified nltk.data.path to point to its location), the corpus object can be used without restarting Python.
Parameters:
- name (str) – The name of the corpus
- reader_cls – The specific CorpusReader class, e.g. PlaintextCorpusReader, WordListCorpusReader
- nltk_data_subdir (str) – The subdirectory where the corpus is stored.
- *args – Any other non-keyword arguments that reader_cls might need.
- **kwargs – Any other keyword arguments that reader_cls might need.
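The self-transforming proxy described above can be illustrated with a minimal sketch in plain Python. This is not NLTK's implementation: ToyCorpus and LazyProxy are hypothetical names invented for illustration, and the sketch assumes the real object can be built from a single name argument.

```python
class ToyCorpus:
    """Stand-in for a real corpus reader (hypothetical, not part of NLTK)."""

    def __init__(self, name):
        self.name = name
        self.loaded_words = ["The", "Fulton", "County"]

    def words(self):
        return self.loaded_words


class LazyProxy:
    """Toy proxy that turns into the real object on first attribute access."""

    def __init__(self, name, reader_cls):
        # Store only what is needed to build the real object later.
        self._name = name
        self._reader_cls = reader_cls

    def _load(self):
        # Build the real object, then adopt its state and its class.
        real = self._reader_cls(self._name)
        self.__dict__ = real.__dict__
        self.__class__ = real.__class__

    def __getattr__(self, attr):
        # Only called when normal lookup fails, i.e. before loading.
        if attr.startswith("__"):
            raise AttributeError(attr)
        self._load()
        return getattr(self, attr)


corpus = LazyProxy("toy", ToyCorpus)
print(type(corpus).__name__)  # LazyProxy
print(corpus.words())         # ['The', 'Fulton', 'County']
print(type(corpus).__name__)  # ToyCorpus
```

Because the proxy replaces its own __class__, later attribute lookups go straight to the real class and no longer pass through __getattr__.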
unicode_repr()
Module contents
NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can be used to read both the corpus files that are distributed in the NLTK corpus package, and corpus files that are part of external corpora.
Available Corpora
Please see http://www.nltk.org/nltk_data/ for a complete list. Install corpora using nltk.download().
Corpus Reader Functions
Each corpus module defines one or more “corpus reader functions”, which can be used to read documents from that corpus. These functions take an argument, item, which indicates which document should be read from the corpus:
- If item is one of the unique identifiers listed in the corpus module’s items variable, then the corresponding document will be loaded from the NLTK corpus package.
- If item is a filename, then that file will be read.
Additionally, corpus reader functions can be given lists of item names, in which case they return a concatenation of the corresponding documents.
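The single-item versus list-of-items behaviour can be sketched with a toy reader function. This is an illustration only, not NLTK code: the words() helper below is hypothetical, handles filenames only (not the corpus identifiers mentioned above), and assumes simple whitespace tokenization.

```python
import os
import tempfile


def words(item):
    """Toy reader: return the whitespace-separated words of one or more files.

    `item` may be a single filename or a list of filenames
    (hypothetical helper, not NLTK's implementation).
    """
    filenames = [item] if isinstance(item, str) else list(item)
    result = []
    for filename in filenames:
        with open(filename, encoding="utf-8") as handle:
            result.extend(handle.read().split())
    return result


def demo():
    # Write two small files, then read one alone and both together.
    with tempfile.TemporaryDirectory() as root:
        first = os.path.join(root, "a.txt")
        second = os.path.join(root, "b.txt")
        with open(first, "w", encoding="utf-8") as handle:
            handle.write("The Fulton County")
        with open(second, "w", encoding="utf-8") as handle:
            handle.write("Grand Jury said")
        return words(first), words([first, second])


single, combined = demo()
print(single)    # ['The', 'Fulton', 'County']
print(combined)  # ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said']
```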
Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:
- words(): list of str
- sents(): list of (list of str)
- paras(): list of (list of (list of str))
- tagged_words(): list of (str,str) tuples
- tagged_sents(): list of (list of (str,str))
- tagged_paras(): list of (list of (list of (str,str)))
- chunked_sents(): list of (Tree w/ (str,str) leaves)
- parsed_sents(): list of (Tree with str leaves)
- parsed_paras(): list of (list of (Tree with str leaves))
- xml(): A single xml ElementTree
- raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():
>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...
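The return types listed above nest inside one another: a paragraph is a list of sentences, and a sentence is a list of words. The sketch below uses hand-written toy data (not real corpus output) to show how the paras() shape flattens into the sents() and words() shapes. It assumes the three views cover the same text, as their types suggest.

```python
from itertools import chain

# Toy data in the paras() shape: list of (list of (list of str)).
toy_paras = [
    [["The", "jury", "said", "."], ["It", "met", "Friday", "."]],
    [["No", "evidence", "was", "found", "."]],
]

# Flatten one level to get the sents() shape: list of (list of str).
toy_sents = list(chain.from_iterable(toy_paras))

# Flatten once more to get the words() shape: list of str.
toy_words = list(chain.from_iterable(toy_sents))

print(len(toy_paras), len(toy_sents), len(toy_words))  # 2 3 13
```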