Package corpus

NLTK corpus readers. The modules in this package provide functions for reading corpus files in a variety of formats. These functions can read both the corpus files distributed in the NLTK corpus package and corpus files that belong to external corpora.

Corpus Reader Functions

Each corpus module defines one or more corpus reader functions, which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:

- If item is one of the unique identifiers listed in the corpus module's items variable, then the corresponding document will be loaded from the NLTK corpus package.
- If item is a filename, then that file will be read.

Corpus reader functions can also be given a list of item names, in which case they return a concatenation of the corresponding documents.
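The item rules above can be sketched in plain Python. This is only an illustration of the described behaviour, not NLTK's implementation: the function read_words and the items dictionary are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
import os
import tempfile


def read_words(item, items):
    """Return the words of one document, or of several concatenated.

    `items` is a hypothetical stand-in for a corpus module's items
    variable: a dict mapping identifiers to file paths.
    """
    # A list (or tuple) of item names yields the concatenated documents.
    if isinstance(item, (list, tuple)):
        words = []
        for one in item:
            words.extend(read_words(one, items))
        return words
    # A known identifier maps to its packaged file; anything else is
    # treated as a filename and read directly.
    path = items.get(item, item)
    with open(path) as handle:
        return handle.read().split()


# Build two tiny documents in a temporary directory to exercise the sketch.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.txt")
path_b = os.path.join(tmpdir, "b.txt")
with open(path_a, "w") as handle:
    handle.write("The Fulton County Grand Jury")
with open(path_b, "w") as handle:
    handle.write("said Friday")

items = {"a": path_a}  # only "a" is registered as an identifier

by_identifier = read_words("a", items)            # looked up in items
by_filename = read_words(path_b, items)           # read directly as a file
concatenated = read_words(["a", path_b], items)   # both documents, in order
```

Resolving identifiers before filenames mirrors the order in which the two cases are listed; the actual precedence in NLTK is not stated here, so treat that choice as an assumption.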

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

corpus.words(): list of str
corpus.sents(): list of (list of str)
corpus.paras(): list of (list of (list of str))
corpus.tagged_words(): list of (str,str) tuples
corpus.tagged_sents(): list of (list of (str,str))
corpus.tagged_paras(): list of (list of (list of (str,str)))
corpus.chunked_sents(): list of (Tree with (str,str) leaves)
corpus.parsed_sents(): list of (Tree with str leaves)
corpus.parsed_paras(): list of (list of (Tree with str leaves))
corpus.xml(): a single XML ElementTree
corpus.raw(): unprocessed corpus contents
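To make the nesting in these return types concrete, the following sketch builds data by hand in the documented shapes. It does not call NLTK. The sentences and tags are made up for illustration, and it assumes the three views describe the same text, so that flattening one level of nesting moves from one shape to the next.

```python
# Hand-built data in the shape documented for paras():
# list of paragraphs -> list of sentences -> list of word strings.
paras = [
    [["The", "jury", "said", "."], ["It", "met", "Friday", "."]],
    [["No", "evidence", "was", "found", "."]],
]

# Flattening one level gives the shape of sents(): list of (list of str).
sents = [sent for para in paras for sent in para]

# Flattening again gives the shape of words(): list of str.
words = [word for sent in sents for word in sent]

# Tagged variants replace each str with a (word, tag) tuple.
# The tags below are invented for this example.
tagged_sent = [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")]
untagged = [word for word, tag in tagged_sent]
```

The same pattern applies to the tagged readers: a tagged_paras() result has the paras() nesting with (str,str) tuples where the strings would be.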

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(brown.words())
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Corpus Metadata

Metadata about the NLTK corpora, and their individual documents, is stored using Open Language Archives Community (OLAC) metadata records. These records can be accessed using nltk.corpus.corpus.olac().

Submodules

nltk.corpus.chat80: Chat-80 was a natural language system which allowed the user to interrogate a Prolog knowledge base in the domain of world geography.
nltk.corpus.reader: Corpus readers.
- nltk.corpus.reader.api: API for corpus readers.
- nltk.corpus.reader.bnc: Corpus reader for the XML version of the British National Corpus.
- nltk.corpus.reader.bracket_parse
- nltk.corpus.reader.chunked: A reader for corpora that contain chunked (and optionally tagged) documents.
- nltk.corpus.reader.cmudict: The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6] (ftp://ftp.cs.cmu.edu/project/speech/dict/). Copyright 1998 Carnegie Mellon University.
- nltk.corpus.reader.conll: Read CoNLL-style chunk files.
- nltk.corpus.reader.ieer: Corpus reader for the Information Extraction and Entity Recognition Corpus.
- nltk.corpus.reader.indian: Indian Language POS-Tagged Corpus. Collected by A Kumaran, Microsoft Research, India. Distributed with permission.
- nltk.corpus.reader.nps_chat
- nltk.corpus.reader.plaintext: A reader for corpora that consist of plaintext documents.
- nltk.corpus.reader.ppattach: Read lines from the Prepositional Phrase Attachment Corpus.
- nltk.corpus.reader.propbank
- nltk.corpus.reader.rte: Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora.
- nltk.corpus.reader.senseval: Read from the Senseval 2 Corpus.
- nltk.corpus.reader.sinica_treebank: Sinica Treebank Corpus Sample
- nltk.corpus.reader.string_category: Read tuples from a corpus consisting of categorized strings.
- nltk.corpus.reader.tagged: A reader for corpora whose documents contain part-of-speech-tagged words.
- nltk.corpus.reader.timit: Read tokens, phonemes and audio data from the NLTK TIMIT Corpus.
- nltk.corpus.reader.toolbox: Module for reading, writing and manipulating Toolbox databases and settings files.
- nltk.corpus.reader.util
- nltk.corpus.reader.verbnet
- nltk.corpus.reader.wordlist
- nltk.corpus.reader.xmldocs: Corpus reader for corpora whose documents are xml files.
- nltk.corpus.reader.ycoe: Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.
nltk.corpus.util

Functions

demo()

Variables

abc = <PlaintextCorpusReader in '/usr/share/nltk_data/corpora/...

alpino = <AlpinoCorpusReader in '/usr/share/nltk_data/corpora/...

brown = <CategorizedTaggedCorpusReader in '/usr/share/nltk_dat...

cess_cat = <BracketParseCorpusReader in '/usr/share/nltk_data/...

cess_esp = <BracketParseCorpusReader in '/usr/share/nltk_data/...

cmudict = <CMUDictCorpusReader in '/usr/share/nltk_data/corpor...

conll2000 = <ConllChunkCorpusReader in '/usr/share/nltk_data/c...

conll2002 = <ConllChunkCorpusReader in '/usr/share/nltk_data/c...

floresta = <BracketParseCorpusReader in '/usr/share/nltk_data/...

genesis = <PlaintextCorpusReader in '/usr/share/nltk_data/corp...

gutenberg = <PlaintextCorpusReader in '/usr/share/nltk_data/co...

hebrew_treebank = LazyCorpusLoader('hebrew_treebank', BracketP...

ieer = <IEERCorpusReader in '/usr/share/nltk_data/corpora/ieer'>

inaugural = <PlaintextCorpusReader in '/usr/share/nltk_data/co...

indian = <IndianCorpusReader in '/usr/share/nltk_data/corpora/...

mac_morpho = <MacMorphoCorpusReader in '/usr/share/nltk_data/c...

movie_reviews = <CategorizedPlaintextCorpusReader in '/usr/sha...

names = <WordListCorpusReader in '/usr/share/nltk_data/corpora...

nps_chat = <NPSChatCorpusReader in '/usr/share/nltk_data/corpo...

ppattach = <PPAttachmentCorpusReader in '/usr/share/nltk_data/...

propbank = <PropbankCorpusReader in '/usr/share/nltk_data/corp...

qc = <StringCategoryCorpusReader in '/usr/share/nltk_data/corp...

reuters = <CategorizedPlaintextCorpusReader in '/usr/share/nlt...

rte = <RTECorpusReader in '/usr/share/nltk_data/corpora/rte'>

senseval = <SensevalCorpusReader in '/usr/share/nltk_data/corp...

shakespeare = <XMLCorpusReader in '/usr/share/nltk_data/corpor...

sinica_treebank = <SinicaTreebankCorpusReader in '/usr/share/n...

state_union = <PlaintextCorpusReader in '/usr/share/nltk_data/...

stopwords = <WordListCorpusReader in '/usr/share/nltk_data/cor...

timit = <TimitCorpusReader in '/usr/share/nltk_data/corpora/ti...

toolbox = <ToolboxCorpusReader in '/usr/share/nltk_data/corpor...

treebank = <BracketParseCorpusReader in '/usr/share/nltk_data/...

treebank_chunk = <ChunkedCorpusReader in '/usr/share/nltk_data...

treebank_raw = <PlaintextCorpusReader in '/usr/share/nltk_data...

udhr = <PlaintextCorpusReader in '/usr/share/nltk_data/corpora...

verbnet = <VerbnetCorpusReader in '/usr/share/nltk_data/corpor...

webtext = <PlaintextCorpusReader in '/usr/share/nltk_data/corp...

words = <WordListCorpusReader in '/usr/share/nltk_data/corpora...

ycoe = LazyCorpusLoader('ycoe', YCOECorpusReader)

Variables Details

abc

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/abc'>

alpino

Value:

<AlpinoCorpusReader in '/usr/share/nltk_data/corpora/alpino'>

brown

Value:

<CategorizedTaggedCorpusReader in '/usr/share/nltk_data/corpora/brown'>

cess_cat

Value:

<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/cess_cat'>

cess_esp

Value:

<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/cess_esp'>

cmudict

Value:

<CMUDictCorpusReader in '/usr/share/nltk_data/corpora/cmudict'>

conll2000

Value:

<ConllChunkCorpusReader in '/usr/share/nltk_data/corpora/conll2000'>

conll2002

Value:

<ConllChunkCorpusReader in '/usr/share/nltk_data/corpora/conll2002'>

floresta

Value:

<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/floresta'>

genesis

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/genesis'>

gutenberg

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/gutenberg'>

hebrew_treebank

Value:

LazyCorpusLoader('hebrew_treebank', BracketParseCorpusReader, r'.*\.txt')

inaugural

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/inaugural'>

indian

Value:

<IndianCorpusReader in '/usr/share/nltk_data/corpora/indian'>

mac_morpho

Value:

<MacMorphoCorpusReader in '/usr/share/nltk_data/corpora/mac_morpho'>

movie_reviews

Value:

<CategorizedPlaintextCorpusReader in '/usr/share/nltk_data/corpora/movie_reviews'>

names

Value:

<WordListCorpusReader in '/usr/share/nltk_data/corpora/names'>

nps_chat

Value:

<NPSChatCorpusReader in '/usr/share/nltk_data/corpora/nps_chat'>

ppattach

Value:

<PPAttachmentCorpusReader in '/usr/share/nltk_data/corpora/ppattach'>

propbank

Value:

<PropbankCorpusReader in '/usr/share/nltk_data/corpora/propbank'>

qc

Value:

<StringCategoryCorpusReader in '/usr/share/nltk_data/corpora/qc'>

reuters

Value:

<CategorizedPlaintextCorpusReader in '/usr/share/nltk_data/corpora/reuters'>

senseval

Value:

<SensevalCorpusReader in '/usr/share/nltk_data/corpora/senseval'>

shakespeare

Value:

<XMLCorpusReader in '/usr/share/nltk_data/corpora/shakespeare'>

sinica_treebank

Value:

<SinicaTreebankCorpusReader in '/usr/share/nltk_data/corpora/sinica_treebank'>

state_union

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/state_union'>

stopwords

Value:

<WordListCorpusReader in '/usr/share/nltk_data/corpora/stopwords'>

timit

Value:

<TimitCorpusReader in '/usr/share/nltk_data/corpora/timit'>

toolbox

Value:

<ToolboxCorpusReader in '/usr/share/nltk_data/corpora/toolbox'>

treebank

Value:

<BracketParseCorpusReader in '/usr/share/nltk_data/corpora/treebank/combined'>

treebank_chunk

Value:

<ChunkedCorpusReader in '/usr/share/nltk_data/corpora/treebank/tagged'>

treebank_raw

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/treebank/raw'>

udhr

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/udhr'>

verbnet

Value:

<VerbnetCorpusReader in '/usr/share/nltk_data/corpora/verbnet'>

webtext

Value:

<PlaintextCorpusReader in '/usr/share/nltk_data/corpora/webtext'>

words

Value:

<WordListCorpusReader in '/usr/share/nltk_data/corpora/words'>