1 Appendix: NLTK Modules and Corpora

NLTK Organization: NLTK is organized into a collection of task-specific packages. Each package is a combination of data structures for representing a particular kind of information such as trees, and implementations of standard algorithms involving those structures such as parsers. This approach is a standard feature of object-oriented design, in which components encapsulate both the resources and methods needed to accomplish a particular task.

The most fundamental NLTK components are for identifying and manipulating individual words of text. These include: tokenize, for breaking up strings of characters into word tokens; tag, for adding part-of-speech tags, including regular-expression taggers, n-gram taggers and Brill taggers; and the Porter stemmer.

The second kind of module is for creating and manipulating structured linguistic information. These components include: tree, for representing and processing parse trees; featurestructure, for building and unifying nested feature structures (or attribute-value matrices); cfg, for specifying context-free grammars; and parse, for creating parse trees over input text, including chart parsers, chunk parsers and probabilistic parsers.

Several utility components are provided to facilitate processing and visualization. These include: draw, to visualize NLP structures and processes; probability, to count and collate events, and perform statistical estimation; and corpora, to access tagged linguistic corpora.

A further group of components is not part of NLTK proper. These are a wide selection of third-party contributions, often developed as student projects at various institutions where NLTK is used, and distributed in a separate package called NLTK Contrib. Several of these student contributions, such as the Brill tagger and the HMM module, have now been incorporated into NLTK. Although these contributed components are not maintained, they may serve as a useful starting point for future student projects.

In addition to software and documentation, NLTK provides substantial corpus samples. Many of these can be accessed using the corpora module, avoiding the need to write specialized file parsing code before you can do NLP tasks. These corpora include: Brown Corpus — 1.15 million words of tagged text in 15 genres; a 10% sample of the Penn Treebank corpus, consisting of 40,000 words of syntactically parsed text; a selection of books from Project Gutenberg totally 1.7 million words; and other corpora for chunking, prepositional phrase attachment, word-sense disambiguation, text categorization, and information extraction.

Table 1.1

Corpora and Corpus Samples Distributed with NLTK
Corpus	Compiler	Contents
Alpino Dutch Treebank	van Noord	140k words, tagged and parsed (Dutch)
Australian ABC News	Bird	2 genres, 660k words, sentence-segmented
Brown Corpus	Francis, Kucera	15 genres, 1.15M words, tagged, categorized
CESS-CAT Catalan Treebank	CLiC-UB et al	500k words, tagged and parsed
CESS-ESP Spanish Treebank	CLiC-UB et al	500k words, tagged and parsed
CMU Pronouncing Dictionary	CMU	127k entries
CoNLL 2000 Chunking Data	Tjong Kim Sang	270k words, tagged and chunked
CoNLL 2002 Named Entity	Tjong Kim Sang	700k words, pos- and named-entity-tagged (Dutch, Spanish)
Floresta Treebank	Diana Santos et al	9k sentences (Portuguese)
Genesis Corpus	Misc web sources	6 texts, 200k words, 6 languages
Gutenberg (sel)	Hart, Newby, et al	14 texts, 1.7M words
Indian POS-Tagged Corpus	Kumaran et al	60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus	NILC, USP, Brazil	1M words, tagged (Brazilian Portuguese)
Movie Reviews	Pang, Lee	Sentiment Polarity Dataset 2.0
Names Corpus	Kantrowitz, Ross	8k male and female names
NIST 1999 Info Extr (sel)	Garofolo	63k words, newswire and named-entity SGML markup
NPS Chat Corpus	Forsyth, Martell	10k IM chat posts, POS-tagged and dialogue-act tagged
PP Attachment Corpus	Ratnaparkhi	28k prepositional phrases, tagged as noun or verb modifiers
Presidential Addresses	Ahrens	485k words, formatted text
Proposition Bank	Palmer	113k propositions, 3300 verb frames
Question Classification	Li, Roth	6k questions, categorized
Reuters Corpus	Reuters	1.3M words, 10k news documents, categorized
Roget's Thesaurus	Project Gutenberg	200k words, formatted text
RTE Textual Entailment	Dagan et al	8k sentence pairs, categorized
SEMCOR	Rus, Mihalcea	880k words, part-of-speech and sense tagged
SENSEVAL 2 Corpus	Ted Pedersen	600k words, part-of-speech and sense tagged
Shakespeare XML texts (sel)	Jon Bosak	8 books
Stopwords Corpus	Porter et al	2,400 stopwords for 11 languages
Switchboard Corpus (sel)	LDC	36 phonecalls, transcribed, parsed
Univ Decl of Human Rights		480k words, 300+ languages
US Pres Addr Corpus	Ahrens	480k words
Penn Treebank (sel)	LDC	40k words, tagged and parsed
TIMIT Corpus (sel)	NIST/LDC	audio files and transcripts for 16 speakers
VerbNet 2.1	Palmer et al	5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus	OpenOffice.org et al	960k words and 20k affixes for 8 languages
WordNet 3.0 (English)	Miller, Fellbaum	145k synonym sets

About this document...

This chapter is a draft from Natural Language Processing [http://nltk.org/book.html], by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit [http://nltk.org/], Version 0.9.5, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License [http://creativecommons.org/licenses/by-nc-nd/3.0/us/].

This document is