Reader for corpora that consist of plaintext documents. Paragraphs
are assumed to be split using blank lines. Sentences and words can be
tokenized using the default tokenizers, or by custom tokenizers
specified as parameters to the constructor.
This corpus reader can be customized (e.g., to skip preface sections
of specific document formats) by creating a subclass and overriding the
CorpusView class variable.
__init__(self,
         root,
         files,
         word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, ...),
         sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'),
         para_block_reader=read_blankline_block,
         encoding=None)

    Construct a new plaintext corpus reader for a set of documents
    located at the given root directory.
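The default para_block_reader splits a plaintext document into paragraph blocks at runs of blank lines. As a minimal sketch of that behaviour, here is a hypothetical stdlib-only helper (not NLTK's actual read_blankline_block, which operates on a stream):

```python
import re

def split_blankline_paragraphs(text):
    """Split plaintext into paragraphs on runs of blank lines.

    Hypothetical illustration of the blank-line convention; NLTK's
    read_blankline_block reads blocks incrementally from a stream.
    """
    # One or more whitespace-only lines delimit paragraphs.
    blocks = re.split(r'\n\s*\n', text.strip())
    return [block.strip() for block in blocks if block.strip()]

sample = "First paragraph,\nstill first.\n\nSecond paragraph.\n"
paragraphs = split_blankline_paragraphs(sample)
```

A custom para_block_reader passed to the constructor would implement the same kind of segmentation for corpora that mark paragraphs differently.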
str

raw(self, files=None)

    Returns: the given file or files as a single string.
list of str

words(self, files=None)

    Returns: the given file or files as a list of words and
    punctuation symbols.
list of (list of str)

sents(self, files=None)

    Returns: the given file or files as a list of sentences or
    utterances, each encoded as a list of word strings.
list of (list of (list of str))

paras(self, files=None)

    Returns: the given file or files as a list of paragraphs, each
    encoded as a list of sentences, which are in turn encoded as
    lists of word strings.
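The four access methods above return progressively more structured views of the same text: a string, a flat word list, sentences as word lists, and paragraphs as lists of sentences. The hypothetical class below illustrates those nested return shapes using naive regex tokenization in place of NLTK's word and sentence tokenizers:

```python
import re

class TinyPlaintextReader:
    """Illustration of the raw/words/sents/paras return structures.

    Hypothetical sketch; real readers use configurable tokenizers and
    read from files under a corpus root, not from an in-memory string.
    """

    def __init__(self, text):
        self._text = text

    def raw(self):
        # str: the document as a single string
        return self._text

    def words(self):
        # list of str: words and punctuation symbols
        return re.findall(r'\w+|[^\w\s]+', self._text)

    def sents(self):
        # list of (list of str): sentences, each a list of word strings
        flat = self._text.replace('\n', ' ').strip()
        sentences = re.split(r'(?<=[.!?])\s+', flat)
        return [re.findall(r'\w+|[^\w\s]+', s) for s in sentences if s]

    def paras(self):
        # list of (list of (list of str)): paragraphs -> sentences -> words
        result = []
        for para in re.split(r'\n\s*\n', self._text.strip()):
            flat = para.replace('\n', ' ').strip()
            sentences = re.split(r'(?<=[.!?])\s+', flat)
            result.append(
                [re.findall(r'\w+|[^\w\s]+', s) for s in sentences if s])
        return result

reader = TinyPlaintextReader("Hello world. Bye.\n\nNew para.")
```

Here `reader.paras()` nests three levels deep, matching the `list of (list of (list of str))` return type documented for paras().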
Inherited from api.CorpusReader:
    __repr__, abspath, abspaths, encoding, files, open
Inherited from object:
    __delattr__, __getattribute__, __hash__, __new__, __reduce__,
    __reduce_ex__, __setattr__, __str__
Inherited from api.CorpusReader:
    filenames