Class PlaintextCorpusReader

      object --+
               |
api.CorpusReader --+
                   |
                  PlaintextCorpusReader
Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be separated by blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
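
The paragraph and word conventions described above can be illustrated without NLTK. The following is a minimal sketch, not the reader's actual implementation: `split_paragraphs` and `tokenize_words` are hypothetical helper names, and the regular expression is the `WordPunctTokenizer` pattern that appears in the `__init__` signature on this page.

```python
import re

# Pattern copied from the WordPunctTokenizer default in the __init__
# signature: runs of word characters, or runs of punctuation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def split_paragraphs(text):
    """Hypothetical helper: split text wherever one or more blank lines occur."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block for block in blocks if block.strip()]


def tokenize_words(text):
    """Hypothetical helper: return words and punctuation runs as separate tokens."""
    return WORD_PUNCT.findall(text)


sample = "Hello, world!\nSecond line.\n\n\nNew paragraph here."
paragraphs = split_paragraphs(sample)
tokens = tokenize_words(paragraphs[0])
```

Here a single newline keeps two lines in the same paragraph, while any run of blank lines starts a new one.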

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.
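
The customization pattern described above, where a subclass changes behavior only by overriding a class variable, might look roughly like the sketch below. Every class here (`LineView`, `SkipPrefaceView`, `ToyReader`, `ToyPrefaceSkippingReader`) and the `*** START ***` marker are hypothetical toy stand-ins written for illustration; none of them is part of NLTK, and the real corpus view interface is different.

```python
class LineView:
    """Toy stand-in for a corpus view: exposes every line of a document."""

    def __init__(self, text):
        self.text = text

    def lines(self):
        return self.text.splitlines()


class SkipPrefaceView(LineView):
    """Toy view that drops everything up to and including a marker line."""

    MARKER = "*** START ***"  # hypothetical preface delimiter

    def lines(self):
        all_lines = super().lines()
        if self.MARKER in all_lines:
            return all_lines[all_lines.index(self.MARKER) + 1:]
        return all_lines


class ToyReader:
    """Toy reader that builds its view from the CorpusView class variable."""

    CorpusView = LineView

    def lines(self, text):
        return self.CorpusView(text).lines()


class ToyPrefaceSkippingReader(ToyReader):
    # Only the class variable changes; the reading logic is inherited.
    CorpusView = SkipPrefaceView


doc = "License text\n*** START ***\nChapter 1"
plain = ToyReader().lines(doc)
skipped = ToyPrefaceSkippingReader().lines(doc)
```

The design choice being illustrated is that the reader never names a concrete view class in its methods, so a subclass can swap the view without reimplementing any access method.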

CorpusView
The corpus view class used by this reader.
__init__(self, root, files, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, disc...,'tokenizers/punkt/english.pickle'), para_block_reader=read_blankline_block, encoding=None)
Construct a new plaintext corpus reader for a set of documents located at the given root directory.
raw(self, files=None)
Returns: the given file or files as a single string.
words(self, files=None)
Returns: the given file or files as a list of words and punctuation symbols.
Return type: list of str
sents(self, files=None)
Returns: the given file or files as a list of sentences or utterances, each encoded as a list of word strings.
Return type: list of (list of str)
paras(self, files=None)
Returns: the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type: list of (list of (list of str))
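
The nesting of the three list-valued return types can be seen with a small hand-written sample. This is only a sketch of the documented shapes, using made-up data; it does not call the reader and makes no claim about how the reader computes these values internally.

```python
from itertools import chain

# Hand-written sample in the shape documented for paras():
# paragraphs -> sentences -> word strings.
paras = [
    [["The", "cat", "sat", "."], ["It", "purred", "."]],
    [["A", "new", "paragraph", "."]],
]

# Flattening one level gives the shape documented for sents().
sents = list(chain.from_iterable(paras))

# Flattening a second level gives the flat list documented for words().
words = list(chain.from_iterable(sents))
```

Each method therefore adds one level of grouping over the previous one: `words` is flat, `sents` groups words by sentence, and `paras` groups sentences by paragraph.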
Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, files, open

Inherited from api.CorpusReader (private): _get_root

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

    Deprecated since 0.8
    Deprecated since 0.9.1

Inherited from api.CorpusReader: filenames

Inherited from api.CorpusReader (private): _get_items

Inherited from api.CorpusReader (private): _encoding, _files, _root

Inherited from api.CorpusReader: root

Inherited from object: __class__

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

