
Class PlaintextCorpusReader

      object --+    
               |    
api.CorpusReader --+
                   |
                  PlaintextCorpusReader

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be separated by blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
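
The conventions described above can be sketched in plain Python. This is an illustration only, not NLTK's implementation: `split_paragraphs` and `tokenize_words` are hypothetical helper names, and the regular expression is the pattern shown in the default `word_tokenizer` argument of the constructor signature.

```python
import re

# Pattern copied from the default word_tokenizer in the constructor
# signature: runs of word characters, or runs of non-space punctuation.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")


def split_paragraphs(text):
    """Split text into paragraphs at blank lines (hypothetical helper)."""
    blocks = re.split(r"\n\s*\n", text)
    return [block.strip() for block in blocks if block.strip()]


def tokenize_words(text):
    """Return words and punctuation runs as separate tokens (hypothetical helper)."""
    return WORD_PATTERN.findall(text)


sample = "Hello, world!\nSecond line.\n\nA new paragraph."
paragraphs = split_paragraphs(sample)
tokens = tokenize_words(paragraphs[0])
```

A single newline does not end a paragraph in this sketch; only a blank line does, which matches the assumption stated in the class description.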

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.
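
The subclass-and-override pattern mentioned above might look roughly like the following. Every class here is a hypothetical stand-in written for illustration (none of them is part of NLTK), and the `*** START ***` marker is an invented preface delimiter. The point is only that a reader which builds its views through a class variable can be customized by rebinding that variable in a subclass.

```python
class SketchView:
    """Stand-in for a corpus view: exposes the lines of a document."""

    def __init__(self, text):
        self.text = text

    def lines(self):
        return self.text.splitlines()


class SkipPrefaceView(SketchView):
    """View that drops everything up to and including a marker line."""

    MARKER = "*** START ***"  # hypothetical preface delimiter

    def lines(self):
        all_lines = super().lines()
        if self.MARKER in all_lines:
            return all_lines[all_lines.index(self.MARKER) + 1:]
        return all_lines


class SketchReader:
    """Stand-in reader that builds its views through a class variable."""

    CorpusView = SketchView

    def read_lines(self, text):
        # Looking the class up on self lets subclasses swap the view.
        return self.CorpusView(text).lines()


class PrefaceSkippingReader(SketchReader):
    # The only change needed: point the class variable at another view.
    CorpusView = SkipPrefaceView


document = "Title page\n*** START ***\nFirst real line.\nSecond real line."
plain = SketchReader().read_lines(document)
trimmed = PrefaceSkippingReader().read_lines(document)
```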

Nested Classes
  CorpusView
The corpus view class used by this reader.
Instance Methods
 
__init__(self, root, files, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, disc..., sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), para_block_reader=<function read_blankline_block at 0x1575470>, encoding=None)
Construct a new plaintext corpus reader for a set of documents located at the given root directory.
str
raw(self, files=None)
Returns: the given file or files as a single string.
list of str
words(self, files=None)
Returns: the given file or files as a list of words and punctuation symbols.
list of (list of str)
sents(self, files=None)
Returns: the given file or files as a list of sentences or utterances, each encoded as a list of word strings.
list of (list of (list of str))
paras(self, files=None)
Returns: the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
 
_read_word_block(self, stream)
 
_read_sent_block(self, stream)
 
_read_para_block(self, stream)

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, files, open

Inherited from api.CorpusReader (private): _get_root

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

    Deprecated since 0.8
 
read(*args, **kwargs)
 
tokenized(*args, **kwargs)
    Deprecated since 0.9.1

Inherited from api.CorpusReader: filenames

Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _files, _root

Properties

Inherited from api.CorpusReader: root

Inherited from object: __class__

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Method Details

__init__(self, root, files, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, disc..., sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), para_block_reader=<function read_blankline_block at 0x1575470>, encoding=None)
(Constructor)

Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')
Parameters:
  • root - The root directory for this corpus.
  • files - A list or regexp specifying the files in this corpus.
  • word_tokenizer - Tokenizer for breaking sentences or paragraphs into words.
  • sent_tokenizer - Tokenizer for breaking paragraphs into sentences.
  • para_block_reader - The block reader used to divide the corpus into paragraph blocks.
Overrides: api.CorpusReader.__init__
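
The `files` parameter accepts either a list or a regexp. The sketch below shows one way such an argument could be resolved against the files found under the root. `select_files` is a hypothetical helper, not an NLTK function, and the rule that the regexp must match the whole relative path is an assumption made for this illustration.

```python
import re


def select_files(files, available):
    """Resolve a `files` argument against known relative paths (hypothetical helper)."""
    if isinstance(files, str):
        # Assumption: a regexp must match the entire relative path.
        pattern = re.compile(files)
        return sorted(name for name in available if pattern.fullmatch(name))
    # Otherwise treat the argument as an explicit list of file names.
    return list(files)


available = ["a.txt", "b.txt", "notes.md", "sub/c.txt"]
by_regexp = select_files(r".*\.txt", available)
by_list = select_files(["a.txt", "notes.md"], available)
```

Under that assumption, a pattern such as `r'.*\.txt'` selects every `.txt` file, including those in subdirectories, while an explicit list is used as given.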

raw(self, files=None)

Returns: str
the given file or files as a single string.

words(self, files=None)

Returns: list of str
the given file or files as a list of words and punctuation symbols.

sents(self, files=None)

Returns: list of (list of str)
the given file or files as a list of sentences or utterances, each encoded as a list of word strings.

paras(self, files=None)

Returns: list of (list of (list of str))
the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
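
The return types of `words`, `sents`, and `paras` differ only in nesting depth. The following sketch builds the three shapes from a short string so the relationship is visible. It is not how NLTK computes them: `naive_sentences` is a crude, hypothetical stand-in for the Punkt sentence tokenizer, and `build_paras` is likewise an invented helper.

```python
import re

# Same word pattern as the default word_tokenizer in the constructor signature.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")


def naive_sentences(paragraph):
    """Split after '.', '!' or '?' followed by whitespace (crude stand-in for Punkt)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def build_paras(text):
    """Return a list of paragraphs, each a list of sentences, each a list of words."""
    blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
    return [
        [WORD_PATTERN.findall(sentence) for sentence in naive_sentences(block)]
        for block in blocks
    ]


text = "It rained. We stayed in.\n\nThen the sun came out!"
paras = build_paras(text)
# Flattening one level gives the shape of sents(); two levels, words().
sents = [sentence for para in paras for sentence in para]
words = [word for sentence in sents for word in sentence]
```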

read(*args, **kwargs)

Decorators:
  • @deprecated("Use .raw() or .words() instead.")

Deprecated: Use .raw() or .words() instead.

tokenized(*args, **kwargs)

Decorators:
  • @deprecated("Use .words() instead.")

Deprecated: Use .words() instead.
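
For readers unfamiliar with the `@deprecated(...)` decorators listed above, here is a minimal sketch of how a decorator of that shape could work. It is an assumption about the general technique, not NLTK's actual implementation, and `LegacyReader` is a hypothetical class. A wrapper of this kind would also explain why the deprecated methods are documented with a generic `(*args, **kwargs)` signature.

```python
import functools
import warnings


def deprecated(message):
    """Decorator factory: warn with `message` each time the function is called."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


class LegacyReader:
    """Hypothetical reader used only to show the decorator in action."""

    def words(self):
        return ["a", "b"]

    @deprecated("Use .words() instead.")
    def tokenized(self, *args, **kwargs):
        # Old name kept for compatibility; it forwards to the replacement.
        return self.words(*args, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = LegacyReader().tokenized()
```

New code should call `.raw()` or `.words()` directly, as the deprecation messages advise.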