Reader for corpora that consist of plaintext documents. Paragraphs
are assumed to be split using blank lines. Sentences and words can be
tokenized using the default tokenizers, or by custom tokenizers
specified as parameters to the constructor.
This corpus reader can be customized (e.g., to skip preface sections
of specific document formats) by creating a subclass and overriding the
CorpusView class variable.
__init__(self,
         root,
         files,
         word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, ...),
         sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'),
         para_block_reader=read_blankline_block,
         encoding=None)

    Construct a new plaintext corpus reader for a set of documents
    located at the given root directory.
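The default para_block_reader splits a plaintext document into paragraph blocks at runs of blank lines. As a minimal sketch of that behaviour, here is a hypothetical stdlib-only helper (not NLTK's actual read_blankline_block, which operates on a stream):

```python
import re

def split_blankline_paragraphs(text):
    """Split plaintext into paragraphs on runs of blank lines.

    Hypothetical illustration of the blank-line convention; NLTK's
    read_blankline_block reads blocks incrementally from a stream.
    """
    # One or more whitespace-only lines delimit paragraphs.
    blocks = re.split(r'\n\s*\n', text.strip())
    return [block.strip() for block in blocks if block.strip()]

sample = "First paragraph,\nstill first.\n\nSecond paragraph.\n"
paragraphs = split_blankline_paragraphs(sample)
```

A custom para_block_reader passed to the constructor would implement the same kind of segmentation for corpora that mark paragraphs differently.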
str

raw(self, files=None)

    Returns: the given file or files as a single string.
list of str

words(self, files=None)

    Returns: the given file or files as a list of words and
    punctuation symbols.
list of (list of str)

sents(self, files=None)

    Returns: the given file or files as a list of sentences or
    utterances, each encoded as a list of word strings.
list of (list of (list of str))

paras(self, files=None)

    Returns: the given file or files as a list of paragraphs, each
    encoded as a list of sentences, which are in turn encoded as
    lists of word strings.
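The four access methods above return progressively more structured views of the same text: a string, a flat word list, sentences as word lists, and paragraphs as lists of sentences. The hypothetical class below illustrates those nested return shapes using naive regex tokenization in place of NLTK's word and sentence tokenizers:

```python
import re

class TinyPlaintextReader:
    """Illustration of the raw/words/sents/paras return structures.

    Hypothetical sketch; real readers use configurable tokenizers and
    read from files under a corpus root, not from an in-memory string.
    """

    def __init__(self, text):
        self._text = text

    def raw(self):
        # str: the document as a single string
        return self._text

    def words(self):
        # list of str: words and punctuation symbols
        return re.findall(r'\w+|[^\w\s]+', self._text)

    def sents(self):
        # list of (list of str): sentences, each a list of word strings
        flat = self._text.replace('\n', ' ').strip()
        sentences = re.split(r'(?<=[.!?])\s+', flat)
        return [re.findall(r'\w+|[^\w\s]+', s) for s in sentences if s]

    def paras(self):
        # list of (list of (list of str)): paragraphs -> sentences -> words
        result = []
        for para in re.split(r'\n\s*\n', self._text.strip()):
            flat = para.replace('\n', ' ').strip()
            sentences = re.split(r'(?<=[.!?])\s+', flat)
            result.append(
                [re.findall(r'\w+|[^\w\s]+', s) for s in sentences if s])
        return result

reader = TinyPlaintextReader("Hello world. Bye.\n\nNew para.")
```

Here `reader.paras()` nests three levels deep, matching the `list of (list of (list of str))` return type documented for paras().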
Inherited from api.CorpusReader:
    __repr__, abspath, abspaths, encoding, files, open
Inherited from object:
    __delattr__, __getattribute__, __hash__, __new__, __reduce__,
    __reduce_ex__, __setattr__, __str__
Inherited from api.CorpusReader:
    filenames