Reader for chunked (and optionally tagged) corpora. Paragraphs are
split using a block reader. They are then tokenized into sentences using
a sentence tokenizer. Finally, these sentences are parsed into chunk
trees using a string-to-chunktree conversion function. Each of these
steps can be performed using a default function or a custom function. By
default, paragraphs are split on blank lines; sentences are listed one
per line; and sentences are parsed into chunk trees using chunk.tagstr2tree.
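The three stages described above can be sketched in plain Python. This is a minimal illustration of the idea, not the reader's actual implementation: the helper names (`read_blankline_paragraphs`, `split_lines`, `tagstr_to_chunks`, `read_chunked`) are hypothetical, the bracketed `word/TAG` input format is assumed, and a chunk tree is stood in for by a plain list rather than a Tree object.

```python
import re

def read_blankline_paragraphs(text):
    """Stage 1 (assumed default): split raw text into paragraphs on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text.strip()) if p]

def split_lines(paragraph):
    """Stage 2 (assumed default): treat each non-empty line as one sentence."""
    return [line for line in paragraph.split("\n") if line.strip()]

def tagstr_to_chunks(sentence, chunk_label="NP", sep="/"):
    """Stage 3: parse '[ the/DT cat/NN ] sat/VBD' into a shallow tree.

    The tree is a list whose items are either (word, tag) tuples or
    (chunk_label, [(word, tag), ...]) pairs for bracketed chunks.
    """
    tree, chunk = [], None
    for token in sentence.split():
        if token == "[":
            chunk = []                      # open a new chunk
        elif token == "]":
            tree.append((chunk_label, chunk))
            chunk = None                    # close the current chunk
        else:
            word, found, tag = token.rpartition(sep)
            item = (word, tag) if found else (token, None)
            (tree if chunk is None else chunk).append(item)
    return tree

def read_chunked(text,
                 para_reader=read_blankline_paragraphs,
                 sent_tokenizer=split_lines,
                 str2chunktree=tagstr_to_chunks):
    """Run the three stages; any stage can be swapped for a custom function."""
    return [[str2chunktree(s) for s in sent_tokenizer(p)]
            for p in para_reader(text)]

sample = (
    "[ the/DT cat/NN ] sat/VBD\n"
    "[ it/PRP ] purred/VBD\n"
    "\n"
    "[ the/DT dog/NN ] barked/VBD\n"
)
paras = read_chunked(sample)
```

Passing a different callable for any of the three keyword arguments changes only that stage, which mirrors how the constructor accepts a custom block reader, sentence tokenizer, or conversion function.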
__init__(self,
    root,
    files,
    extension='',
    str2chunktree=<function tagstr2tree at 0x154bc70>,
    sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f... ,
    para_block_reader=<function read_blankline_block at 0x1575470>,
    encoding=None)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature

str
raw(self, files=None)
Returns:
the given file or files as a single string.

list of str
words(self, files=None)
Returns:
the given file or files as a list of words and punctuation symbols.

list of (list of str)
sents(self, files=None)
Returns:
the given file or files as a list of sentences or utterances, each
encoded as a list of word strings.

list of (list of (list of str))
paras(self, files=None)
Returns:
the given file or files as a list of paragraphs, each encoded as a
list of sentences, which are in turn encoded as lists of word
strings.

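The return types of raw, words, sents, and paras differ only in how deeply the same tokens are nested. The sketch below uses made-up sample data to show that relationship by flattening one level at a time. It illustrates the shapes only and makes no claim about how the reader computes them internally.

```python
from itertools import chain

# Shape returned by paras(): list of paragraphs -> sentences -> word strings.
paras = [
    [["The", "cat", "sat", "."], ["It", "purred", "."]],
    [["The", "dog", "barked", "."]],
]

# Dropping the paragraph level gives the shape returned by sents().
sents = list(chain.from_iterable(paras))

# Dropping the sentence level as well gives the shape returned by words().
words = list(chain.from_iterable(sents))
```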
list of (str,str)
tagged_words(self, files=None)
Returns:
the given file or files as a list of tagged words and punctuation
symbols, encoded as tuples (word,tag).

list of (list of (str,str))
tagged_sents(self, files=None)
Returns:
the given file or files as a list of sentences, each encoded as a
list of (word,tag) tuples.

list of (list of (list of (str,str)))
tagged_paras(self, files=None)
Returns:
the given file or files as a list of paragraphs, each encoded as a
list of sentences, which are in turn encoded as lists of
(word,tag) tuples.

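The tagged methods return (word,tag) tuples where the untagged methods return bare word strings. As a rough sketch, assuming the corpus stores tokens as `word/TAG` with a slash separator, a token can be split on its last separator so that words which themselves contain a slash survive intact. The helper `str_to_tuple` is hypothetical and written for this example only.

```python
def str_to_tuple(token, sep="/"):
    """Split 'word/TAG' on the LAST separator; untagged tokens get tag None."""
    word, found, tag = token.rpartition(sep)
    return (word, tag) if found else (token, None)

line = "The/DT price/NN rose/VBD 1/2/CD point/NN ./."

# Shape of one element of tagged_sents(): a list of (word, tag) tuples.
tagged_sent = [str_to_tuple(t) for t in line.split()]

# Discarding the tags gives the shape of one element of sents().
untagged_sent = [word for word, _tag in tagged_sent]
```

Splitting on the last separator rather than the first is what keeps `1/2/CD` as the word `1/2` with tag `CD`.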
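list of ((str,str) and Tree)
chunked_words(self, files=None)
Returns:
the given file or files as a list of tagged words and chunks, with
words encoded as (word,tag) tuples and chunks as shallow Tree objects.
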
list of Tree
chunked_sents(self, files=None)
Returns:
the given file or files as a list of sentences, each encoded as a
shallow Tree.

list of (list of Tree)
chunked_paras(self, files=None)
Returns:
the given file or files as a list of paragraphs, each encoded as a
list of sentences, each of which is in turn encoded as a shallow
Tree.

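A shallow tree has at most one level of chunk nodes above the tagged words, so it can be walked without recursion. The sketch below stands in for a Tree with plain tuples and lists, which is an assumption made to keep the example self-contained. The helper names `is_chunk` and `leaves` are hypothetical and are not the Tree API.

```python
# One chunked sentence: chunk nodes are (label, [(word, tag), ...]);
# words outside any chunk are plain (word, tag) tuples.
chunked_sent = [
    ("NP", [("the", "DT"), ("cat", "NN")]),
    ("sat", "VBD"),
    ("on", "IN"),
    ("NP", [("the", "DT"), ("mat", "NN")]),
]

def is_chunk(node):
    """A node is a chunk when its second element is a list of children."""
    return isinstance(node[1], list)

def leaves(sent):
    """Flatten a shallow tree back into its (word, tag) tuples."""
    out = []
    for node in sent:
        out.extend(node[1] if is_chunk(node) else [node])
    return out

# Collect the text of every chunk in the sentence.
noun_phrases = [
    " ".join(word for word, _tag in node[1])
    for node in chunked_sent
    if is_chunk(node)
]
```

Flattening with `leaves` recovers the tagged-sentence shape, while filtering on `is_chunk` keeps only the chunk nodes.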
Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, files, open
Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Inherited from api.CorpusReader: filenames