Reader for simple part-of-speech tagged corpora. Paragraphs are
assumed to be separated by blank lines. Sentences and words can be
tokenized using the default tokenizers, or by custom tokenizers specified
as parameters to the constructor. Words are parsed using
nltk.tag.str2tuple. By default, '/' is used as the separator, i.e.,
words should have the form:

    word1/tag1 word2/tag2 word3/tag3 ...

Custom separators may be specified as parameters to the constructor.
Part-of-speech tags are case-normalized to upper case.
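The word/tag convention described above can be illustrated with a small, self-contained sketch. The helper name `split_tagged_token` is hypothetical and is not NLTK's code; it assumes the token is split at the rightmost occurrence of the separator and that a token with no separator has no tag.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word<sep>tag' into (word, TAG). Hypothetical helper, not NLTK's code."""
    # Assumption: split at the rightmost separator, so a word that itself
    # contains the separator (e.g. "1/2/CD") keeps it.
    word, found, tag = token.rpartition(sep)
    if not found:
        # Assumption: a token without a separator carries no tag.
        return (token, None)
    # Tags are case-normalized to upper case, as the description states.
    return (word, tag.upper())


print(split_tagged_token("fly/nn"))           # ('fly', 'NN')
print(split_tagged_token("fly|nn", sep="|"))  # ('fly', 'NN')
```

The second call shows the effect of a custom separator, which in the real reader would be passed to the constructor as `sep`.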

__init__(self,
         root,
         files,
         sep='/',
         word_tokenizer=WhitespaceTokenizer(),
         sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f...,
         para_block_reader=<function read_blankline_block at 0x1575470>,
         encoding=None,
         tag_mapping_function=None)
    Construct a new Tagged Corpus reader for a set of documents located
    at the given root directory.

raw(self, files=None)
    Returns: the given file or files as a single string.
    Return type: str

words(self, files=None)
    Returns: the given file or files as a list of words and punctuation
    symbols.
    Return type: list of str

sents(self, files=None)
    Returns: the given file or files as a list of sentences or utterances,
    each encoded as a list of word strings.
    Return type: list of (list of str)

paras(self, files=None)
    Returns: the given file or files as a list of paragraphs, each encoded
    as a list of sentences, which are in turn encoded as lists of word
    strings.
    Return type: list of (list of (list of str))
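
The nesting of the words, sents and paras return values can be sketched in plain Python. This is an illustration only, not the reader's implementation: `word_of` and `segment` are hypothetical helpers, and the sketch assumes blank-line paragraph breaks, one sentence per line, whitespace-separated tokens and the default '/' separator. The real reader uses whatever tokenizers were passed to the constructor.

```python
import re


def word_of(token, sep="/"):
    """Return the word part of 'word<sep>tag' (hypothetical helper)."""
    head, found, _tag = token.rpartition(sep)
    return head if found else token


def segment(text, sep="/"):
    """Return text as paragraphs -> sentences -> word strings, tags dropped."""
    paras = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Assumption: each non-empty line is one sentence of
        # whitespace-separated tokens.
        sents = [[word_of(tok, sep) for tok in line.split()]
                 for line in block.split("\n") if line.strip()]
        if sents:
            paras.append(sents)
    return paras


text = "The/AT dog/NN barked/VBD ./.\nIt/PPS ran/VBD ./.\n\nA/AT cat/NN slept/VBD ./.\n"

paras = segment(text)                       # list of (list of (list of str))
sents = [s for p in paras for s in p]       # list of (list of str)
words = [w for s in sents for w in s]       # list of str
print(len(paras), len(sents), len(words))   # 2 3 11
```

Under these assumptions, each flatter view is the one above it with one level of nesting removed.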

tagged_words(self, files=None, simplify_tags=False)
    Returns: the given file or files as a list of tagged words and
    punctuation symbols, encoded as tuples (word,tag).
    Return type: list of (str,str)

tagged_sents(self, files=None, simplify_tags=False)
    Returns: the given file or files as a list of sentences, each encoded
    as a list of (word,tag) tuples.
    Return type: list of (list of (str,str))

tagged_paras(self, files=None, simplify_tags=False)
    Returns: the given file or files as a list of paragraphs, each encoded
    as a list of sentences, which are in turn encoded as lists of
    (word,tag) tuples.
    Return type: list of (list of (list of (str,str)))
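
The tagged return shapes can be sketched the same way, here with a custom separator. `tagged_views` is a hypothetical function written for illustration, not part of NLTK; it assumes paragraphs are separated by exactly one blank line, one sentence per line, and whitespace-separated tokens, and it ignores the simplify_tags option.

```python
def tagged_views(text, sep="/"):
    """Return (tagged_words, tagged_sents, tagged_paras) for a small text.

    Hypothetical sketch of the three return shapes, not NLTK's implementation.
    """
    def parse(token):
        # Split at the rightmost separator and upper-case the tag.
        word, found, tag = token.rpartition(sep)
        return (word, tag.upper()) if found else (token, None)

    # Assumption: a single blank line ("\n\n") separates paragraphs.
    tagged_paras = [
        [[parse(tok) for tok in line.split()]
         for line in block.splitlines() if line.strip()]
        for block in text.strip().split("\n\n")
    ]
    tagged_sents = [s for p in tagged_paras for s in p]
    tagged_words = [wt for s in tagged_sents for wt in s]
    return tagged_words, tagged_sents, tagged_paras


text = "the|at dog|nn\nit|pps ran|vbd\n\na|at cat|nn\n"
tagged_words, tagged_sents, tagged_paras = tagged_views(text, sep="|")
print(tagged_words[0])    # ('the', 'AT')
print(len(tagged_paras))  # 2
```

The lower-case tags in the sample text come back upper-cased, matching the case normalization stated in the class description.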

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, files, open
Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Inherited from api.CorpusReader: filenames