Module nltk.corpus.reader.tagged

Class TaggedCorpusReader

      object --+    
               |    
api.CorpusReader --+
                   |
                  TaggedCorpusReader
Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be separated by blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator, so each word should have the form:

  word1/tag1 word2/tag2 word3/tag3 ...

Custom separators may be specified as parameters to the constructor. Part-of-speech tags are case-normalized to upper case.
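The word/tag format described above can be illustrated with a small pure-Python stand-in. This is only a sketch and not NLTK's implementation: parse_tagged_token is a hypothetical helper, and splitting on the last occurrence of the separator and returning None for an untagged token are assumptions made for the example.

```python
def parse_tagged_token(token, sep='/'):
    """Split 'word<sep>tag' into a (word, TAG) tuple (hypothetical helper)."""
    # Assumption: split on the LAST separator, so a word that itself
    # contains the separator keeps everything before the final one.
    word, found, tag = token.rpartition(sep)
    if not found:
        # Assumption: a token without a separator has no tag.
        return (token, None)
    # Tags are case-normalized to upper case, as described above.
    return (word, tag.upper())

line = "The/at fly/nn buzzed/vbd ./."
tagged = [parse_tagged_token(t) for t in line.split()]
print(tagged)
```

Passing a different sep, for example parse_tagged_token('dog|nn', sep='|'), mirrors the custom separator option of the constructor.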

Instance Methods

__init__(self, root, files, sep='/', word_tokenizer=WhitespaceTokenizer(), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f..., para_block_reader=read_blankline_block, encoding=None, tag_mapping_function=None)
Construct a new Tagged Corpus reader for a set of documents located at the given root directory.
str
raw(self, files=None)
Returns: the given file or files as a single string.
list of str
words(self, files=None)
Returns: the given file or files as a list of words and punctuation symbols.
list of (list of str)
sents(self, files=None)
Returns: the given file or files as a list of sentences or utterances, each encoded as a list of word strings.
list of (list of (list of str))
paras(self, files=None)
Returns: the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
list of (str,str)
tagged_words(self, files=None, simplify_tags=False)
Returns: the given file or files as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
list of (list of (str,str))
tagged_sents(self, files=None, simplify_tags=False)
Returns: the given file or files as a list of sentences, each encoded as a list of (word,tag) tuples.
list of (list of (list of (str,str)))
tagged_paras(self, files=None, simplify_tags=False)
Returns: the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, files, open

Inherited from api.CorpusReader (private): _get_root

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

    Deprecated since 0.8

read(*args, **kwargs)
tokenized(*args, **kwargs)
tagged(*args, **kwargs)

    Deprecated since 0.9.1

Inherited from api.CorpusReader: filenames

Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _files, _root

Properties

Inherited from api.CorpusReader: root

Inherited from object: __class__

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Method Details

__init__(self, root, files, sep='/', word_tokenizer=WhitespaceTokenizer(), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f..., para_block_reader=read_blankline_block, encoding=None, tag_mapping_function=None)
(Constructor)

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

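>>> from nltk.corpus.reader.tagged import TaggedCorpusReader
>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')  # files: regexp matching every .txt file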
Parameters:
  • root - The root directory for this corpus.
  • files - A list or regexp specifying the files in this corpus.
Overrides: api.CorpusReader.__init__
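
The default arguments in the signature above suggest a three-level split: blank lines separate paragraphs, newlines separate sentences, and whitespace separates words. The sketch below approximates that behaviour in plain Python without importing NLTK. The three helper functions are hypothetical stand-ins for para_block_reader, sent_tokenizer and word_tokenizer, and they only mimic the defaults under those assumptions.

```python
import re

def split_paragraphs(text):
    """Hypothetical stand-in for the blank-line block reader."""
    # One or more blank (or whitespace-only) lines end a paragraph.
    blocks = re.split(r'\n\s*\n', text.strip())
    return [block for block in blocks if block.strip()]

def split_sentences(block):
    """Hypothetical stand-in for the newline-based sentence tokenizer."""
    # Each non-empty line is treated as one sentence.
    return [line for line in block.split('\n') if line.strip()]

def split_words(sentence):
    """Hypothetical stand-in for the whitespace word tokenizer."""
    return sentence.split()

text = "The/at dog/nn barked/vbd\nIt/pps ran/vbd\n\nA/at cat/nn slept/vbd\n"

# Paragraphs -> sentences -> raw 'word/tag' tokens, before tag parsing.
paras = [[split_words(s) for s in split_sentences(p)]
         for p in split_paragraphs(text)]
print(paras)
```

Each token in the result is still a raw word/tag string; the reader would then split it on the separator to obtain the word and its tag.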

raw(self, files=None)

Returns: str
the given file or files as a single string.

words(self, files=None)

Returns: list of str
the given file or files as a list of words and punctuation symbols.

sents(self, files=None)

Returns: list of (list of str)
the given file or files as a list of sentences or utterances, each encoded as a list of word strings.

paras(self, files=None)

Returns: list of (list of (list of str))
the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

tagged_words(self, files=None, simplify_tags=False)

Returns: list of (str,str)
the given file or files as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

tagged_sents(self, files=None, simplify_tags=False)

Returns: list of (list of (str,str))
the given file or files as a list of sentences, each encoded as a list of (word,tag) tuples.

tagged_paras(self, files=None, simplify_tags=False)

Returns: list of (list of (list of (str,str)))
the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
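
The return types listed above nest one level deeper at each step from words to sentences to paragraphs. The following sketch uses hand-written sample data in the documented shape of a tagged_paras() result and flattens it to the shapes documented for tagged_sents(), tagged_words() and words(). It assumes the views of a given file are consistent with one another, and it does not call NLTK.

```python
from itertools import chain

# Sample data in the documented shape of tagged_paras():
# list of paragraphs -> list of sentences -> list of (word, tag) tuples.
tagged_paras = [
    [[('The', 'AT'), ('dog', 'NN')], [('It', 'PPS'), ('ran', 'VBD')]],
    [[('A', 'AT'), ('cat', 'NN')]],
]

# Dropping the paragraph level gives the shape of tagged_sents().
tagged_sents = list(chain.from_iterable(tagged_paras))

# Dropping the sentence level as well gives the shape of tagged_words().
tagged_words = list(chain.from_iterable(tagged_sents))

# Discarding the tags gives the shape of words().
words = [word for word, _tag in tagged_words]

print(tagged_sents)
print(tagged_words)
print(words)
```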

read(*args, **kwargs)

Decorators:
  • @deprecated("Use .raw() or .words() or .sents() or .paras() or " ".tagged_words() or .tagged_sents() or .tagged_paras() " "instead.")

Deprecated: Use .raw() or .words() or .sents() or .paras() or .tagged_words() or .tagged_sents() or .tagged_paras() instead.

tokenized(*args, **kwargs)

Decorators:
  • @deprecated("Use .words() or .sents() or .paras() instead.")

Deprecated: Use .words() or .sents() or .paras() instead.

tagged(*args, **kwargs)

Decorators:
  • @deprecated("Use .tagged_words() or .tagged_sents() or " ".tagged_paras() instead.")

Deprecated: Use .tagged_words() or .tagged_sents() or .tagged_paras() instead.
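
The @deprecated(...) decorators listed above take a message that tells callers which methods to use instead. As a rough illustration of how a decorator of that form can work, here is a minimal sketch built on the standard warnings module. It is an assumption for illustration only and not NLTK's own implementation; the deprecated function and the stub tokenized function below are hypothetical.

```python
import functools
import warnings

def deprecated(message):
    """Hypothetical decorator factory: warn, then call the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated("Use .words() or .sents() or .paras() instead.")
def tokenized(*args, **kwargs):
    # Stub body standing in for a legacy method.
    return "legacy result"

# Record warnings so the effect of the decorator can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = tokenized()

print(result)
print(caught[0].message)
```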