Class TaggedCorpusReader

      object --+    
               |    
api.CorpusReader --+
                   |
                  TaggedCorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator. I.e., words should have the form:

  word1/tag1 word2/tag2 word3/tag3 ...

A custom separator may be specified through the constructor's sep parameter. Part-of-speech tags are case-normalized to upper case.
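The word/tag convention described above can be illustrated with a small standalone sketch. It does not call NLTK: split_tagged_token is a hypothetical helper written for this illustration, and it only approximates what nltk.tag.str2tuple is described as doing (split at the separator, upper-case the tag). Splitting at the last separator and returning None for an untagged token are assumptions of the sketch.

```python
def split_tagged_token(token, sep="/"):
    """Hypothetical stand-in for the word/tag split described above.

    Splits at the LAST occurrence of `sep` (an assumption of this sketch)
    and normalizes the tag to upper case.
    """
    idx = token.rfind(sep)
    if idx < 0:
        # No separator found: treat the token as untagged (assumption).
        return (token, None)
    return (token[:idx], token[idx + len(sep):].upper())


line = "The/at dog/nn barked/vbd ./."
pairs = [split_tagged_token(tok) for tok in line.split()]
print(pairs)

# A custom separator, analogous to passing sep to the constructor.
print(split_tagged_token("dog|nn", sep="|"))
```

Splitting at the last separator keeps words that themselves contain the separator intact, so a token such as 1/2/cd would yield the word 1/2 with the tag CD in this sketch.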

Instance Methods

__init__(self, root, files, sep='/', word_tokenizer=WhitespaceTokenizer(), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f..., para_block_reader=read_blankline_block, encoding=None, tag_mapping_function=None)
Construct a new Tagged Corpus reader for a set of documents located at the given root directory.

str

raw(self, files=None)
Returns: the given file or files as a single string.

list of str

words(self, files=None)
Returns: the given file or files as a list of words and punctuation symbols.

list of (list of str)

sents(self, files=None)
Returns: the given file or files as a list of sentences or utterances, each encoded as a list of word strings.

list of (list of (list of str))

paras(self, files=None)
Returns: the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

list of (str,str)

tagged_words(self, files=None, simplify_tags=False)
Returns: the given file or files as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

list of (list of (str,str))

tagged_sents(self, files=None, simplify_tags=False)
Returns: the given file or files as a list of sentences, each encoded as a list of (word,tag) tuples.

list of (list of (list of (str,str)))

tagged_paras(self, files=None, simplify_tags=False)
Returns: the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, files, open

Inherited from api.CorpusReader (private): _get_root

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Deprecated since 0.8

read(*args, **kwargs)

tokenized(*args, **kwargs)

tagged(*args, **kwargs)

Deprecated since 0.9.1

Inherited from api.CorpusReader: filenames

Inherited from api.CorpusReader (private): _get_items

Instance Variables

Inherited from api.CorpusReader (private): _encoding, _files, _root

Properties

Inherited from api.CorpusReader: root

Inherited from object: __class__

Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Method Details

__init__(self, root, files, sep='/', word_tokenizer=WhitespaceTokenizer(), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, f..., para_block_reader=read_blankline_block, encoding=None, tag_mapping_function=None)
(Constructor)

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')

Parameters:

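root - The root directory for this corpus.
files - A list or regexp specifying the files in this corpus.
sep - The separator between each word and its tag ('/' by default).
word_tokenizer - The tokenizer used to split sentences into words.
sent_tokenizer - The tokenizer used to split paragraphs into sentences.
para_block_reader - The function used to read paragraph blocks; by default, paragraphs are split on blank lines.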

Overrides: api.CorpusReader.__init__

raw(self, files=None)

Returns: str: the given file or files as a single string.

words(self, files=None)

Returns: list of str: the given file or files as a list of words and punctuation symbols.

sents(self, files=None)

Returns: list of (list of str): the given file or files as a list of sentences or utterances, each encoded as a list of word strings.

paras(self, files=None)

Returns: list of (list of (list of str)): the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

tagged_words(self, files=None, simplify_tags=False)

Returns: list of (str,str): the given file or files as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

tagged_sents(self, files=None, simplify_tags=False)

Returns: list of (list of (str,str)): the given file or files as a list of sentences, each encoded as a list of (word,tag) tuples.

tagged_paras(self, files=None, simplify_tags=False)

Returns: list of (list of (list of (str,str))): the given file or files as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
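The nesting of the return types above (paragraphs containing sentences containing (word,tag) tuples) may be easier to see in a standalone sketch. The code below does not use NLTK; parse_tagged_paras, flatten_sents, flatten_words, and split_tagged_token are hypothetical helpers that imitate the default behaviour suggested by the constructor signature: paragraphs separated by blank lines, one sentence per line, words separated by whitespace. Treat it as an approximation of the data shapes, not as the reader's implementation.

```python
import re


def split_tagged_token(token, sep="/"):
    """Split 'word<sep>tag' at the last separator; upper-case the tag."""
    idx = token.rfind(sep)
    if idx < 0:
        return (token, None)  # untagged token (assumption of this sketch)
    return (token[:idx], token[idx + len(sep):].upper())


def parse_tagged_paras(text, sep="/"):
    """Shape of tagged_paras(): list of paragraphs -> sentences -> (word, tag)."""
    paras = []
    # Paragraphs are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        sents = []
        # Each non-empty line is treated as one sentence.
        for line in block.split("\n"):
            tokens = line.split()
            if tokens:
                sents.append([split_tagged_token(t, sep) for t in tokens])
        if sents:
            paras.append(sents)
    return paras


def flatten_sents(paras):
    """Shape of tagged_sents(): drop the paragraph level."""
    return [sent for para in paras for sent in para]


def flatten_words(paras):
    """Shape of tagged_words(): drop the paragraph and sentence levels."""
    return [pair for sent in flatten_sents(paras) for pair in sent]


sample = (
    "The/at dog/nn barked/vbd ./.\n"
    "It/pps ran/vbd ./.\n"
    "\n"
    "Cats/nns sleep/vb ./.\n"
)

paras = parse_tagged_paras(sample)
sents = flatten_sents(paras)
tagged = flatten_words(paras)
plain = [word for word, _tag in tagged]  # shape of words(): tags discarded

print(len(paras), len(sents), len(tagged))
print(paras[1])
print(plain)
```

In this sketch the untagged accessors (words, sents, paras) correspond to the same structures with each (word,tag) tuple reduced to its word.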

read(*args, **kwargs)

Decorators:

@deprecated("Use .raw() or .words() or .sents() or .paras() or " ".tagged_words() or .tagged_sents() or .tagged_paras() " "instead.")

Deprecated: Use .raw() or .words() or .sents() or .paras() or .tagged_words() or .tagged_sents() or .tagged_paras() instead.

tokenized(*args, **kwargs)

Decorators:

@deprecated("Use .words() or .sents() or .paras() instead.")

Deprecated: Use .words() or .sents() or .paras() instead.

tagged(*args, **kwargs)

Decorators:

@deprecated("Use .tagged_words() or .tagged_sents() or " ".tagged_paras() instead.")

Deprecated: Use .tagged_words() or .tagged_sents() or .tagged_paras() instead.
