Package nltk :: Package corpus :: Package reader :: Module conll :: Class ConllCorpusReader

Class ConllCorpusReader

      object --+    
               |    
api.CorpusReader --+
                   |
                  ConllCorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded as a table (or grid) of values, where each line corresponds to a single word and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which specifies the columns used by a given corpus.
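
To make the grid layout concrete, here is a minimal standalone sketch of how such a file can be split into sentences and columns. It does not use NLTK and is not the reader's actual implementation; the helper names `read_grids` and `column`, and the sample data, are hypothetical and chosen only for illustration.

```python
import re

# Illustrative sketch only: these helpers are hypothetical, not part of NLTK.

# Two sentences separated by a blank line; one word per line,
# one annotation type per whitespace-separated column.
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP

Confidence NN B-NP
fell VBD B-VP
"""


def read_grids(text):
    """Split CoNLL-style text into one grid (a list of rows) per sentence."""
    grids = []
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        if rows:
            grids.append(rows)
    return grids


def column(grid, columntypes, name):
    """Return the values of the column whose declared type is `name`."""
    index = columntypes.index(name)
    return [row[index] for row in grid]


# The caller declares what each column means, as with the columntypes argument.
columntypes = ("words", "pos", "chunk")
grids = read_grids(SAMPLE)
words = [column(g, columntypes, "words") for g in grids]
tags = [column(g, columntypes, "pos") for g in grids]
```

Declaring the column layout separately from the parsing step is what lets the same code handle corpora whose columns appear in a different order.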


Instance Methods

__init__(self, root, files, columntypes, chunk_types=None, top_node='S', pos_in_tree=False, srl_includes_roleset=True, encoding=None, tree_class=<class 'nltk.tree.Tree'>)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature

raw(self, files=None)

words(self, files=None)

sents(self, files=None)

tagged_words(self, files=None)

tagged_sents(self, files=None)

chunked_words(self, files=None, chunk_types=None)

chunked_sents(self, files=None, chunk_types=None)

parsed_sents(self, files=None, pos_in_tree=None)

srl_spans(self, files=None)

srl_instances(self, files=None, pos_in_tree=None, flatten=True)

iob_words(self, files=None)
Returns: list of tuple
a list of word/tag/IOB tuples

iob_sents(self, files=None)
Returns: list of list
a list of lists of word/tag/IOB tuples

_grids(self, files=None)

_read_grid_block(self, stream)

_get_words(self, grid)

_get_tagged_words(self, grid)

_get_iob_words(self, grid)

_get_chunked_words(self, grid, chunk_types)

_get_parsed_sent(self, grid, pos_in_tree)

_get_srl_spans(self, grid)
list of list of ((start, end), tag) tuples

_get_srl_instances(self, grid, pos_in_tree)

_require(self, *columntypes)

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, files, open

Inherited from api.CorpusReader (private): _get_root

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

    Deprecated since 0.8

read(*args, **kwargs)

chunked(*args, **kwargs)

tokenized(*args, **kwargs)

tagged(*args, **kwargs)

    Deprecated since 0.9.1

Inherited from api.CorpusReader: filenames

Inherited from api.CorpusReader (private): _get_items

Static Methods

_get_column(grid, column_index)

Class Variables
  WORDS = 'words'
column type for words
  POS = 'pos'
column type for part-of-speech tags
  TREE = 'tree'
column type for parse trees
  CHUNK = 'chunk'
column type for chunk structures
  NE = 'ne'
column type for named entities
  SRL = 'srl'
column type for semantic role labels
  IGNORE = 'ignore'
column type for column that should be ignored
  COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', ...
A tuple of all column types supported by the CoNLL corpus reader.
Instance Variables

Inherited from api.CorpusReader (private): _encoding, _files, _root

Properties

Inherited from api.CorpusReader: root

Inherited from object: __class__

    Deprecated since 0.9.1

Inherited from api.CorpusReader: items

Method Details

__init__(self, root, files, columntypes, chunk_types=None, top_node='S', pos_in_tree=False, srl_includes_roleset=True, encoding=None, tree_class=<class 'nltk.tree.Tree'>)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Parameters:
  • root - A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
  • files - A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader's root to each file name.
  • encoding - The default unicode encoding for the files that make up the corpus. encoding's value can be any of the following:
    • A string: encoding is the encoding name for all files.
    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple's regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
    • None: the file contents of all files will be processed using non-unicode byte strings.
  • tag_mapping_function - A function for normalizing or simplifying the POS tags returned by the tagged_words() or tagged_sents() methods.
Overrides: api.CorpusReader.__init__
(inherited documentation)
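
The four accepted forms of the encoding parameter can be summarized with a small lookup sketch. This is an illustration of the rules listed above, not NLTK's code: the function name `resolve_encoding` is hypothetical, and the use of `re.match` (anchored at the start of the identifier) is an assumption about how "matches" is interpreted.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings.

    Hypothetical helper mirroring the documented rules for `encoding`.
    """
    # None: no decoding for any file. A string: one encoding for all files.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # A dictionary: look the file up directly; missing entries mean None.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # A list of (regexp, encoding) tuples: the first matching pattern wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):  # assumption: match from the start
            return name
    return None


rules = [(r".*\.txt", "utf8"), (r".*", "latin-1")]
txt_encoding = resolve_encoding(rules, "a.txt")
other_encoding = resolve_encoding(rules, "a.conll")
```

Because the first matching tuple wins, more specific patterns should be listed before catch-all patterns such as `.*`.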

iob_words(self, files=None)

Parameters:
  • files (None or str or list) - the list of files that make up this corpus
Returns: list of tuple
a list of word/tag/IOB tuples

iob_sents(self, files=None)

Parameters:
  • files (None or str or list) - the list of files that make up this corpus
Returns: list of list
a list of lists of word/tag/IOB tuples
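
The difference between the two return shapes can be illustrated with plain Python data. The values below are made up for the example, and treating the flat list as the concatenation of the per-sentence lists is an assumption based on the documented return types rather than on the source code.

```python
# Hypothetical data in the shape documented for iob_sents():
# a list of sentences, each a list of (word, tag, IOB) tuples.
sents_shape = [
    [("He", "PRP", "B-NP"), ("reckons", "VBZ", "B-VP")],
    [("Confidence", "NN", "B-NP"), ("fell", "VBD", "B-VP")],
]

# The shape documented for iob_words(): one flat list of tuples,
# obtained here by concatenating the sentences (an assumption).
words_shape = [triple for sent in sents_shape for triple in sent]

# Each tuple unpacks into word, part-of-speech tag, and IOB chunk tag.
np_words = [word for word, tag, iob in words_shape if iob.endswith("-NP")]
```

The per-sentence form is useful when sentence boundaries matter, for example when training a chunker; the flat form is convenient for corpus-wide counts.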

read(*args, **kwargs)

Decorators:
  • @deprecated("Use .raw() or .words() or .tagged_words() or " ".chunked_sents() instead.")

Deprecated: Use .raw() or .words() or .tagged_words() or .chunked_sents() instead.

chunked(*args, **kwargs)

Decorators:
  • @deprecated("Use .chunked_sents() instead.")

Deprecated: Use .chunked_sents() instead.

tokenized(*args, **kwargs)

Decorators:
  • @deprecated("Use .words() instead.")

Deprecated: Use .words() instead.

tagged(*args, **kwargs)

Decorators:
  • @deprecated("Use .tagged_words() instead.")

Deprecated: Use .tagged_words() instead.


Class Variable Details

COLUMN_TYPES

A tuple of all column types supported by the CoNLL corpus reader.

Value:
('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')
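
One plausible use of this value is to validate a columntypes argument and map each declared type to its column position. The sketch below is a standalone illustration under that assumption; the function `build_column_map` and its error message are hypothetical and are not taken from the NLTK source.

```python
# Copied from the value shown above.
COLUMN_TYPES = ("words", "pos", "tree", "chunk", "ne", "srl", "ignore")


def build_column_map(columntypes):
    """Validate declared column types and map each one to its column index.

    Hypothetical helper; not part of NLTK.
    """
    for columntype in columntypes:
        if columntype not in COLUMN_TYPES:
            raise ValueError("Bad column type %r" % (columntype,))
    # 'ignore' may appear several times and is never looked up, so skip it.
    return {c: i for i, c in enumerate(columntypes) if c != "ignore"}


# A layout with two columns the caller does not care about.
colmap = build_column_map(("words", "ignore", "pos", "ignore", "chunk"))
```

Rejecting unknown names up front means a misspelled column type fails at construction time instead of producing empty results later.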