nltk.corpus.reader.conll.ConllCorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded as a table (or grid) of values, where each line corresponds to a single word and each column corresponds to an annotation type. The set of columns varies from corpus to corpus, so the ConllCorpusReader constructor takes an argument, columntypes, that specifies the columns used by a given corpus.
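The grid layout described above can be illustrated with a small standalone sketch. It does not use NLTK and is not the reader's actual implementation; the function name read_grids and the sample text are hypothetical, and whitespace-separated columns are an assumption. The column type names are the ones listed under COLUMN_TYPES.

```python
# Column type names as listed under COLUMN_TYPES in this documentation.
SUPPORTED = ("words", "pos", "tree", "chunk", "ne", "srl", "ignore")

# Hypothetical sample: two sentences separated by a blank line,
# one word per line, whitespace-separated columns (an assumption).
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP

Chancellor NNP O
"""


def read_grids(text, columntypes):
    """Split CoNLL-style text into sentences and label each cell by column type."""
    unknown = [name for name in columntypes if name not in SUPPORTED]
    if unknown:
        raise ValueError(f"unsupported column types: {unknown}")

    sentences, current = [], []
    for line in text.splitlines():
        cells = line.split()
        if not cells:
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(cells) != len(columntypes):
            raise ValueError(
                f"expected {len(columntypes)} columns, got {len(cells)}"
            )
        # Columns declared as 'ignore' are dropped from the labelled row.
        current.append(
            {name: cell for name, cell in zip(columntypes, cells) if name != "ignore"}
        )
    if current:
        sentences.append(current)
    return sentences


sentences = read_grids(SAMPLE, ("words", "pos", "chunk"))
```

Here sentences holds one list per sentence, and each row maps a column type to the value found in that column, which is the role columntypes plays for the reader.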

__init__(self, root, files, columntypes, chunk_types=None, top_node='S', pos_in_tree=False, srl_includes_roleset=True, encoding=None, tree_class=<class 'nltk.tree.Tree'>)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Parameters:

root - A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
files - A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader's root to each file name.
encoding - The default unicode encoding for the files that make up the corpus. encoding's value can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple's regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tag_mapping_function - A function for normalizing or simplifying the POS tags returned by the tagged_words() or tagged_sents() methods.
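The lookup rules for the encoding parameter can be sketched as a single function. This is a minimal illustration of the four cases listed above, not NLTK's own code; the name resolve_encoding is hypothetical, and matching each regexp with re.match (anchored at the start of the identifier) is an assumption. A return value of None stands for "process as non-unicode byte strings".

```python
import re


def resolve_encoding(file_id, encoding):
    """Return the encoding name for file_id, or None for byte-string processing."""
    # None: no decoding for any file. String: one encoding for all files.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # Dictionary: look the file identifier up directly; missing means None.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # List: the first (regexp, encoding) tuple whose regexp matches wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):
            return name
    return None


rules = [(r".*\.txt", "utf8"), (r"train.*", "latin-1")]
chosen = resolve_encoding("train.txt", rules)
```

Because the list form stops at the first match, the order of the tuples matters: "train.txt" matches both patterns above, and the earlier one decides.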

Overrides: api.CorpusReader.__init__

(inherited documentation)

iob_sents(self, files=None)

Parameters:

files (None or str or list) - the list of files that make up this corpus

Returns: list of list

a list of lists of word/tag/IOB tuples
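The shape of the return value can be shown with a short standalone sketch. It assumes the corpus declares 'words', 'pos' and 'chunk' columns and that the IOB tag is taken from the 'chunk' column; the helper iob_triples and the sample grids are hypothetical and do not call NLTK.

```python
def iob_triples(grid, columntypes):
    """Turn one sentence grid into a list of (word, tag, IOB) tuples."""
    # Find where each required column sits in this corpus's layout.
    w, p, c = (columntypes.index(name) for name in ("words", "pos", "chunk"))
    return [(row[w], row[p], row[c]) for row in grid]


columntypes = ("words", "pos", "chunk")
# Hypothetical sample: one grid per sentence, one row per word.
grids = [
    [["He", "PRP", "B-NP"], ["reckons", "VBZ", "B-VP"]],
    [["Chancellor", "NNP", "O"]],
]

result = [iob_triples(grid, columntypes) for grid in grids]
```

The outer list has one entry per sentence and each inner list has one tuple per word, matching the "list of list" return type given above.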

COLUMN_TYPES

A list of all column types supported by the conll corpus reader.

Value:

('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')

Class ConllCorpusReader

__init__(self, root, files, columntypes, chunk_types=None, top_node='S', pos_in_tree=False, srl_includes_roleset=True, encoding=None, tree_class=<class 'nltk.tree.Tree'>)
(Constructor)

iob_words(self, files=None)

iob_sents(self, files=None)

read(*args, **kwargs)

chunked(*args, **kwargs)

tokenized(*args, **kwargs)

tagged(*args, **kwargs)

COLUMN_TYPES
