nltk.corpus.reader.util

Module util

Classes

Corpus View

StreamBackedCorpusView
A 'view' of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc.

ConcatenatedCorpusView
A 'view' of a corpus file that joins together one or more StreamBackedCorpusViews.

Corpus View for Pickled Sequences

PickleCorpusView
A stream-backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump).

Treebank readers

SyntaxCorpusReader
An abstract base class for reading corpora consisting of syntactically parsed text.

Functions

Corpus View

concat(docs)
Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function.

Block Readers

read_whitespace_block(stream)

read_wordpunct_block(stream)

read_line_block(stream)

read_blankline_block(stream)

read_regexp_block(stream, start_re, end_re=None)
Read a sequence of tokens from a stream, where tokens begin with lines that match start_re.

read_sexpr_block(stream, block_size=16384, comment_char=None)
Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read.

_sub_space(m)
Helper function: given a regexp match, return a string of spaces that's the same length as the matched string.

_parse_sexpr_block(block)

Finding Corpus Items

find_corpus_files(root, regexp)

_path_from(parent, child)

Paragraph structure in Treebank files

tagged_treebank_para_block_reader(stream)

Function Details

concat(docs)

Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.
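
The idea of picking a concatenation function by content type can be sketched as follows. This is a simplified stand-in, not the library's code: simple_concat is a hypothetical name, and it only covers strings, lists, and tuples, whereas the real function also has to deal with corpus views.

```python
def simple_concat(docs):
    """Join several documents of the same type into one (illustrative only)."""
    docs = list(docs)
    if not docs:
        raise ValueError("expected at least one document")
    if len(docs) == 1:
        # A single document needs no concatenation.
        return docs[0]

    types = {type(d) for d in docs}
    if types == {str}:
        # Raw text: join the strings.
        return "".join(docs)
    if types == {list}:
        # Token lists: chain the items into one list.
        return [item for d in docs for item in d]
    if types == {tuple}:
        return tuple(item for d in docs for item in d)
    raise ValueError("don't know how to concatenate types: %r" % types)
```

For example, simple_concat([["a", "b"], ["c"]]) yields one list of three tokens, which is the shape a caller would expect when asking for the words of two documents at once.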

read_regexp_block(stream, start_re, end_re=None)

Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, a token ends at the next line that matches start_re, or at EOF.
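
The behaviour described above can be approximated in a few lines. The following is a sketch written from that description, not the library source: sketch_read_regexp_block is a hypothetical name, each call returns at most one token, and dropping the line that matches end_re is a choice made here because the description does not say whether that line is kept.

```python
import io
import re


def sketch_read_regexp_block(stream, start_re, end_re=None):
    """Return a list holding the next token, or [] at end of file."""
    # Skip ahead to the first line that opens a token.
    while True:
        line = stream.readline()
        if not line:
            return []
        if re.match(start_re, line):
            break

    lines = [line]
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            # EOF closes the current token.
            return ["".join(lines)]
        if end_re is not None and re.match(end_re, line):
            # Assumption: the terminating line is consumed but not kept.
            return ["".join(lines)]
        if end_re is None and re.match(start_re, line):
            # The next token has started: rewind so the next call sees it.
            stream.seek(pos)
            return ["".join(lines)]
        lines.append(line)


stream = io.StringIO("junk\n# one\na\nb\n# two\nc\n")
first = sketch_read_regexp_block(stream, r"#")
second = sketch_read_regexp_block(stream, r"#")
```

Rewinding with seek() is what lets repeated calls walk through the file one token at a time without losing the line that starts the following token.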

read_sexpr_block(stream, block_size=16384, comment_char=None)

Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.

If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.

Parameters:

block_size - The default block size for reading. If an s-expression is longer than one block, then more than one block will be read.
comment_char - A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.)
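
Two parts of read_sexpr_block's description, the comment_char rule and the handling of an unterminated s-expression at end of file, can be illustrated on a plain string. The sketch below makes several assumptions: blank_out_comments and split_sexprs are hypothetical helpers, comment lines are overwritten with spaces of the same length (in the spirit of _sub_space, presumably so that character offsets stay valid), and only parenthesized expressions are recognized.

```python
import re


def blank_out_comments(block, comment_char):
    """Overwrite lines that begin with comment_char with spaces."""
    # Only lines whose very first character is comment_char match, so an
    # indented comment line is left untouched.
    pattern = re.compile(r"(?m)^%s.*$" % re.escape(comment_char))
    return pattern.sub(lambda m: " " * len(m.group()), block)


def split_sexprs(text):
    """Split text into top-level parenthesized s-expressions.

    An unterminated final s-expression is returned as-is.
    """
    tokens = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                tokens.append(text[start:i + 1])
                start = None
    if start is not None:
        # The text ended inside an s-expression: keep the partial one.
        tokens.append(text[start:].rstrip())
    return tokens


block = "# note\n(a (b c))\n  # kept\n(d"
cleaned = blank_out_comments(block, "#")
sexprs = split_sexprs(cleaned)
```

Here sexprs holds the complete expression "(a (b c))" followed by the incomplete "(d", and cleaned has the same length as block because the comment line was replaced rather than deleted.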