Module nltk.corpus.reader.util

Classes
    Corpus View
  StreamBackedCorpusView
A 'view' of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc.
  ConcatenatedCorpusView
A 'view' of a corpus file that joins together one or more StreamBackedCorpusViews.
    Corpus View for Pickled Sequences
  PickleCorpusView
A stream-backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump).
    Treebank readers
  SyntaxCorpusReader
An abstract base class for reading corpora consisting of syntactically parsed text.
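The file layout that PickleCorpusView is described as reading (a sequence of objects, each written with pickle.dump) can be illustrated with the standard library alone. The sketch below is not NLTK code: the helper names write_pickled_sequence and read_pickled_sequence are hypothetical, and an in-memory buffer stands in for a corpus file.

```python
import io
import pickle


def write_pickled_sequence(stream, items):
    """Write each item to the stream as its own pickle record (hypothetical helper)."""
    for item in items:
        pickle.dump(item, stream)


def read_pickled_sequence(stream):
    """Yield objects from the stream until no pickle records remain (hypothetical helper)."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            # pickle.load raises EOFError once the stream is exhausted.
            return


# An in-memory buffer stands in for a corpus file on disk.
buffer = io.BytesIO()
write_pickled_sequence(buffer, [("the", "DT"), ("cat", "NN"), ["a", "list"]])
buffer.seek(0)
tokens = list(read_pickled_sequence(buffer))
```

Because each object is a separate pickle record, a reader can stop after any record and resume later from the same file position, which is what makes this layout suitable for a stream-backed view.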
Functions
    Corpus View
  concat(docs)
Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function.
    Block Readers
  read_whitespace_block(stream)
  read_wordpunct_block(stream)
  read_line_block(stream)
  read_blankline_block(stream)
  read_regexp_block(stream, start_re, end_re=None)
Read a sequence of tokens from a stream, where tokens begin with lines that match start_re.
  read_sexpr_block(stream, block_size=16384, comment_char=None)
Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read.
  _sub_space(m)
Helper function: given a regexp match, return a string of spaces that's the same length as the matched string.
  _parse_sexpr_block(block)
    Finding Corpus Items
  find_corpus_files(root, regexp)
  _path_from(parent, child)
    Paragraph structure in Treebank files
  tagged_treebank_para_block_reader(stream)
Function Details

concat(docs)

Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.
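What "an appropriate concatenation function" could mean in practice is sketched below: the joining strategy is chosen from the type of the documents. This is an illustration, not the library's implementation. The name concat_sketch is hypothetical, and the set of supported types (str, list, tuple) is an assumption; the real function presumably also handles corpus views.

```python
from functools import reduce
import operator


def concat_sketch(docs):
    """Join several documents of the same type into one (illustrative only)."""
    docs = list(docs)
    if not docs:
        raise ValueError("concat_sketch() expects at least one document")
    if len(docs) == 1:
        # Nothing to join: return the single document unchanged.
        return docs[0]

    kinds = {type(doc) for doc in docs}
    if kinds == {str}:
        # Raw text: plain string concatenation.
        return "".join(docs)
    if kinds == {list} or kinds == {tuple}:
        # Token sequences: chain them with the + operator.
        return reduce(operator.add, docs)

    names = sorted(kind.__name__ for kind in kinds)
    raise ValueError("don't know how to concatenate types: %r" % names)


words = concat_sketch([["The", "cat"], ["sat", "."]])
raw = concat_sketch(["The cat ", "sat."])
```

Dispatching on type keeps the caller's code uniform: a reader can ask for several documents and get back the same kind of object it would get for one.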

read_regexp_block(stream, start_re, end_re=None)

Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching start_re or EOF is found.
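The start/end rules described above can be sketched as a small standalone function. This is a simplified illustration under stated assumptions, not the library source: the name regexp_block_sketch is hypothetical, one token is returned per call, and the line matching end_re is consumed but not included in the token (the description does not say which way this goes).

```python
import io
import re


def regexp_block_sketch(stream, start_re, end_re=None):
    """Return a list holding the next token in the stream, or [] at EOF."""
    # Skip forward to the first line that starts a token.
    while True:
        line = stream.readline()
        if not line:
            return []  # EOF before any token started
        if re.match(start_re, line):
            break

    lines = [line]
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break  # EOF ends the token
        if end_re is not None:
            if re.match(end_re, line):
                break  # end line is consumed but not kept (assumption)
        elif re.match(start_re, line):
            # The next token has begun: rewind so the next call sees it.
            stream.seek(pos)
            break
        lines.append(line)
    return ["".join(lines)]


stream = io.StringIO(">one\nalpha\nbeta\n>two\ngamma\n")
first = regexp_block_sketch(stream, r">")
second = regexp_block_sketch(stream, r">")
third = regexp_block_sketch(stream, r">")
```

Here first holds the lines from ">one" up to, but not including, ">two"; second holds the rest of the file; third is empty because the stream is exhausted.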

read_sexpr_block(stream, block_size=16384, comment_char=None)

Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.

If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.
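The distinction between complete and incomplete s-expressions can be illustrated by tracking parenthesis depth over a block of text. The sketch below is an assumption-laden simplification, not the library's parser: split_sexprs_sketch is a hypothetical name, and it recognizes only parenthesized expressions, ignoring bare tokens and comments.

```python
def split_sexprs_sketch(block):
    """Return (complete_sexprs, offset_just_past_the_last_complete_one)."""
    tokens = []
    depth = 0
    start = None
    end = 0
    for i, ch in enumerate(block):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level expression begins here
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at top level: the expression is complete.
                tokens.append(block[start:i + 1])
                end = i + 1
    return tokens, end


block = "(S (NP I) (VP ran)) (S (NP you"
tokens, end = split_sexprs_sketch(block)
leftover = block[end:].strip()
```

A caller following the behavior described above would reposition the stream just past offset end, so the unfinished text in leftover is re-read with the next block, and would return leftover as a final token only once the end of the file is reached.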

Parameters:
  • block_size - The default block size for reading. If an s-expression is longer than one block, then more than one block will be read.
  • comment_char - A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.)
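The comment_char rule can be sketched with a multiline regular expression. One assumption here: instead of deleting comment lines, the sketch overwrites them with spaces of equal length so that character offsets into the block stay valid. The description of _sub_space in this module suggests such an approach, but the sketch is not taken from the library, and blank_comment_lines is a hypothetical name.

```python
import re


def blank_comment_lines(block, comment_char):
    """Overwrite lines that begin with comment_char with spaces of equal length."""
    # ^ with re.MULTILINE anchors at the start of every line, so a comment
    # character preceded by spaces or tabs does not match.
    pattern = re.compile(r"^%s.*" % re.escape(comment_char), re.MULTILINE)
    return pattern.sub(lambda m: " " * len(m.group()), block)


block = "; header\n(a b)\n  ; kept\n(c)"
cleaned = blank_comment_lines(block, ";")
lines = cleaned.split("\n")
```

In this example the first line is blanked, while the indented "; kept" line survives because whitespace precedes the comment character.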