Package nltk :: Module data :: Class SeekableUnicodeStreamReader

Class SeekableUnicodeStreamReader

object --+
         |
        SeekableUnicodeStreamReader

A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides *broken* seek() and tell() methods.

This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.

Note: this class requires stateless decoders. To my knowledge, this shouldn't cause a problem with any of Python's builtin unicode encodings.
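
The contrast with codecs.StreamReader can be illustrated with the standard library alone. The snippet below is a minimal sketch, not taken from the NLTK source: it reads one line through a codecs.StreamReader and compares the position reported by tell() with the number of bytes that the returned line actually occupies. The sample data and the variable names are assumptions chosen for illustration.

```python
import codecs
import io

# Two short UTF-8 lines, 18 bytes in total (illustrative data).
data = "line one\nline two\n".encode("utf-8")
reader = codecs.getreader("utf-8")(io.BytesIO(data))

first = reader.readline()

# codecs.StreamReader forwards tell() to the underlying byte stream.
# readline() reads ahead in chunks, so the byte stream has already moved
# past the line that was returned.
reported = reader.tell()

# The byte offset at which the second line really starts.
expected = len(first.encode("utf-8"))

print(repr(first), reported, expected)
```

Because the reported position lies beyond the start of the second line, seeking back to it later would skip data. This is the kind of mismatch that a reader with buffer-aware seek() and tell() is meant to avoid.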

Instance Methods
 
__init__(self, stream, encoding, errors='strict')
x.__init__(...) initializes x; see x.__class__.__doc__ for signature
unicode
read(self, size=None)
Read up to size bytes, decode them using this reader's encoding, and return the resulting unicode string.
 
readline(self, size=None)
Read a line of text, decode it using this reader's encoding, and return the resulting unicode string.
list of unicode
readlines(self, sizehint=None, keepends=True)
Read this file's contents, decode them using this reader's encoding, and return it as a list of unicode lines.
 
next(self)
Return the next decoded line from the underlying stream.
 
__iter__(self)
Return self
 
xreadlines(self)
Return self
 
close(self)
Close the underlying stream.
 
seek(self, offset, whence=0)
Move the stream to a new file position.
 
char_seek_forward(self, offset)
Move the read pointer forward by offset characters.
 
_char_seek_forward(self, offset, est_bytes=None)
Move the file position forward by offset characters, ignoring all buffers.
 
tell(self)
Return the current file position on the underlying byte stream.
 
_read(self, size=None)
Read up to size bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string.
 
_incr_decode(self, bytes)
Decode the given byte string into a unicode string, using this reader's encoding.
 
_check_bom(self)

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Variables
  DEBUG = True
If true, then perform extra sanity checks.
  _BOM_TABLE = {'utf16': [('\xff\xfe', 'utf16-le'), ('\xfe\xff',...
Instance Variables
  stream
The underlying stream.
  encoding
The name of the encoding that should be used to decode the underlying stream.
  errors
The error mode that should be used when decoding data from the underlying stream.
  decode
The function that is used to decode byte strings into unicode strings.
  bytebuffer
A buffer to hold bytes that have been read but have not yet been decoded.
  linebuffer
A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline().
  _rewind_checkpoint
The file position at which the most recent read on the underlying stream began.
  _rewind_numchars
The number of characters that have been returned since the read that started at _rewind_checkpoint.
  _bom
The length of the byte order marker at the beginning of the stream (or None for no byte order marker).
Properties
  closed
True if the underlying stream is closed.
  name
The name of the underlying stream.
  mode
The mode of the underlying stream.

Inherited from object: __class__

Method Details

__init__(self, stream, encoding, errors='strict')
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Overrides: object.__init__
(inherited documentation)

read(self, size=None)

Read up to size bytes, decode them using this reader's encoding, and return the resulting unicode string.

Parameters:
  • size - The maximum number of bytes to read. If not specified, then read as many bytes as possible.
Returns: unicode

readline(self, size=None)

Read a line of text, decode it using this reader's encoding, and return the resulting unicode string.

Parameters:
  • size - The maximum number of bytes to read. If no newline is encountered before size bytes have been read, then the returned value may not be a complete line of text.

readlines(self, sizehint=None, keepends=True)

Read this file's contents, decode them using this reader's encoding, and return it as a list of unicode lines.

Parameters:
  • sizehint - Ignored.
  • keepends - If false, then strip newlines.
Returns: list of unicode

seek(self, offset, whence=0)

Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.

Parameters:
  • offset - A byte count offset.
  • whence - If whence is 0, then the offset is from the start of the file (offset should be positive). If whence is 1, then the offset is from the current position (offset may be positive or negative); and if 2, then the offset is from the end of the file (offset should typically be negative).
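
The three whence modes follow the usual file-object convention, and the offsets are counted in bytes. As a minimal sketch, the snippet below applies the same calls to a plain io.BytesIO object rather than to this class; the sample bytes and variable names are assumptions made for illustration.

```python
import io

stream = io.BytesIO(b"0123456789")   # ten bytes of illustrative data

stream.seek(4, 0)                    # whence=0: absolute, from the start
from_start = stream.tell()

stream.seek(-2, 1)                   # whence=1: relative to the current position
from_current = stream.tell()

stream.seek(-3, 2)                   # whence=2: relative to the end of the file
from_end = stream.tell()

print(from_start, from_current, from_end)
```

With a multi-byte encoding, a byte offset can land in the middle of a character, so offsets passed to seek() are safest when they come from an earlier tell() call.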

_char_seek_forward(self, offset, est_bytes=None)

Move the file position forward by offset characters, ignoring all buffers.

Parameters:
  • est_bytes - A hint, giving an estimate of the number of bytes that will be needed to move forward by offset chars. Defaults to offset.
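
One way to picture the role of est_bytes is the sketch below. It is not the NLTK implementation: char_seek_forward here is a hypothetical stand-alone helper that works on any seekable binary stream, and it assumes an encoding such as UTF-8 whose encoder does not prepend a byte order marker. It reads est_bytes bytes, decodes them, and doubles the estimate until enough characters are available.

```python
import codecs
import io

def char_seek_forward(stream, encoding, offset, est_bytes=None):
    """Advance a binary stream by `offset` characters (hypothetical helper)."""
    if est_bytes is None:
        est_bytes = offset                # default hint: one byte per character
    est_bytes = max(1, est_bytes)         # avoid a zero-sized read
    start = stream.tell()
    while True:
        stream.seek(start)
        data = stream.read(est_bytes)
        # An incremental decoder holds back a trailing partial character
        # instead of raising an error.
        decoder = codecs.getincrementaldecoder(encoding)()
        chars = decoder.decode(data, final=False)
        at_eof = len(data) < est_bytes
        if len(chars) >= offset or at_eof:
            # Re-encode the characters that were skipped to find their byte
            # length (assumes the encoder adds no byte order marker).
            consumed = len(chars[:offset].encode(encoding))
            stream.seek(start + consumed)
            return
        est_bytes *= 2                    # the hint was too small; retry

stream = io.BytesIO("héllo wörld".encode("utf-8"))
char_seek_forward(stream, "utf-8", 2)
position = stream.tell()
rest = stream.read().decode("utf-8")
print(position, rest)
```

In this example the default hint of two bytes covers only one and a half characters, because "é" takes two bytes in UTF-8, so the helper retries with a larger read before settling on byte offset 3.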

tell(self)

Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers.

_read(self, size=None)

Read up to size bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string. linebuffer is *not* included in the result.

_incr_decode(self, bytes)

Decode the given byte string into a unicode string, using this reader's encoding. If an exception is encountered that appears to be caused by a truncation error, then just decode the byte string without the bytes that cause the truncation error.

Returns:
A tuple (chars, num_consumed), where chars is the decoded unicode string, and num_consumed is the number of bytes that were consumed.
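
The truncation handling described above can be sketched with the standard codecs module. This is an illustrative stand-alone function, not the NLTK source: the name incr_decode, its default arguments, and the heuristic "the error reaches the end of the input" are assumptions. The heuristic also matches a genuinely invalid final byte, which is consistent with the wording "appears to be caused by a truncation error".

```python
import codecs

def incr_decode(data, encoding="utf-8", errors="strict"):
    """Decode `data`, tolerating a character cut off at the end (sketch)."""
    decode = codecs.getdecoder(encoding)
    try:
        return decode(data, errors)       # (chars, num_consumed)
    except UnicodeDecodeError as exc:
        # Heuristic: an error that extends to the last byte is treated as
        # a truncated character, so only the bytes before it are decoded.
        if exc.end == len(data):
            return decode(data[:exc.start], errors)
        raise

# "h€" is four bytes in UTF-8; dropping the last one truncates the "€".
data = "h€".encode("utf-8")[:-1]
chars, num_consumed = incr_decode(data)
leftover = data[num_consumed:]            # bytes to keep for the next read
print(repr(chars), num_consumed, leftover)
```

The leftover bytes are what a reader would keep in a byte buffer and prepend to the data returned by the next read of the underlying stream.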

Class Variable Details

_BOM_TABLE

Value:
{'utf16': [('\xff\xfe', 'utf16-le'), ('\xfe\xff', 'utf16-be')],
 'utf16be': [('\xfe\xff', None)],
 'utf16le': [('\xff\xfe', None)],
 'utf32': [('\xff\xfe\x00\x00', 'utf32-le'),
           ('\x00\x00\xfe\xff', 'utf32-be')],
 'utf32be': [('\x00\x00\xfe\xff', None)],
 'utf32le': [('\xff\xfe\x00\x00', None)],
 'utf8': [('\xef\xbb\xbf', None)]}
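
The table maps a normalized encoding name to a list of (byte order marker, replacement encoding) pairs. How _check_bom() uses it is not documented on this page, so the following is only a sketch of one plausible lookup: check_bom, the name normalization, and the return convention are all assumptions, and the table is rewritten with bytes literals so that it runs on Python 3.

```python
import io
import re

# Same entries as _BOM_TABLE, written as bytes literals.
BOM_TABLE = {
    "utf8": [(b"\xef\xbb\xbf", None)],
    "utf16": [(b"\xff\xfe", "utf16-le"), (b"\xfe\xff", "utf16-be")],
    "utf16le": [(b"\xff\xfe", None)],
    "utf16be": [(b"\xfe\xff", None)],
    "utf32": [(b"\xff\xfe\x00\x00", "utf32-le"),
              (b"\x00\x00\xfe\xff", "utf32-be")],
    "utf32le": [(b"\xff\xfe\x00\x00", None)],
    "utf32be": [(b"\x00\x00\xfe\xff", None)],
}

def check_bom(stream, encoding):
    """Return (bom_length or None, encoding to use); hypothetical helper."""
    # Assumed normalization: lower-case, with spaces and hyphens removed.
    key = re.sub(r"[ -]", "", encoding.lower())
    candidates = BOM_TABLE.get(key, [])
    position = stream.tell()
    head = stream.read(4)                 # the longest marker is 4 bytes
    stream.seek(position)                 # leave the stream where it was
    for bom, new_encoding in candidates:
        if head.startswith(bom):
            return len(bom), (new_encoding or encoding)
    return None, encoding

print(check_bom(io.BytesIO(b"\xef\xbb\xbfhi"), "utf-8"))
print(check_bom(io.BytesIO(b"\xff\xfeh\x00"), "UTF-16"))
```

A None in the second position of a table entry is read here as "keep the requested encoding", while a name such as 'utf16-le' replaces a generic encoding with the endianness indicated by the marker.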

Instance Variable Details

errors

The error mode that should be used when decoding data from the underlying stream. Can be 'strict', 'ignore', or 'replace'.

bytebuffer

A buffer to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.

linebuffer

A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes the tell() operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.

_rewind_checkpoint

The file position at which the most recent read on the underlying stream began. This is used, together with _rewind_numchars, to backtrack to the beginning of linebuffer (which is required by tell()).

_rewind_numchars

The number of characters that have been returned since the read that started at _rewind_checkpoint. This is used, together with _rewind_checkpoint, to backtrack to the beginning of linebuffer (which is required by tell()).


Property Details

closed

True if the underlying stream is closed.

Get Method:
unreachable(self)

name

The name of the underlying stream.

Get Method:
unreachable(self)

mode

The mode of the underlying stream.

Get Method:
unreachable(self)