Class StandardFormat (nltk.corpus.reader.toolbox)

Class for reading and processing standard format marker files and strings.

__init__(self, filename=None, encoding=None)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature
open(self, sfm_file)
Open a standard format marker file for sequential reading.
open_string(self, s)
Open a standard format marker string for sequential reading.
raw_fields(self)
Return an iterator over (marker, value) tuples for the fields in the standard format marker file, with line breaks and trailing whitespace preserved.
fields(self, strip=True, unwrap=True, encoding=None, errors='strict', unicode_fields=None)
Return an iterator over (marker, value) tuples for the fields in the standard format marker file, optionally stripped, unwrapped, and decoded.
close(self)
Close a previously opened standard format marker file or string.

__init__(self, filename=None, encoding=None)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

open(self, sfm_file)

Open a standard format marker file for sequential reading.

  • sfm_file (string) - name of the standard format marker input file

open_string(self, s)

Open a standard format marker string for sequential reading.

  • s (string) - string to parse as a standard format marker input file
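The relationship between open and open_string can be pictured with a small sketch: both leave the reader holding something that yields lines one at a time. The SequentialSource class and its lines method below are hypothetical names invented for illustration; this is not the library's implementation, only one plausible way to get sequential reading from either a file or a string.

```python
from io import StringIO


class SequentialSource:
    """Hypothetical helper (not part of NLTK): one line source, two ways to open it."""

    def __init__(self):
        self._file = None

    def open(self, sfm_file, encoding=None):
        # Read lines lazily from a file on disk.
        self._file = open(sfm_file, encoding=encoding)

    def open_string(self, s):
        # Wrap the string so it can be read line by line, like a file.
        self._file = StringIO(s)

    def lines(self):
        # Sequential access: each line is produced once, in order.
        return iter(self._file)

    def close(self):
        self._file.close()
        self._file = None
```

Under this assumption, the code that splits lines into fields never needs to know whether its input came from disk or from memory.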


raw_fields(self)

Return an iterator for the fields in the standard format marker file.

Returns: iterator over (marker, value) tuples
Each iteration yields the next field as a (marker, value) tuple. Line breaks and trailing whitespace are preserved, except for the final newline in each field.
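To make the (marker, value) contract concrete, here is a minimal standalone sketch of how such an iterator could split input into fields. It assumes the usual standard format convention that a field starts with a backslash followed by the marker name; the function name iter_raw_fields and the regular expression are illustrative assumptions, not the library's own code.

```python
import re
from io import StringIO

# Assumed convention: a field starts with a backslash, then the marker name,
# then optional whitespace, then the first line of the value.
_FIELD_START = re.compile(r"^\\(\S+)\s*(.*)$")


def iter_raw_fields(lines):
    """Yield (marker, value) tuples from an iterable of text lines.

    Hypothetical helper for illustration only.
    """
    marker, value_lines = None, []
    for line in lines:
        line = line.rstrip("\n")  # drop only the line terminator
        match = _FIELD_START.match(line)
        if match:
            # A new marker closes the field collected so far.
            if marker is not None or value_lines:
                yield marker, "\n".join(value_lines)
            marker, first_line = match.groups()
            value_lines = [first_line]
        else:
            # Continuation line: it belongs to the current field.
            # (Lines before any marker are yielded with marker None.)
            value_lines.append(line)
    if marker is not None or value_lines:
        yield marker, "\n".join(value_lines)


sample = "\\lx kaa\n\\ps N\n\\ge gag  \ncontinued\n"
fields = list(iter_raw_fields(StringIO(sample)))
# fields == [("lx", "kaa"), ("ps", "N"), ("ge", "gag  \ncontinued")]
```

Note how the last field keeps its internal line break and the trailing spaces after "gag", while the final newline of each field is dropped.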

fields(self, strip=True, unwrap=True, encoding=None, errors='strict', unicode_fields=None)

Return an iterator for the fields in the standard format marker file.

  • strip (boolean) - Strip trailing whitespace from the last line of each field.
  • unwrap (boolean) - Convert newlines in a field to spaces.
  • encoding (string or None) - Name of the encoding to use. If specified, the fields method returns unicode strings rather than non-unicode strings.
  • errors (string) - Error handling scheme for the codec, as accepted by the built-in string decode method.
  • unicode_fields (set, dictionary, or any other container that supports the 'in' operator) - Set of marker names whose values are UTF-8 encoded. Ignored if encoding is None. If the whole file is UTF-8 encoded, set encoding='utf8' and leave unicode_fields at its default value of None.
Returns: iterator over (marker, value) tuples
Each iteration yields the next field as a (marker, value) tuple. marker and value are unicode strings if an encoding was passed to the fields method; otherwise they are non-unicode strings.
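The interaction of these parameters is easier to see on a single field. The sketch below is a hypothetical standalone function, process_field, written for Python 3 where the "non-unicode" strings correspond to bytes; it follows the parameter descriptions above and makes no claim to match the library's code line for line. In particular, the order of the unwrap and strip steps and the treatment of every newline as one space are assumptions.

```python
def process_field(marker, value, strip=True, unwrap=True,
                  encoding=None, errors="strict", unicode_fields=None):
    """Post-process one raw field; marker and value are byte strings.

    Hypothetical helper for illustration only.
    """
    if unwrap:
        # Convert newlines inside the field to spaces.
        value = value.replace(b"\n", b" ")
    if strip:
        # Drop trailing whitespace at the end of the field.
        value = value.rstrip()
    if encoding is None:
        # No encoding given: return byte strings, unicode_fields is ignored.
        return marker, value
    name = marker.decode(encoding, errors)
    # Markers listed in unicode_fields hold UTF-8 values; all others use `encoding`.
    use_utf8 = unicode_fields is not None and name in unicode_fields
    return name, value.decode("utf8" if use_utf8 else encoding, errors)


# Mixed file: most fields are Latin-1, but the "ge" field is UTF-8.
gloss = process_field(b"ge", "caf\u00e9".encode("utf8"),
                      encoding="latin-1", unicode_fields={"ge"})
lexeme = process_field(b"lx", "caf\u00e9".encode("latin-1"),
                       encoding="latin-1", unicode_fields={"ge"})
```

Both calls decode to the same text, which is the situation unicode_fields is meant for: a file in a legacy encoding where only some markers carry UTF-8 values.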