Class XMLCorpusView

               object --+        
                        |        
util.AbstractLazySequence --+    
                            |    
  util.StreamBackedCorpusView --+
                                |
                               XMLCorpusView

Known Subclasses:

bnc.BNCWordView

A corpus view that selects specified elements from an XML file and provides a flat, list-like interface for accessing them. (Note: XMLCorpusView is not used by XMLCorpusReader itself, but may be used by subclasses of XMLCorpusReader.)

Every XML corpus view has a tag specification, indicating which XML elements should be included in the view; each (non-nested) element that matches this specification corresponds to one item in the view. Tag specifications are regular expressions over tag paths, where a tag path is a list of element tag names, separated by '/', indicating the ancestry of the element. Some examples:

'foo': A top-level element whose tag is foo.
'foo/bar': An element whose tag is bar and whose parent is a top-level element whose tag is foo.
'.*/foo': An element whose tag is foo, appearing anywhere in the xml tree.
'.*/(foo|bar)': An element whose tag is foo or bar, appearing anywhere in the xml tree.
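
The matching rule behind these examples can be tried out with plain `re`. This is a minimal sketch, not the NLTK implementation: `tag_path_matches` is a hypothetical helper, and it assumes the specification has to match the entire tag path rather than a part of it.

```python
import re

def tag_path_matches(tagspec, tag_path):
    """Return True if the whole tag path matches the tag specification.

    Hypothetical helper; assumes the specification is anchored to the
    full path rather than searched for inside it.
    """
    return re.fullmatch(tagspec, tag_path) is not None

print(tag_path_matches('foo', 'foo'))               # top-level foo
print(tag_path_matches('foo/bar', 'foo/bar'))       # bar under a top-level foo
print(tag_path_matches('.*/foo', 'doc/body/foo'))   # foo somewhere below the root
print(tag_path_matches('.*/(foo|bar)', 'doc/bar'))  # foo or bar below the root
print(tag_path_matches('foo', 'doc/foo'))           # foo is not top-level here
```

Under this assumption a path without any '/' (the root element on its own) does not match '.*/foo', because the pattern requires at least one ancestor.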

The view items are generated from the selected XML elements via the method handle_elt(). By default, this method returns the element as-is (i.e., as an ElementTree object); but it can be overridden, either via subclassing or via the elt_handler constructor parameter.
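
As an illustration of the elt_handler route, the sketch below defines a handler and applies it to a single element by hand. The handler name `word_and_pos`, the `<w>` element, and its `pos` attribute are assumptions made for the example; only the two-argument call shape (element, context) is taken from handle_elt().

```python
from xml.etree import ElementTree

def word_and_pos(elt, context):
    """Hypothetical handler: reduce a <w> element to a (text, pos) pair."""
    return (elt.text, elt.get('pos'))

# Apply the handler by hand to one element, the way a view would
# apply it to every selected element.
elt = ElementTree.fromstring('<w pos="NN">dog</w>')
item = word_and_pos(elt, 'doc/s/w')
print(item)
```

With NLTK available, such a function would be passed as the elt_handler argument of the constructor, so that the view yields pairs instead of raw elements.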

Instance Methods

__init__(self, filename, tagspec, elt_handler=None)
Create a new corpus view based on a specified XML file.

_detect_encoding(self, filename)

handle_elt(self, elt, context)
Convert an element into an appropriate value for inclusion in the view.

_read_xml_fragment(self, stream)
Read a string from the given stream that does not contain any un-closed tags.

read_block(self, stream, tagspec=None, elt_handler=None)
Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found. Return type: list of any.

Inherited from util.StreamBackedCorpusView: __add__, __getitem__, __len__, __mul__, __radd__, __rmul__, close, iterate_from

Inherited from util.StreamBackedCorpusView (private): _open

Inherited from util.AbstractLazySequence: __cmp__, __contains__, __hash__, __iter__, __repr__, count, index

Inherited from object: __delattr__, __getattribute__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

Class Variables

_DEBUG = False
If true, then display debugging output to stdout when reading blocks.

_BLOCK_SIZE = 1024
The number of characters read at a time by this corpus reader.

_VALID_XML_RE = re.compile(r'(?sx)[^<]*(((<!--.*?-->)|(<![CDAT...
A regular expression that matches XML fragments that do not contain any un-closed tags.

_XML_TAG_NAME = re.compile(r'<\s*/?\s*([^\s>]+)')
A regular expression used to extract the tag name from a start tag, end tag, or empty-elt tag string.

_XML_PIECE = re.compile(r'(?sx)(?P<COMMENT><!--.*?-->)|(?P<CDA...
A regular expression used to find all start-tags, end-tags, and empty-elt tags in an XML file.

Inherited from util.AbstractLazySequence (private): _MAX_REPR_SIZE
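
The _XML_TAG_NAME pattern listed above is short enough to try directly. The following sketch copies that pattern as shown and wraps it in a hypothetical helper, `tag_name`; the sample tag strings are made up for the example.

```python
import re

# Pattern copied from the _XML_TAG_NAME class variable listed above.
XML_TAG_NAME = re.compile(r'<\s*/?\s*([^\s>]+)')

def tag_name(tag_string):
    """Return the tag name captured by group 1 (hypothetical helper)."""
    return XML_TAG_NAME.match(tag_string).group(1)

# A start tag, an end tag, and an empty-element tag with extra spaces.
for tag in ['<w pos="NN">', '</w>', '< br />']:
    print(tag, '->', tag_name(tag))
```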

Instance Variables

_tagspec
The tag specification for this corpus view.

_tag_context
A dictionary mapping from file positions (as returned by stream.seek()) to XML contexts.

Inherited from util.StreamBackedCorpusView (private): _block_reader, _cache, _current_blocknum, _current_toknum, _eofpos, _filepos, _len, _stream, _toknum

Properties

Inherited from util.StreamBackedCorpusView: filename

Inherited from object: __class__

Method Details

__init__(self, filename, tagspec, elt_handler=None)
(Constructor)

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the character encoding is specified by the XML files themselves.

Parameters:

tagspec (str) - A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler - A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:
```
   elt_handler(elt, context) -> value
```

Overrides: util.StreamBackedCorpusView.__init__

handle_elt(self, elt, context)

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Parameters:

elt (ElementTree) - The element that should be converted.
context (str) - A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.

Returns:

The view value corresponding to elt.
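
Because context is a plain slash-separated string, a handler can recover the ancestry of an element by splitting it. A minimal sketch, assuming nothing beyond the format described above (`describe_context` is a hypothetical helper):

```python
def describe_context(context):
    """Split a context string such as 'foo/bar/baz' into its parts.

    Hypothetical helper; relies only on the slash-separated format
    described for the context parameter.
    """
    tags = context.split('/')
    return {
        'tag': tags[-1],                               # the element itself
        'parent': tags[-2] if len(tags) > 1 else None, # None for a top-level element
        'depth': len(tags),
    }

info = describe_context('foo/bar/baz')
print(info['tag'], info['parent'], info['depth'])
```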

_read_xml_fragment(self, stream)

Read a string from the given stream that does not contain any un-closed tags. In particular, this function first reads a block from the stream of size self._BLOCK_SIZE. It then checks if that block contains an un-closed tag. If it does, then this function either backtracks to the last '<', or reads another block.
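
The read, check, then backtrack-or-continue loop described above can be sketched as follows. This is a simplified illustration, not the NLTK source: `VALID_FRAGMENT` is a reduced stand-in for _VALID_XML_RE (text, comments, and complete tags only), `read_xml_fragment` and its `block_size` parameter are hypothetical, and the stream is assumed to be an `io.StringIO`, whose positions are character offsets.

```python
import io
import re

# Reduced stand-in for _VALID_XML_RE: text, comments, and complete tags only.
VALID_FRAGMENT = re.compile(r'[^<]*((<!--.*?-->|<[^!>][^>]*>)[^<]*)*\Z', re.DOTALL)

def read_xml_fragment(stream, block_size=16):
    """Read text from stream that does not end inside an un-closed tag."""
    fragment = ''
    start = stream.tell()
    while True:
        block = stream.read(block_size)
        fragment += block
        # Nothing is left half-open: return what we have.
        if VALID_FRAGMENT.match(fragment):
            return fragment
        if not block:
            raise ValueError('unexpected end of input: tag not closed')
        # We stopped inside a tag. If everything before the last '<' is
        # complete, rewind the stream to that '<' and return the prefix.
        last_open = fragment.rfind('<')
        if last_open > 0 and VALID_FRAGMENT.match(fragment[:last_open]):
            stream.seek(start + last_open)
            return fragment[:last_open]
        # Otherwise loop and read another block.

text = '<doc><w pos="NN">dog</w><w pos="VB">runs</w></doc>'
stream = io.StringIO(text)
pieces = []
while True:
    piece = read_xml_fragment(stream)
    if not piece:
        break
    pieces.append(piece)
print(pieces)
```

Every returned piece ends on a tag boundary or in plain text, so the pieces can be scanned for tags one at a time and still concatenate back to the original input.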

read_block(self, stream, tagspec=None, elt_handler=None)

Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.

Parameters:

stream - an input stream

Returns: list of any

a block of tokens from the input stream

Overrides: util.StreamBackedCorpusView.read_block
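
To make the selection step concrete, here is a simplified sketch of how matching, non-nested elements could be collected while tracking the open tags. It is an assumption-laden illustration rather than the NLTK code: it works on a whole string instead of a stream, `TAG` is a reduced stand-in for _XML_PIECE that ignores comments, CDATA, processing instructions, and DOCTYPE declarations, and `select_elements` is a hypothetical name.

```python
import re
from xml.etree import ElementTree

# Reduced stand-in for _XML_PIECE: start, end, and empty-element tags only.
TAG = re.compile(r'<\s*(?P<close>/)?\s*(?P<name>[^\s>/]+)[^>]*?(?P<empty>/)?\s*>')

def select_elements(xml_text, tagspec, elt_handler=lambda elt, context: elt):
    """Apply elt_handler to each non-nested element whose path matches tagspec."""
    spec = re.compile(tagspec)
    context = []            # names of the tags that are currently open
    start = depth = None    # where the element being collected began
    items = []
    for m in TAG.finditer(xml_text):
        name = m.group('name')
        if m.group('close'):
            # End tag: finish the element if it closes the one we started.
            if start is not None and len(context) == depth:
                elt = ElementTree.fromstring(xml_text[start:m.end()])
                items.append(elt_handler(elt, '/'.join(context)))
                start = depth = None
            context.pop()
        elif m.group('empty'):
            # Empty-element tag: it is a complete element on its own.
            path = '/'.join(context + [name])
            if start is None and spec.fullmatch(path):
                items.append(elt_handler(ElementTree.fromstring(m.group()), path))
        else:
            # Start tag: begin collecting if the path matches and we are
            # not already inside a selected element.
            context.append(name)
            if start is None and spec.fullmatch('/'.join(context)):
                start, depth = m.start(), len(context)
    return items

xml = '<doc><s><w>the</w><w>dog</w></s><s><w>barks</w></s></doc>'
words = select_elements(xml, '.*/w', lambda elt, context: elt.text)
sentences = select_elements(xml, '.*/s', lambda elt, context: [w.text for w in elt])
print(words)
print(sentences)
```

Selecting '.*/s' returns one item per sentence; the `<w>` elements inside it are not selected separately, which mirrors the "non-nested" wording of the tag specification.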

Class Variable Details

_VALID_XML_RE

A regular expression that matches XML fragments that do not contain any un-closed tags.

Value:

re.compile(r'(?sx)[^<]*(((<!--.*?-->)|(<![CDATA\[\.\*\?]\])|(<!DOCTYPE\s+[^\[]*(\[[^\]]*\])?\s*>)|(<[^>]*>))[^<]*)*\Z')

_XML_PIECE

A regular expression used to find all start-tags, end-tags, and empty-elt tags in an XML file. This regexp is more lenient than the XML spec; for example, it allows spaces in some places where the spec does not.

Value:

re.compile(r'(?sx)(?P<COMMENT><!--.*?-->)|(?P<CDATA><![CDATA\[\.\*\?]\]>)|(?P<PI><\?.*?\?>)|(?P<DOCTYPE><!DOCTYPE\s+[^\[]*(\[[^\]]*\])?\s*>)|(?P<EMPTY_ELT_TAG><\s*[^>/\?!\s][^>]*/\s*>)|(?P<START_TAG><\s*[^>/\?!\s][^>]*>)|(?P<END_TAG><\s*/[^>/\?!\s][^>]*>)')

Instance Variable Details

_tag_context

A dictionary mapping from file positions (as returned by stream.seek()) to XML contexts. An XML context is a tuple of XML tag names, indicating which tags have not yet been closed.
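
A small sketch of what such a mapping might look like while a file is read fragment by fragment. The helper `context_after`, the simplified `TAG` pattern, and the use of character offsets as file positions are assumptions made for the example; the real bookkeeping lives inside the class.

```python
import re

# Simplified tag pattern: groups are (end-tag slash, name, empty-element slash).
TAG = re.compile(r'<\s*(/)?\s*([^\s>/]+)[^>]*?(/)?\s*>')

def context_after(context, fragment):
    """Return the tuple of still-open tags after scanning fragment."""
    stack = list(context)
    for close, name, empty in TAG.findall(fragment):
        if close:
            stack.pop()          # an end tag closes the innermost open tag
        elif not empty:
            stack.append(name)   # a start tag opens a new one
    return tuple(stack)

# Record the context at the position reached after each fragment.
tag_context = {0: ()}
pos = 0
for fragment in ['<doc><s>', '<w>dog</w></s>', '<s>']:
    context = context_after(tag_context[pos], fragment)
    pos += len(fragment)
    tag_context[pos] = context
print(tag_context)
```

Storing the context per position means reading can resume at any recorded offset without re-scanning the file from the beginning to find out which tags are open.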
