Class XMLCorpusView
object
  util.AbstractLazySequence
    util.StreamBackedCorpusView
      XMLCorpusView
- Known Subclasses:
-
A corpus view that selects out specified elements from an XML file,
and provides a flat list-like interface for accessing them. (Note:
XMLCorpusView is not used by XMLCorpusReader itself, but may be used by subclasses of
XMLCorpusReader.)
Every XML corpus view has a tag specification, indicating what XML elements
should be included in the view; and each (non-nested) element that
matches this specification corresponds to one item in the view. Tag
specifications are regular expressions over tag paths, where a tag path
is a list of element tag names, separated by '/', indicating the ancestry
of the element. Some examples:
- 'foo': A top-level element whose tag is foo.
- 'foo/bar': An element whose tag is bar and whose parent is a
  top-level element whose tag is foo.
- '.*/foo': An element whose tag is foo, appearing anywhere in the
  XML tree.
- '.*/(foo|bar)': An element whose tag is foo or bar, appearing
  anywhere in the XML tree.
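The way a tag specification selects elements can be sketched with plain regular-expression matching against tag paths. This is only an illustration: the helper name matches_tagspec and the sample tag paths are hypothetical, and requiring the whole path to match is an assumption about how the specifications above are meant to be read, not a copy of the library's code.

```python
import re


def matches_tagspec(tagspec, tag_path):
    """Return True if the whole tag path matches the tag specification.

    Hypothetical helper for illustration; not part of XMLCorpusView.
    """
    return re.fullmatch(tagspec, tag_path) is not None


# (tag specification, tag path of some element)
examples = [
    ('foo', 'foo'),                  # top-level foo
    ('foo/bar', 'foo/bar'),          # bar directly under a top-level foo
    ('.*/foo', 'corpus/doc/foo'),    # foo below some ancestor
    ('.*/(foo|bar)', 'corpus/bar'),  # foo or bar below some ancestor
    ('foo', 'corpus/foo'),           # nested foo is not a top-level foo
]

for tagspec, tag_path in examples:
    print(tagspec, tag_path, matches_tagspec(tagspec, tag_path))
```

Under this reading, a specification such as '.*/foo' requires at least one ancestor, because the pattern contains a literal '/'.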
The view items are generated from the selected XML elements via the
method handle_elt(). By default, this method returns the
element as-is (i.e., as an ElementTree object); but it can be overridden,
either via subclassing or via the elt_handler constructor
parameter.

__init__(self, filename, tagspec, elt_handler=None)
Create a new corpus view based on a specified XML file.

handle_elt(self, elt, context)
Convert an element into an appropriate value for inclusion in the
view.

read_block(self, stream, tagspec=None, elt_handler=None)
Read from stream until we find at least one element that
matches tagspec, and return the result of applying
elt_handler to each element found.
Returns: list of any

Inherited from util.StreamBackedCorpusView:
__add__, __getitem__, __len__, __mul__, __radd__, __rmul__, close, iterate_from
Inherited from util.AbstractLazySequence:
__cmp__, __contains__, __hash__, __iter__, __repr__, count, index
Inherited from object:
__delattr__, __getattribute__, __new__, __reduce__, __reduce_ex__, __setattr__, __str__

_DEBUG = False
If true, then display debugging output to stdout when reading blocks.

_BLOCK_SIZE = 1024
The number of characters read at a time by this corpus reader.

_VALID_XML_RE = re.compile(r'(?sx)[^<]*(((<!--.*?-->)|(<![CDAT...
A regular expression that matches XML fragments that do not contain
any un-closed tags.

_XML_TAG_NAME = re.compile(r'<\s*/?\s*([^\s>]+)')
A regular expression used to extract the tag name from a start tag,
end tag, or empty-elt tag string.

_XML_PIECE = re.compile(r'(?sx)(?P<COMMENT><!--.*?-->)|(?P<CDA...
A regular expression used to find all start-tags, end-tags, and
empty-elt tags in an XML file.

_tagspec
The tag specification for this corpus view.

_tag_context
A dictionary mapping from file positions (as returned by
stream.tell()) to XML contexts.

__init__(self, filename, tagspec, elt_handler=None)
(Constructor)
Create a new corpus view based on a specified XML file.
Note that the XMLCorpusView constructor does not take an
encoding argument, because the unicode encoding is specified
by the XML files themselves.
- Parameters:
- Overrides:
util.StreamBackedCorpusView.__init__
|
handle_elt(self, elt, context)
Convert an element into an appropriate value for inclusion in the
view. Unless overridden by a subclass or by the elt_handler
constructor argument, this method simply returns elt.
- Parameters:
elt (ElementTree) - The element that should be converted.
context (str) - A string composed of element tags separated by forward slashes,
indicating the XML context of the given element. For example,
the string 'foo/bar/baz' indicates that the element
is a baz element whose parent is a bar
element and whose grandparent is a top-level foo
element.
- Returns:
- The view value corresponding to
elt.
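As a rough illustration of the two parameters, here is a handler with the same (elt, context) shape. The function name parent_and_text and the sample element are hypothetical, and the element is built with the standard library's xml.etree.ElementTree on the assumption that view elements expose the usual .text attribute.

```python
import xml.etree.ElementTree as ET


def parent_and_text(elt, context):
    """Hypothetical handler: return (parent tag, stripped element text).

    `context` is a slash-separated tag path such as 'foo/bar/baz'.
    """
    tags = context.split('/')
    # A top-level element has no parent tag in its context.
    parent = tags[-2] if len(tags) > 1 else None
    return (parent, (elt.text or '').strip())


# Stand-in for an element that the view would hand to the handler.
elt = ET.fromstring('<baz> hello </baz>')
result = parent_and_text(elt, 'foo/bar/baz')
print(result)
```

A function of this shape could be supplied through the elt_handler constructor parameter, or the same logic could live in an overriding handle_elt method of a subclass.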
|
|
Read a string from the given stream that does not contain any
un-closed tags. In particular, this function first reads a block from
the stream of size self._BLOCK_SIZE. It then
checks if that block contains an un-closed tag. If it does, then this
function either backtracks to the last '<', or reads another
block.
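The read-then-backtrack strategy described above can be sketched as follows. This is a simplified illustration, not the library's code: the names read_fragment and NO_OPEN_TAG are hypothetical, the pattern is a cut-down stand-in for _VALID_XML_RE that ignores comments, CDATA sections and DOCTYPE declarations, and an in-memory io.StringIO is used so that tell() and seek() work on plain character offsets.

```python
import io
import re

# Simplified stand-in for _VALID_XML_RE: every '<' is closed by a '>'.
NO_OPEN_TAG = re.compile(r'[^<]*(?:<[^>]*>[^<]*)*\Z')


def read_fragment(stream, block_size=1024):
    """Read a string from `stream` that does not end inside a tag."""
    fragment = ''
    while True:
        block = stream.read(block_size)
        fragment += block
        # Done if no tag is left open at the end of the fragment.
        if NO_OPEN_TAG.match(fragment):
            return fragment
        # Otherwise try to backtrack to the last '<' and return the
        # text before it, rewinding the stream by the unread part.
        last_open = fragment.rfind('<')
        if last_open > 0 and NO_OPEN_TAG.match(fragment[:last_open]):
            stream.seek(stream.tell() - (len(fragment) - last_open))
            return fragment[:last_open]
        # Backtracking is not possible, so read another block.
        if not block:
            raise ValueError('unexpected end of input inside a tag')


source = '<a><bb>text</bb><c attr="1"/></a>'
stream = io.StringIO(source)
pieces = []
while True:
    piece = read_fragment(stream, block_size=10)
    if not piece:
        break
    pieces.append(piece)
print(pieces)
```

A small block size is used here only to force both branches (backtracking and reading a further block) on a short input.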
|
read_block(self, stream, tagspec=None, elt_handler=None)
Read from stream until we find at least one element that
matches tagspec, and return the result of applying
elt_handler to each element found.
- Parameters:
- Returns: list of any
- a block of tokens from the input stream
- Overrides:
util.StreamBackedCorpusView.read_block
|
_VALID_XML_RE
A regular expression that matches XML fragments that do not contain
any un-closed tags.
- Value:
re.compile(r'(?sx)[^<]*(((<!--.*?-->)|(<![CDATA\[\.\*\?]\])|(<!DOCTYPE
\s+[^\[]*(\[[^\]]*\])?\s*>)|(<[^>]*>))[^<]*)*\Z')
|
|
_tag_context
A dictionary mapping from file positions (as returned by
stream.tell()) to XML contexts. An XML context is a tuple of
XML tag names, indicating which tags have not yet been closed.
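To make the idea of an XML context concrete, the sketch below records the tuple of still-open tag names after each tag in a small string. It is an illustration under simplifying assumptions: the function name contexts_by_position and the TAG pattern are hypothetical, character offsets in a string stand in for file positions, and comments, CDATA sections and processing instructions are not handled. The tag-name pattern is the _XML_TAG_NAME value listed above.

```python
import re

# Value of _XML_TAG_NAME as listed above.
XML_TAG_NAME = re.compile(r'<\s*/?\s*([^\s>]+)')
# Simplified tag matcher (hypothetical); ignores comments, CDATA and PIs.
TAG = re.compile(r'<[^>]*>')


def contexts_by_position(xml_text):
    """Map the offset just after each tag to the tuple of open tag names."""
    context = []
    positions = {0: ()}
    for match in TAG.finditer(xml_text):
        tag = match.group()
        if tag.startswith('</'):
            # End tag: the innermost open element is now closed.
            context.pop()
        elif not tag.endswith('/>'):
            # Start tag: a new element is open from here on.
            context.append(XML_TAG_NAME.match(tag).group(1))
        # Empty-element tags leave the context unchanged.
        positions[match.end()] = tuple(context)
    return positions


positions = contexts_by_position('<a><b>x</b><c/></a>')
for offset in sorted(positions):
    print(offset, positions[offset])
```

Storing the context per position in this way would let a reader resume from a saved position without re-scanning the start of the file to find out which elements enclose it.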