Class XMLCorpusView
object
  util.AbstractLazySequence
    util.StreamBackedCorpusView
      XMLCorpusView
A corpus view that selects out specified elements from an XML file,
and provides a flat list-like interface for accessing them. (Note:
XMLCorpusView is not used by XMLCorpusReader itself, but may be used
by subclasses of XMLCorpusReader.)
Every XML corpus view has a tag specification, indicating what XML elements
should be included in the view; each (non-nested) element that
matches this specification corresponds to one item in the view. Tag
specifications are regular expressions over tag paths, where a tag path
is a list of element tag names, separated by '/', indicating the ancestry
of the element. Some examples:
- 'foo': A top-level element whose tag is foo.
- 'foo/bar': An element whose tag is bar and whose parent is a top-level
  element whose tag is foo.
- '.*/foo': An element whose tag is foo, appearing anywhere in the XML tree.
- '.*/(foo|bar)': An element whose tag is foo or bar, appearing anywhere in
  the XML tree.
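The matching rule described above can be sketched in a few lines. This is only an illustration, not the library's own code: the helper name `matches_tagspec` is hypothetical, and it assumes that the specification must match the entire tag path rather than a prefix of it.

```python
import re

def matches_tagspec(tagspec, ancestry):
    """Return True if the tag path built from `ancestry` matches `tagspec`.

    `ancestry` lists tag names from the top-level element down to the
    element itself, e.g. ['foo', 'bar'] for <foo><bar/></foo>.
    """
    # A tag path is the element's ancestry joined by '/'.
    tag_path = "/".join(ancestry)
    # Assumption: the whole path must match, not just a prefix of it.
    return re.fullmatch(tagspec, tag_path) is not None

top_level = matches_tagspec("foo", ["foo"])
child = matches_tagspec("foo/bar", ["foo", "bar"])
anywhere = matches_tagspec(".*/foo", ["a", "b", "foo"])
either = matches_tagspec(".*/(foo|bar)", ["a", "bar"])
nested_miss = matches_tagspec("foo", ["foo", "bar"])
```

Under these assumptions the first four calls succeed, while the last fails because the path 'foo/bar' is longer than the specification 'foo'.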
The view items are generated from the selected XML elements via the
method handle_elt(). By default, this method returns the
element as-is (i.e., as an ElementTree object); but it can be overridden,
either via subclassing or via the elt_handler constructor parameter.
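As a rough illustration of the elt_handler route, the sketch below defines a handler that returns an element's text instead of the element itself. It assumes the handler is called with the same (elt, context) arguments that handle_elt receives; the function name `word_handler`, the `<w>` tag, and the file name in the comment are all hypothetical.

```python
import xml.etree.ElementTree as ET

def word_handler(elt, context):
    """Return (context, stripped text) instead of the raw element."""
    # elt.text is None for an empty element, so fall back to "".
    return (context, (elt.text or "").strip())

# Stand-alone check of the handler on a hand-built element.
elt = ET.fromstring("<w> hello </w>")
item = word_handler(elt, "doc/s/w")

# With a real corpus file, the handler would be passed to the constructor,
# for example (hypothetical file name and tag specification):
#   view = XMLCorpusView('corpus.xml', '.*/w', elt_handler=word_handler)
```

Passing a function this way avoids writing a subclass when only the per-element conversion needs to change.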
__init__(self, filename, tagspec, elt_handler=None)
Create a new corpus view based on a specified XML file.

handle_elt(self, elt, context)
Convert an element into an appropriate value for inclusion in the view.

read_block(self, stream, tagspec=None, elt_handler=None)
Read from stream until we find at least one element that matches
tagspec, and return the result of applying elt_handler to each
element found.
Return type: list of any
Inherited from util.StreamBackedCorpusView:
__add__, __getitem__, __len__, __mul__, __radd__, __rmul__, close,
iterate_from
Inherited from util.AbstractLazySequence:
__cmp__, __contains__, __hash__, __iter__, __repr__, count, index
Inherited from object:
__delattr__, __getattribute__, __new__, __reduce__, __reduce_ex__,
__setattr__, __str__

_DEBUG = False
If true, then display debugging output to stdout when reading blocks.

_BLOCK_SIZE = 1024
The number of characters read at a time by this corpus reader.

_VALID_XML_RE = re.compile(r'(?sx) [^ <] * ( ( ( <!--.*? -->) | ( <![ CDAT...
A regular expression that matches XML fragments that do not contain
any un-closed tags.

_XML_TAG_NAME = re.compile(r'<\s* /? \s* ( [ ^ \s>] + ) ')
A regular expression used to extract the tag name from a start tag,
end tag, or empty-elt tag string.

_XML_PIECE = re.compile(r'(?sx) (?P< COMMENT > <!--.*? -->) | (?P< CDA ...
A regular expression used to find all start-tags, end-tags, and
empty-elt tags in an XML file.

_tagspec
The tag specification for this corpus view.

_tag_context
A dictionary mapping from file positions (as returned by
stream.seek()) to XML contexts.

__init__(self, filename, tagspec, elt_handler=None)
(Constructor)
Create a new corpus view based on a specified XML file.
Note that the XMLCorpusView constructor does not take an
encoding argument, because the Unicode encoding is specified
by the XML files themselves.
- Overrides: util.StreamBackedCorpusView.__init__

handle_elt(self, elt, context)
Convert an element into an appropriate value for inclusion in the
view. Unless overridden by a subclass or by the elt_handler
constructor argument, this method simply returns elt.
- Parameters:
  - elt (ElementTree) - The element that should be converted.
  - context (str) - A string composed of element tags separated by
    forward slashes, indicating the XML context of the given element.
    For example, the string 'foo/bar/baz' indicates that the element
    is a baz element whose parent is a bar element and whose
    grandparent is a top-level foo element.
- Returns: The view value corresponding to elt.
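To make the context string concrete, here is a small sketch of how a handler might take it apart. The helper name `split_context` is hypothetical; only the 'foo/bar/baz' format comes from the description above.

```python
def split_context(context):
    """Split a context string into its tag names, outermost first."""
    return context.split("/")

tags = split_context("foo/bar/baz")
element_tag = tags[-1]                            # the element itself
parent_tag = tags[-2] if len(tags) > 1 else None  # its parent, if any
top_level_tag = tags[0]                           # the top-level ancestor
```

A handler could use values like `parent_tag` to treat the same tag differently depending on where it appears in the tree.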

Read a string from the given stream that does not contain any
un-closed tags. In particular, this function first reads a block from
the stream of size self._BLOCK_SIZE. It then checks whether that block
contains an un-closed tag. If it does, then this function either
backtracks to the last '<', or reads another block.
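The read-then-backtrack strategy can be sketched as follows. This is a simplified illustration, not the library's implementation: the function name `read_xml_fragment` and the pattern `VALID_FRAGMENT` are hypothetical stand-ins, the pattern ignores comments, CDATA sections, and DOCTYPE declarations, and an in-memory `io.StringIO` stands in for the corpus stream.

```python
import io
import re

# Simplified stand-in for _VALID_XML_RE: text, then any number of complete
# <...> tags, each followed by text. Comments, CDATA sections and DOCTYPE
# declarations are deliberately not handled in this sketch.
VALID_FRAGMENT = re.compile(r"[^<]*(<[^>]*>[^<]*)*\Z")

def read_xml_fragment(stream, block_size=1024):
    """Read text from `stream` that contains no un-closed tag."""
    start = stream.tell()
    fragment = ""
    while True:
        block = stream.read(block_size)
        fragment += block
        if VALID_FRAGMENT.match(fragment):
            return fragment
        if not block:
            raise ValueError("unexpected end of input: tag not closed")
        # Try to backtrack to the last '<'; if the text before it is
        # complete, rewind the stream so the next call starts at that '<'.
        last_open = fragment.rfind("<")
        if last_open > 0 and VALID_FRAGMENT.match(fragment[:last_open]):
            stream.seek(start + last_open)
            return fragment[:last_open]
        # Otherwise loop around and read another block.

# Usage: a tiny block size forces both backtracking and extra reads.
source = "<a><b>text</b></a>"
stream = io.StringIO(source)
pieces = []
while True:
    piece = read_xml_fragment(stream, block_size=5)
    if not piece:
        break
    pieces.append(piece)
```

Each returned piece ends on a tag boundary, so concatenating the pieces reproduces the input without ever splitting a tag.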

read_block(self, stream, tagspec=None, elt_handler=None)
Read from stream until we find at least one element that
matches tagspec, and return the result of applying
elt_handler to each element found.
- Returns: list of any
  A block of tokens from the input stream.
- Overrides: util.StreamBackedCorpusView.read_block

_VALID_XML_RE
A regular expression that matches XML fragments that do not contain
any un-closed tags.
- Value:
re.compile(r'(?sx) [^ <] * ( ( ( <!--.*? -->) | ( <![ CDATA\[\.\*\?] \]) | ( <!DOCTYPE
\s+ [^ \[] * ( \[[^ \]] * \]) ? \s* >) | ( <[^ >] * >) ) [^ <] * ) * \Z')

_tag_context
A dictionary mapping from file positions (as returned by
stream.seek()) to XML contexts. An XML context is a tuple of
XML tag names, indicating which tags have not yet been closed.
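The notion of an XML context can be illustrated with a short sketch that tracks which tags are still open after a fragment of text. The names `TAG` and `open_tags` are hypothetical, the pattern is a simplification that ignores comments, processing instructions, and CDATA sections, and character offsets into a string stand in for file positions.

```python
import re

# Simplified tag matcher: optional '/', the tag name, any attributes,
# and an optional trailing '/' for empty-element tags.
TAG = re.compile(r"<\s*(/?)\s*([^\s/>]+)[^>]*?(/?)>")

def open_tags(fragment, context=()):
    """Return the tuple of tags still open after reading `fragment`."""
    stack = list(context)
    for match in TAG.finditer(fragment):
        closing, name, empty = match.groups()
        if closing:
            if not stack or stack[-1] != name:
                raise ValueError("unexpected end tag: %s" % name)
            stack.pop()
        elif not empty:
            stack.append(name)
        # Empty-element tags such as <br/> open and close at once.
    return tuple(stack)

# Map offsets to contexts, in the spirit of _tag_context.
fragment = "<foo><bar>text</bar><baz><br/>"
tag_context = {0: ()}
tag_context[len(fragment)] = open_tags(fragment, tag_context[0])
```

Storing the context at each position means reading can resume from that position later without re-parsing everything before it.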