libxml_saxlib -- Python Interface to libxml SAX Interface
What is libxml_saxlib?
libxml_saxlib is a Python extension module that enables you to use
the SAX interface to libxml2 to parse XML documents from Python.
libxml_saxlib exposes two functions to Python: one parses an
XML document in a file; the other parses an XML document in a
Python string. When you call either of these functions, you pass
in an instance of a Python "handler" class containing call-back
methods. libxml_saxlib calls these call-back methods while parsing
the XML document as various SAX events occur. For example, when
the start of an XML element is encountered, the startElement
method in your handler instance would be called. See below for a
description of the various call-back methods.
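The handler idea can be sketched as follows. This is a minimal illustration, not code from the distribution: the class ElementCounter and the function count_elements are hypothetical names, and the sketch assumes the extension module is importable under the name libxml_saxlib.

```python
class ElementCounter:
    """Hypothetical handler that counts start tags per element name."""

    def __init__(self):
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser for each start tag; attrs is a dictionary.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(doc_file_name):
    """Parse a file and return (result, counts)."""
    # Assumption: the extension module is importable as libxml_saxlib.
    # It is imported here so the class above can be used without it.
    import libxml_saxlib
    handler = ElementCounter()
    result = libxml_saxlib.parse_file(handler, doc_file_name)
    return result, handler.counts
```

Only startElement is defined here, so all other events are simply ignored.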
You can learn more about libxml at the libxml home page,
http://xmlsoft.org.
Installation
See the README file. Basically, you will need to do the following:
- Build and install libxml2. You can find it at
http://xmlsoft.org.
- Un-compress and un-tar libxml_saxlib. Something like the
following should work:
tar xzvf libxml_saxlib-1.0a.tar.gz
- Build libxml_saxlib with something like:
python setup.py build
python setup.py install
Since libxml_saxlib needs the libxml2 libraries, you will need to
make them "findable". On my Linux machine, I used the following:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
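One way to check that the build and the library path are set up correctly is to try importing the module. The helper below is a hypothetical sketch; it assumes the module installs under the name libxml_saxlib and that a missing module or missing libxml2 shared library surfaces as an ImportError.

```python
def saxlib_available():
    """Return True if the extension module can be imported, else False."""
    try:
        # Assumption: the module installs under the name libxml_saxlib.
        import libxml_saxlib  # noqa: F401
    except ImportError:
        # Raised when the module, or a shared library it needs,
        # cannot be found.
        return False
    return True
```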
Calling libxml_saxlib
libxml_saxlib exposes the following functions to Python:
parse_file
This function parses an XML document stored in a file.
Prototype:
result = parse_file(handler_object, doc_file_name)
Where:
- handler_object is an instance of a Python class
containing call-back methods. See below for a description of the
call-back methods.
- doc_file_name is the name of the file containing the
XML document to be parsed.
Returns:
- 0 if the document is well-formed.
- -1 otherwise.
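A call to parse_file might be wrapped like this. The names SilentHandler and is_well_formed are hypothetical, and the optional parse_file argument exists only so the sketch can be tried without the extension installed; by default the function is taken from a module assumed to be named libxml_saxlib.

```python
class SilentHandler:
    """Hypothetical handler with no call-back methods: all events are ignored."""


def is_well_formed(doc_file_name, parse_file=None):
    """Return True when parse_file reports 0 (well-formed), else False."""
    if parse_file is None:
        # Assumption: the extension module is importable as libxml_saxlib.
        from libxml_saxlib import parse_file
    # parse_file takes the handler first, then the file name.
    return parse_file(SilentHandler(), doc_file_name) == 0
```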
parse_string
This function parses an XML document stored in a string.
Prototype:
result = parse_string(handler_object, doc_string, doc_string_length)
Where:
- handler_object is an instance of a Python class
containing call-back methods. See below for a description of the
call-back methods.
- doc_string is a Python string containing the XML
document to be parsed.
- doc_string_length is the length of the (document)
string.
Returns:
- 0 if the document is well-formed.
- -1 otherwise.
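Because parse_string takes the length as a separate argument, a small wrapper can compute it. This is a sketch only: parse_xml_string is a hypothetical name, the optional parse_string argument is there for experimentation, and the use of len() assumes a plain (byte) string whose length in characters equals its length in bytes.

```python
def parse_xml_string(handler, doc_string, parse_string=None):
    """Parse doc_string and return the parser's result (0 or -1)."""
    if parse_string is None:
        # Assumption: the extension module is importable as libxml_saxlib.
        from libxml_saxlib import parse_string
    # The third argument is the length of the document string.
    return parse_string(handler, doc_string, len(doc_string))
```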
The Call-back Methods
Define any of the following methods in your "handler" class. Then
pass an instance of that class to the parser. The parser will
call each method, if it is defined, when the corresponding event
occurs. You may omit any of these methods, in which case no
call-back for that event occurs.
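The "call it only if it is defined" rule can be pictured in pure Python as below. This is an illustration of the behaviour described above, not the extension's actual implementation; fire_event and StartOnly are hypothetical names.

```python
def fire_event(handler, event_name, *args):
    """Call handler.<event_name>(*args) if it exists; report whether it did."""
    method = getattr(handler, event_name, None)
    if method is None:
        # The handler does not define this call-back: skip the event.
        return False
    method(*args)
    return True


class StartOnly:
    """Hypothetical handler that defines a single call-back."""

    def __init__(self):
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)
```

With a StartOnly instance, a startElement event is delivered while an endElement event is silently dropped.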
setDocumentLocator
This method provides a locator object. The locator object exposes
four methods (getPublicId, getSystemId, getLineNumber, and
getColumnNumber) that can be called during the parse of the
document. See "Datatype locatorobject" for more information on the
locator object.
Prototype:
def setDocumentLocator(self, locator)
startDocument
This method is called when the parser encounters the start of the
document.
Prototype:
def startDocument(self)
endDocument
This method is called when the parser encounters the end of the
document.
Prototype:
def endDocument(self)
startElement
This method is called when the parser encounters a start element
tag.
Prototype:
def startElement(self,
name, ## the element name
attrs) ## the attributes
Where:
- name is the element name (a Python string).
- attrs is the attributes. It is a Python dictionary
whose keys are the attribute names, and whose values are the
attribute values.
endElement
This method is called when the parser encounters an end element
tag.
Prototype:
def endElement(self,
name) ## the element name
Where:
- name is the element name (a Python string).
characters
This method is called when the parser encounters character data.
Prototype:
def characters(self,
chars) ## the characters
Where:
- chars is the character data (a Python string).
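The three call-backs above are often used together. The following sketch, with the hypothetical class name TextCollector, gathers the text directly inside each element. It joins the chunks at the end tag because SAX parsers in general may deliver one run of text through several characters calls; whether libxml_saxlib does so is an assumption here.

```python
class TextCollector:
    """Hypothetical handler that records the text directly inside each element."""

    def __init__(self):
        self.buffers = []   # one list of text chunks per open element
        self.texts = []     # (element name, text) pairs, in end-tag order

    def startElement(self, name, attrs):
        # Open a fresh buffer for the new element.
        self.buffers.append([])

    def characters(self, chars):
        # Text outside any element is ignored.
        if self.buffers:
            self.buffers[-1].append(chars)

    def endElement(self, name):
        # Join the chunks collected for the element being closed.
        text = ''.join(self.buffers.pop())
        self.texts.append((name, text))
```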
ignorableWhitespace
This method is called when the parser encounters ignorable
whitespace.
Prototype:
def ignorableWhitespace(self,
chars) ## the whitespace characters
Where:
- chars is the whitespace characters (a Python string).
reference
This method is called when the parser encounters an entity
reference.
Prototype:
def reference(self,
name) ## the name of the entity
Where:
- name is the name of the entity.
processingInstruction
This method is called when the parser encounters a
processing instruction.
Prototype:
def processingInstruction(self,
target, ## the target
data) ## data
Where:
- target is the target.
- data is the data.
comment
This method is called when the parser encounters an XML comment.
Prototype:
def comment(self,
chars) ## the characters in the comment
Where:
- chars is the characters in the comment.
warning
This method is called when the parser generates a warning.
Prototype:
def warning(self,
message) ## the warning message
Where:
- message is the text of the warning message.
error
This method is called when the parser generates an error message.
Prototype:
def error(self,
message) ## the error message
Where:
- message is the text of the error message.
fatalError
This method is called when the parser generates a fatal error message.
Prototype:
def fatalError(self,
message) ## the error message
Where:
- message is the text of the error message.
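The warning, error, and fatalError call-backs can be combined to collect diagnostics for inspection after the parse. This is a minimal sketch; DiagnosticsHandler and its count method are hypothetical names, not part of libxml_saxlib.

```python
class DiagnosticsHandler:
    """Hypothetical handler that stores parser messages instead of printing them."""

    def __init__(self):
        self.diagnostics = []   # (severity, message) pairs in arrival order

    def warning(self, message):
        self.diagnostics.append(('warning', message))

    def error(self, message):
        self.diagnostics.append(('error', message))

    def fatalError(self, message):
        self.diagnostics.append(('fatal', message))

    def count(self, severity):
        # Number of stored messages with the given severity.
        return len([s for s, _ in self.diagnostics if s == severity])
```

After parsing, a result of -1 together with a non-empty diagnostics list gives the reasons the document was rejected.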
cdataBlock
This method is called when the parser encounters a CDATA block.
Prototype:
def cdataBlock(self,
chars) ## the characters
Where:
- chars is the characters in the CDATA block.
externalSubset
This method is called when the parser encounters an external subset
declaration.
Prototype:
def externalSubset(self,
name, ## the name
externalID, ## the external ID
systemID) ## the system ID
Where:
- name is the name.
- externalID is the external ID.
- systemID is the system ID.
entityDecl
This method is called when the parser encounters an entity
declaration.
Prototype:
def entityDecl(self,
name, ## the entity name
type, ## the entity type
publicID, ## the public ID
systemID, ## the system ID
content) ## the content
Where:
- name is the entity name.
- type is the entity type.
- publicID is the public ID.
- systemID is the system ID.
- content is the content of the declaration.
notationDecl
This method is called when the parser encounters a notation
declaration.
Prototype:
def notationDecl(self,
name, ## the notation name
publicID, ## the public ID
systemID) ## the system ID
Where:
- name is the notation name.
- publicID is the public ID.
- systemID is the system ID.
attributeDecl
This method is called when the parser encounters an attribute
declaration.
Prototype:
def attributeDecl(self,
elem, ## the element name
name, ## the attribute name
type, ## the attribute type
defType, ## the type of the default value
defVal, ## the default value
enumValSet) ## enumerated value set
Where:
- elem is the name of the element for which this
attribute is defined.
- name is the attribute name.
- type is the attribute type.
- defType is the type of the default value.
- defVal is the default value.
- enumValSet is a Python list containing the names in the
enumerated value set.
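A handler might record attribute declarations per element as sketched below. AttributeDeclRecorder is a hypothetical name. The parameter named type in the prototype is called attr_type here to avoid shadowing the Python built-in, which assumes the parser passes arguments by position. What enumValSet holds when no enumeration is declared is not stated above, so the sketch treats any empty value as an empty list.

```python
class AttributeDeclRecorder:
    """Hypothetical handler that indexes attribute declarations by element."""

    def __init__(self):
        self.attributes = {}   # element name -> {attribute name -> details}

    def attributeDecl(self, elem, name, attr_type, defType, defVal, enumValSet):
        details = {
            'type': attr_type,
            'defType': defType,
            'defVal': defVal,
            # Assumption: an absent enumeration arrives as None or empty.
            'enum': list(enumValSet or []),
        }
        self.attributes.setdefault(elem, {})[name] = details
```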
elementDecl
This method is called when the parser encounters an element
declaration.
Prototype:
def elementDecl(self,
name, ## the element name
type, ## the element type
elementContent) ## the element content
Where:
- name is the element name.
- type is the element type (an integer).
- elementContent is a Python object of type
xmlelementcontent that defines the content. See the definition
of xmlelementcontent for a description of this datatype and the
methods it supports.
Datatype Extensions
Datatype xmlelementcontent
xmlelementcontent is a Python datatype extension that
represents an underlying C datatype xmlElementContent.
(See include/libxml/tree.h in the libxml source distribution.)
Instances of this datatype support the following methods:
- getName() -- Return the name of the element content.
- getElementType() -- Return the type of the element
content (a Python string). May be one of:
"PCDATA", "ELEMENT", "SEQ", "OR".
- getOccurance() -- Return the occurrence of the element
content (a Python string). May be one of: "ONCE", "OPT", "MULT",
"PLUS".
- getFirstChild() -- Return the first child of the element
content (an instance of xmlelementcontent or None).
- getSecondChild() -- Return the second child of the
element content (an instance of xmlelementcontent or None).
- getParent() -- Return the parent of the element content
(an instance of xmlelementcontent or None).
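The methods above are enough to walk a content model recursively. The sketch below turns an xmlelementcontent object into DTD-like text. The function name content_to_string is hypothetical, and the sketch assumes that "SEQ" and "OR" nodes carry their two operands as first and second child and that "OPT", "MULT", and "PLUS" correspond to the DTD suffixes ?, *, and +.

```python
# Assumed mapping from occurrence strings to DTD suffixes.
_OCCUR_SUFFIX = {'ONCE': '', 'OPT': '?', 'MULT': '*', 'PLUS': '+'}


def content_to_string(content):
    """Render an xmlelementcontent-like object as DTD-style text."""
    if content is None:
        return ''
    kind = content.getElementType()
    if kind == 'PCDATA':
        body = '#PCDATA'
    elif kind == 'ELEMENT':
        body = content.getName()
    else:
        # "SEQ" is a sequence (a , b); "OR" is a choice (a | b).
        sep = ' , ' if kind == 'SEQ' else ' | '
        left = content_to_string(content.getFirstChild())
        right = content_to_string(content.getSecondChild())
        body = '(' + left + sep + right + ')'
    return body + _OCCUR_SUFFIX[content.getOccurance()]
```

Called from an elementDecl call-back, this would give a readable summary of each declared element's content.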
Datatype locatorobject
locatorobject is a Python datatype extension that represents
an underlying C datatype xmlSAXLocator. (See
include/libxml/parser.h in the libxml source distribution.) This
underlying C type provides functions that can be used to obtain
information about the document being parsed and the state of the
parse. The Python datatype locatorobject exposes those
functions to Python.
Here is a description of the locator interface by the author of the
SAX2 specification:
"SAX parsers are strongly encouraged (though not absolutely
required) to supply a locator: if it does so, it must supply
the locator to the application by invoking this method before
invoking any of the other methods in the ContentHandler
interface.
"The locator allows the application to determine the end
position of any document-related event, even if the parser is
not reporting an error. Typically, the application will
use this information for reporting its own errors (such as
character content that does not match an application's
business rules). The information returned by the locator
is probably not sufficient for use with a search engine.
"Note that the locator will return correct information only
during the invocation of the events in this interface. The
application should not attempt to use it at any other time."
Instances of this Python locatorobject datatype support the
following methods:
- getPublicId() -- Return the public ID of the document
(a string).
- getSystemId() -- Return the system ID of the document
(a string).
- getLineNumber() -- Return the current line number (an
integer).
- getColumnNumber() -- Return the current column number
(an integer).
Here is an example of the use of the locator object:
class ExampleHandler:
    def __init__(self):
        self.locator = None

    def setDocumentLocator(self, locator):
        # Keep the locator so later call-backs can ask for the position.
        self.locator = locator

    def error(self, msg):
        print('*** Error. msg: "%s"' % msg)
        lineNumber = self.locator.getLineNumber()
        columnNumber = self.locator.getColumnNumber()
        print('*** Line: %d Column: %d' % (lineNumber, columnNumber))
Examples
See the files test_file.py and test_string.py in the
distribution for examples that use parse_file and
parse_string, respectively.
Additional Information
You can learn more about libxml at
http://www.xmlsoft.org/.
Last update: 7/6/01