libxml_saxlib -- Python Interface to libxml SAX Interface

What is libxml_saxlib?

libxml_saxlib is a Python extension module that enables you to use the SAX interface to libxml2 to parse XML documents from Python.

libxml_saxlib exposes two functions to Python: one parses an XML document in a file; the other parses an XML document in a Python string. When you call either function, you pass in an instance of a Python "handler" class that defines call-back methods. libxml_saxlib calls these methods as the corresponding SAX events occur during the parse. For example, when the parser encounters the start of an XML element, it calls the startElement method of your handler instance. See below for a description of the various call-back methods.
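
As a rough sketch of that flow, the handler below defines only startElement and counts the start tags it is told about. The names ElementCounter and count_elements are made up for illustration, and the import name libxml_saxlib is an assumption based on the module name; the call itself follows the parse_string prototype given later in this document. The helper returns None when the extension is not installed.

```python
class ElementCounter:
    """Handler that counts start tags; it defines only the call-back it needs."""

    def __init__(self):
        self.counts = {}

    def startElement(self, name, attrs):
        # attrs is ignored here; its layout is not relied upon in this sketch.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(doc_string):
    """Parse doc_string and return a dict of tag counts, or None if the
    extension module cannot be imported."""
    try:
        import libxml_saxlib  # assumed import name
    except ImportError:
        return None
    handler = ElementCounter()
    # Argument order taken from the parse_string prototype in this document.
    libxml_saxlib.parse_string(handler, doc_string, len(doc_string))
    return handler.counts
```

Because the handler is an ordinary Python object, it can also be exercised without the parser by calling its methods directly, which is handy for unit tests.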

You can learn more about libxml at the libxml home page.


Installation

See the README file. Basically, you will need to do the following:

  1. Build and install libxml2. You can find it at http://xmlsoft.org.

  2. Un-compress and un-tar libxml_saxlib. Something like the following should work:

            tar xzvf libxml_saxlib-1.0a.tar.gz
    
  3. Build libxml_saxlib with something like:

            python setup.py build
            python setup.py install
    
Since libxml_saxlib needs the libxml2 libraries, you will need to make them "findable". On my Linux machine, I used the following:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
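
One way to check from Python whether a libxml2 shared library is visible is the standard ctypes.util.find_library function. This is only a hint: its search is not identical to what the dynamic loader does when the extension is imported, and the helper name find_libxml2 is made up for this sketch.

```python
import ctypes.util


def find_libxml2():
    """Return the name or path of the libxml2 shared library, or None."""
    # "xml2" is the library name without the "lib" prefix or file suffix.
    return ctypes.util.find_library("xml2")


location = find_libxml2()
if location is None:
    print("libxml2 not found; check LD_LIBRARY_PATH or the linker configuration")
else:
    print("libxml2 found: %s" % location)
```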

Calling libxml_saxlib

libxml_saxlib exposes the following functions to Python:

parse_file

This function parses an XML document stored in a file.

Prototype:

    result = parse_file(handler_object, doc_file_name)
Where:

Returns:

parse_string

This function parses an XML document stored in a string.

Prototype:

    parse_string(handler_object, doc_string, doc_string_length)
Where:

Returns:

The Call-back Methods

Define any of the following methods in your "handler" class. Then pass an instance of that class to the parser. The parser will call each method, if it is defined, when the corresponding event occurs. You may omit any of these methods, in which case no call-back for that event occurs.
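
The "call it only if it is defined" behavior can be pictured in pure Python as below. This is not the extension's actual code, just an illustration of the rule; the function dispatch and the class StartTagsOnly are hypothetical, and the events are fired by hand rather than by the parser.

```python
def dispatch(handler, event, *args):
    """Call handler.<event>(*args) if the handler defines it.

    Returns True if a call-back was invoked, False if it was skipped.
    """
    method = getattr(handler, event, None)
    if callable(method):
        method(*args)
        return True
    return False


class StartTagsOnly:
    """Handler that cares about start tags and nothing else."""

    def __init__(self):
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


handler = StartTagsOnly()
dispatch(handler, "startDocument")               # not defined: skipped
dispatch(handler, "startElement", "doc", None)   # defined: called
dispatch(handler, "endElement", "doc")           # not defined: skipped
```

The practical consequence is that a handler can stay small: define the two or three call-backs you need and ignore the rest.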

setDocumentLocator

This method provides a locator object. The locator object exposes four methods (getPublicId, getSystemId, getLineNumber, and getColumnNumber) which can be called during the parse of the document. See Datatype locatorobject for more information on the locator object.

Prototype:

    def setDocumentLocator(self, locator)

startDocument

This method is called when the parser encounters the start of the document.

Prototype:

    def startDocument(self)

endDocument

This method is called when the parser encounters the end of the document.

Prototype:

    def endDocument(self)

startElement

This method is called when the parser encounters a start element tag.

Prototype:

    def startElement(self,
        name,           ## the element name
        attrs)          ## the attributes
Where:

endElement

This method is called when the parser encounters an end element tag.

Prototype:

    def endElement(self,
        name)       ## the element name
Where:
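
A common use of the startElement/endElement pair is to keep a stack of the currently open elements. The sketch below does that and records the path of every element as it opens. PathTracker is a made-up name, the events are simulated by hand for the document <doc><item/><item/></doc>, and attrs is passed as None because its layout is not relied upon here.

```python
class PathTracker:
    """Track open elements with a stack and record each element's path."""

    def __init__(self):
        self.stack = []
        self.paths = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        # In a well-formed document the end tag matches the innermost
        # open element, so popping the top of the stack is enough.
        if self.stack and self.stack[-1] == name:
            self.stack.pop()


# Hand-simulated events for <doc><item/><item/></doc>
tracker = PathTracker()
tracker.startElement("doc", None)
tracker.startElement("item", None)
tracker.endElement("item")
tracker.startElement("item", None)
tracker.endElement("item")
tracker.endElement("doc")
```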

characters

This method is called when the parser encounters character data.

Prototype:

    def characters(self,
        chars)      ## the characters
Where:
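
SAX parsers in general are allowed to deliver one run of text in several characters calls; whether and when libxml_saxlib splits text is not stated here, so a cautious handler buffers the chunks and joins them when the element ends. The sketch below assumes simple leaf elements (no mixed content); TextCollector is a made-up name and the events are simulated by hand.

```python
class TextCollector:
    """Buffer character data and join it when the element ends."""

    def __init__(self):
        self._chunks = []
        self.texts = []

    def startElement(self, name, attrs):
        # Start a fresh buffer for this element (leaf elements only).
        self._chunks = []

    def characters(self, chars):
        self._chunks.append(chars)

    def endElement(self, name):
        self.texts.append((name, "".join(self._chunks)))
        self._chunks = []


# Hand-simulated events: the text of <title> arrives in two pieces.
collector = TextCollector()
collector.startElement("title", None)
collector.characters("Hello, ")
collector.characters("world")
collector.endElement("title")
```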

ignorableWhitespace

This method is called when the parser encounters ignorable whitespace.

Prototype:

    def ignorableWhitespace(self,
        chars)      ## the whitespace characters
Where:

reference

This method is called when the parser encounters an entity reference.

Prototype:

    def reference(self,
        name)       ## the name of the entity
Where:

processingInstruction

This method is called when the parser encounters a processing instruction.

Prototype:

    def processingInstruction(self,
        target,     ## the target
        data)       ## data
Where:

comment

This method is called when the parser encounters an XML comment.

Prototype:

    def comment(self,
        chars)      ## the characters in the comment
Where:

warning

This method is called when the parser generates a warning.

Prototype:

    def warning(self,
        message)    ## the warning message
Where:

error

This method is called when the parser generates an error message.

Prototype:

    def error(self,
        message)    ## the error message
Where:

fatalError

This method is called when the parser generates a fatal error message.

Prototype:

    def fatalError(self,
        message)    ## the error message
Where:
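
The three diagnostic call-backs (warning, error, and fatalError) take the same argument, so one handler can funnel them into a single list and decide afterwards how to react. This is a sketch only: DiagnosticCollector and its worst method are made-up names, and the messages are invented for the demonstration rather than taken from real parser output.

```python
class DiagnosticCollector:
    """Record every warning, error, and fatal error reported by the parser."""

    LEVELS = ["warning", "error", "fatal"]  # least to most severe

    def __init__(self):
        self.problems = []

    def warning(self, message):
        self.problems.append(("warning", message))

    def error(self, message):
        self.problems.append(("error", message))

    def fatalError(self, message):
        self.problems.append(("fatal", message))

    def worst(self):
        """Return the most severe level seen, or None if nothing was reported."""
        levels = [level for level, _ in self.problems]
        if not levels:
            return None
        return max(levels, key=self.LEVELS.index)


# Hand-simulated diagnostics with invented messages.
diagnostics = DiagnosticCollector()
diagnostics.warning("example warning")
diagnostics.error("example error")
```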

cdataBlock

This method is called when the parser encounters a CDATA block.

Prototype:

    def cdataBlock(self,
        chars)      ## the characters
Where:

externalSubset

This method is called when the parser encounters an external subset declaration.

Prototype:

    def externalSubset(self,
        name,       ## the name
        externalID, ## the external ID
        systemID)   ## the system ID
Where:

entityDecl

This method is called when the parser encounters an entity declaration.

Prototype:

    def entityDecl(self,
        name,       ## the entity name
        type,       ## the entity type
        publicID,   ## the public ID
        systemID,   ## the system ID
        content)    ## the content
Where:

notationDecl

This method is called when the parser encounters a notation declaration.

Prototype:

    def notationDecl(self,
        name,       ## the notation name
        publicID,   ## the public ID
        systemID)   ## the system ID
Where:

attributeDecl

This method is called when the parser encounters an attribute declaration.

Prototype:

    def attributeDecl(self,
        elem,       ## the element name
        name,       ## the attribute name
        type,       ## the attribute type
        defType,    ## the type of the default value
        defVal,     ## the default value
        enumValSet) ## enumerated value set
Where:

elementDecl

This method is called when the parser encounters an element declaration.

Prototype:

    def elementDecl(self,
        name,           ## the element name
        type,           ## the element type
        elementContent) ## the element content
Where:

Datatype Extensions

Datatype xmlelementcontent

xmlelementcontent is a Python datatype extension that represents an underlying C datatype xmlElementContent. (See include/libxml/tree.h in the libxml source distribution.)

Instances of this datatype support the following methods:

Datatype locatorobject

locatorobject is a Python datatype extension that represents an underlying C datatype xmlSAXLocator. (See include/libxml/parser.h in the libxml source distribution.) This underlying C type provides functions that can be used to obtain information about the document being parsed and the state of the parse. The Python datatype locatorobject exposes those functions to Python.

Here is a description of the locator interface by the author of the SAX2 specification:

"SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the ContentHandler interface.

"The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application's business rules). The information returned by the locator is probably not sufficient for use with a search engine.

"Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time."

Instances of this Python locatorobject datatype support the following methods:

Here is an example of the use of the locator object:

    class ExampleHandler:
        def __init__(self):
            self.locator = None

        def setDocumentLocator(self, locator):
            # Keep the locator so later call-backs can ask for the position.
            self.locator = locator

        def error(self, msg):
            print('*** Error. msg: "%s"' % msg)
            if self.locator is None:
                return  # the parser did not supply a locator
            lineNumber = self.locator.getLineNumber()
            columnNumber = self.locator.getColumnNumber()
            print('*** Line: %d  Column: %d' % (lineNumber, columnNumber))
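
Since the locator is only meaningful while a call-back is running, a handler that needs positions later should copy the numbers out during the event rather than query the locator afterwards. The sketch below shows that pattern. FakeLocator is a stand-in written purely for this demonstration (the real object comes from the parser), and PositionRecorder is a made-up name; only getLineNumber and getColumnNumber are taken from the locator interface described above.

```python
class FakeLocator:
    """Stand-in for the parser's locator object, for demonstration only."""

    def __init__(self):
        self.line = 1
        self.column = 1

    def getLineNumber(self):
        return self.line

    def getColumnNumber(self):
        return self.column


class PositionRecorder:
    """Record where each start tag was reported."""

    def __init__(self):
        self.locator = None
        self.positions = []

    def setDocumentLocator(self, locator):
        self.locator = locator

    def startElement(self, name, attrs):
        # Read the position now, while the event is being delivered.
        if self.locator is not None:
            self.positions.append(
                (name, self.locator.getLineNumber(), self.locator.getColumnNumber()))


# Hand-simulated parse: the fake locator is moved before each event.
locator = FakeLocator()
recorder = PositionRecorder()
recorder.setDocumentLocator(locator)
locator.line, locator.column = 1, 6
recorder.startElement("doc", None)
locator.line, locator.column = 2, 9
recorder.startElement("item", None)
```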

Examples

See the files test_file.py and test_string.py in the distribution for examples that use parse_file and parse_string, respectively.


Additional Information

You can learn more about libxml at http://www.xmlsoft.org/.


Last update: 7/6/01