libxml_domlib

Python Interface to the libxml DOM Interface


What is libxml_domlib?

libxml_domlib is a Python extension module that enables you to use the DOM interface to libxml2 to parse XML documents from Python.

Currently, libxml_domlib gives Python access to the DOM tree and limited ability to modify the tree. Those who need more of libxml's DOM capabilities should look at the SWIG wrappers for libxml (also available at this Web site).

libxml_domlib exposes two parsing functions to Python: one parses an XML document in a file; the other parses an XML document in a Python string.

These functions each create a DOM document tree. libxml_domlib then provides methods that enable you to walk and inspect this document tree.

Note that the document tree is represented in C objects (created by libxml). As you visit nodes in this tree, temporary Python objects are created to serve as "surrogates" for these C objects. Therefore, an application that visits relatively few nodes in a relatively large DOM document tree will be "light-weight" in the sense that relatively few Python objects will be created. This approach has the additional benefit that the links between nodes (parent, child, sibling, etc.) are maintained in the C objects and do not cause circular references in Python objects. However, this approach has the disadvantage that if you visit the same node repeatedly without holding a reference to it (so that its reference count goes to zero between visits), then the Python object will be recreated for each visit.
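The trade-off described above can be pictured with a small pure-Python model. This is only an illustration: `CNode` and `Surrogate` are hypothetical stand-ins for the libxml C structures and for the extension's wrapper objects; they are not part of libxml_domlib.

```python
class CNode(object):
    """Hypothetical stand-in for a node owned by the C library."""

    def __init__(self, name, first_child=None):
        self.name = name
        self.first_child = first_child


class Surrogate(object):
    """Hypothetical stand-in for a Python wrapper around a CNode."""

    created = 0  # counts how many wrapper objects have been built

    def __init__(self, cnode):
        Surrogate.created += 1
        self._cnode = cnode

    def getElementName(self):
        return self._cnode.name

    def getFirstChild(self):
        # A fresh wrapper is built on every call; the tree links
        # themselves live only in the CNode objects.
        child = self._cnode.first_child
        if child is None:
            return None
        return Surrogate(child)


root = Surrogate(CNode('doc', CNode('title')))

# Visiting the child three times without keeping it builds three wrappers.
for _ in range(3):
    root.getFirstChild().getElementName()
wrappers_without_reference = Surrogate.created - 1  # exclude the root wrapper

# Holding a reference builds the wrapper only once.
Surrogate.created = 0
child = root.getFirstChild()
for _ in range(3):
    child.getElementName()
wrappers_with_reference = Surrogate.created
```

In this model, keeping `child` in a variable is what avoids the repeated construction; the same habit should help when a real node is visited many times.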

You can learn more about libxml at the libxml home page, http://xmlsoft.org.

Installation

See the README file. Basically, you will need to do the following:

Build and install libxml2. You can find it at http://xmlsoft.org.
Un-compress and un-tar libxml_domlib. Something like the following should work:
```
        tar xzvf libxml_domlib-1.0a.tar.gz
```

Build libxml_domlib with something like:

        python setup.py build
        python setup.py install

Since libxml_domlib needs the libxml2 libraries, you will need to make them "findable". On my Linux machine, I used the following:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Calling libxml_domlib

libxml_domlib exposes the following functions to Python:

parse_file

This function parses an XML document stored in a file.

Prototype:

    document = parse_file(doc_file_name, messageHandler)

Where:

doc_file_name is the name of the file containing the XML document to be parsed.
messageHandler is an instance of an error handler class or None. An error handler class is a class that contains a write method that takes a single argument (the text of the error message). The parser's error handling mechanism will call this method for each warning, error, and fatal error encountered, passing the text of the message. If this argument is None, normal error processing will occur, i.e. the message will be written to stdout.

Returns: A DOM document tree. This is a Python object of type domlib_doc. See the section on domlib_doc.

Call the destroyTree method on the document object when you are finished with the document tree in order to free the memory used by the underlying object.
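As a sketch of the messageHandler argument, the class below keeps messages in a list instead of printing them. The names `CollectingHandler` and `parse_and_report` are made up for this illustration. The parse function is passed in as a parameter (in real use it would be `libxml_domlib.parse_file`) so that the sketch can be tried without the extension installed.

```python
class CollectingHandler(object):
    """Error handler that keeps messages in memory instead of printing them."""

    def __init__(self):
        self.messages = []

    def write(self, msg):
        # Called once per warning, error, or fatal error.
        self.messages.append(msg)


def parse_and_report(parse_file, file_name):
    """Parse file_name and return (document, list of messages).

    parse_file is assumed to behave like libxml_domlib.parse_file:
    it takes a file name and a handler and returns a document object.
    """
    handler = CollectingHandler()
    document = parse_file(file_name, handler)
    return document, handler.messages
```

The caller still owns the returned document and should call its destroyTree method when finished with it.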

parse_string

This function parses an XML document stored in a string.

Prototype:

    document = parse_string(doc_string, doc_string_length, messageHandler)

Where:

doc_string is a Python string containing the XML document to be parsed.
doc_string_length is the length of the (document) string.
messageHandler is an instance of an error handler class or None. An error handler class is a class that contains a write method that takes a single argument (the text of the error message). The parser's error handling mechanism will call this method for each warning, error, and fatal error encountered, passing the text of the message. If this argument is None, normal error processing will occur, i.e. the message will be written to stdout.

Returns: A DOM document tree. This is a Python object of type domlib_doc. See the section on domlib_doc.

Call the destroyTree method on the document object when you are finished with the document tree in order to free the memory used by the underlying object.
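Because parse_string takes the string and its length as separate arguments, a thin wrapper can keep the two in step. The following is a minimal sketch; `parse_text` is a hypothetical helper name, and the parse function is passed in (in real use, `libxml_domlib.parse_string`) so the sketch runs without the extension.

```python
def parse_text(parse_string, text, handler=None):
    """Parse an XML document held in a string.

    The length argument is always computed from the string that is
    actually being parsed, so the two cannot drift apart.
    """
    return parse_string(text, len(text), handler)
```

With the module installed, a call would look something like `parse_text(libxml_domlib.parse_string, '<doc/>')`.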

new_node

Create a new node.

Prototype:

    node = new_node(elementName)

Where:

elementName is the name of the element, i.e. the tag name.

Datatype domlib_doc

domlib_doc is a Python surrogate object that holds a reference to an underlying libxml document object.

A domlib_doc object supports the following methods:

getElementName

Return the (document) element name (a string).

Prototype:

    name = document.getElementName()

getRootElement

Return the root element of the document. This is a Python object of type domlib_node. (See the section on domlib_node.)

Prototype:

    root = document.getRootElement()

destroyTree

Destroy the underlying document tree and free its memory. After calling this method, you must not use this object or any of the objects (nodes, attributes, etc.) that you obtained from it.

Prototype:

    document.destroyTree()
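One way to make sure destroyTree is always called is to wrap the document in a small context manager. This is a sketch, not something the module provides: `DocumentGuard` and `root_name` are hypothetical names, and the code relies on the `with` statement, so it needs a Python version that has it (on older versions, try/finally gives the same effect).

```python
class DocumentGuard(object):
    """Context manager that calls destroyTree() when the block exits."""

    def __init__(self, document):
        self.document = document

    def __enter__(self):
        return self.document

    def __exit__(self, exc_type, exc_value, traceback):
        # Free the underlying tree even if the block raised.
        self.document.destroyTree()
        return False  # do not swallow exceptions


def root_name(document):
    """Return the root element's name, then destroy the tree."""
    with DocumentGuard(document) as doc:
        # Copy out a plain string before the tree is destroyed.
        return doc.getRootElement().getElementName()
```

Only plain Python values (such as the string returned here) should leave the block, since nodes obtained from the document are unusable once the tree is destroyed.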

Datatype domlib_node

domlib_node is a Python surrogate object that holds a reference to an underlying libxml element/node object.

A domlib_node object supports the following methods:

getElementType

Return the type of the element (a Python string). May be one of: "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA", "ENTITY", "PI", "COMMENT", "DOCUMENT", "FRAG", "NOTATION", "HTML", "DTD", "ELEMENT_DECL", "ATTRIBUTE_DECL", "ENTITY_DECL", "NAMESPACE_DECL", "XINCLUDE_START", "XINCLUDE_END", "DOCB_DOCUMENT".

Prototype:

    elementType = node.getElementType()

getElementName

Return the name of the element (a Python string).

Prototype:

    name = node.getElementName()

getElementContent

Return the (text) content of an element (a Python string).

Prototype:

    content = node.getElementContent()

getAttributes

Return the attributes of the element (a Python domlib_attrs object or None). See the section on Datatype domlib_attrs (attribute).

Prototype:

    attribute = node.getAttributes()

getFirstChild

Return the first (left-most) child of the element (a Python domlib_node object).

Prototype:

    child = node.getFirstChild()

getLastChild

Return the last (right-most) child of the element (a Python domlib_node object).

Prototype:

    child = node.getLastChild()

getNextSibling

Return the next sibling of the element (a Python domlib_node object or None). The next sibling of this element is the same as the next child of the parent of this element.

Prototype:

    sibling = node.getNextSibling()

getPreviousSibling

Return the previous sibling of the element (a Python domlib_node object or None). The previous sibling of this element is the same as the previous child of the parent of this element.

Prototype:

    sibling = node.getPreviousSibling()
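The child and sibling methods can be combined to visit all the children of a node. The sketch below assumes that getFirstChild returns None when a node has no children, which the descriptions above do not state explicitly. `iter_children` and `element_children` are hypothetical helper names.

```python
def iter_children(node):
    """Yield each child of node, from left to right."""
    child = node.getFirstChild()  # assumed to be None if there are no children
    while child is not None:
        yield child
        child = child.getNextSibling()


def element_children(node):
    """Return only the children whose type is "ELEMENT"."""
    return [child for child in iter_children(node)
            if child.getElementType() == 'ELEMENT']
```

Filtering on the type is often useful because whitespace between tags typically appears in the tree as TEXT nodes.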

getParent

Return the parent of the element (a Python domlib_node object or None).

Prototype:

    parent = node.getParent()
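Following getParent repeatedly gives the chain of ancestors of a node. The sketch below builds a path of element names; `element_path` is a hypothetical helper. It stops when the parent is None or is not of type "ELEMENT", since this document does not say which of the two the root element reports.

```python
def element_path(node):
    """Return a '/'-separated path of element names from the root to node."""
    names = []
    current = node
    # Stop at None, or at a non-element ancestor such as a document node.
    while current is not None and current.getElementType() == 'ELEMENT':
        names.append(current.getElementName())
        current = current.getParent()
    names.reverse()
    return '/' + '/'.join(names)
```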

getNameSpace

Return the namespace of the element (a Python domlib_ns object or None). See the section on Datatype domlib_ns (namespace).

Prototype:

    namespace = node.getNameSpace()

getDocument

Get the document (an instance of the Python domlib_doc type) that contains this element.

Prototype:

    document = node.getDocument()

addChild

Add a new node as a child of the current node. If the current node already has children, then the new node is added as the last (right-most) child. Adjacent text nodes are merged, so the new node may be freed.

Note that module libxml_domlib has a function, new_node, which can be used to create a new node.

Prototype:

    currentNode.addChild(childNode)

Where:

childNode is the child to be added to the current node.
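As a sketch of how new_node, addAttribute, and addChild fit together, the hypothetical helper below creates an element, sets its attributes, and appends it to a parent. The node factory is passed in as a parameter (in real use, `libxml_domlib.new_node`) so the sketch can be exercised without the extension.

```python
def append_element(new_node, parent, name, attributes=None):
    """Create an element called name, set its attributes, and append it."""
    child = new_node(name)
    # Sorted only so that attributes are added in a predictable order.
    for attr_name, attr_value in sorted((attributes or {}).items()):
        child.addAttribute(attr_name, attr_value)
    parent.addChild(child)
    # Returning the child assumes it was not merged away, which the text
    # above says can happen only as a result of text-node merging.
    return child
```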

unlinkNode

Unlink a node from its current context. The node is not freed.

Prototype:

    node.unlinkNode()

freeNode

Free a node. This is recursive: all of the node's children are freed, too. This does not unlink the node from the list it belongs to; call unlinkNode() first. Do not use this node after freeing it.

Prototype:

    node.freeNode()
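The order of the two calls matters, so it can help to keep them together in one place. A minimal sketch, with `delete_node` as a hypothetical helper name:

```python
def delete_node(node):
    """Detach node from its tree, then free it and all of its children."""
    node.unlinkNode()  # remove the node from its parent and siblings first
    node.freeNode()    # then release the node and its whole subtree
    # The surrogate now refers to freed memory; callers should drop it.
```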

replaceNode

Unlink the old node from its current context and insert the new one in its place. If the new node was already linked into the document, it is first unlinked from its existing context.

Prototype:

    currentNode.replaceNode(oldNode, newNode)

Where:

oldNode is the node to be replaced.
newNode is the node to be inserted.

addSibling

Add a new node to the list of siblings of the current node, merging adjacent TEXT nodes. If the new node was already linked into the document, it is first unlinked from its existing context. As a result of text merging, the new node may be freed.

Prototype:

    currentNode.addSibling(newNode)

Where:

newNode is the node to be added.

addNextSibling

Add a new node as the next sibling of the current node. If the new node was already linked into the document, it is first unlinked from its existing context. As a result of text merging, the new node may be freed.

Prototype:

    currentNode.addNextSibling(newNode)

Where:

newNode is the node to be added.

addPrevSibling

Add a new node as the previous sibling of the current node. If the new node was already linked into the document, it is first unlinked from its existing context. As a result of text merging, the new node may be freed.

Prototype:

    currentNode.addPrevSibling(newNode)

Where:

newNode is the node to be added.

addAttribute

Add a new attribute to the node or replace an existing attribute of the same name.

Prototype:

    node.addAttribute(attributeName, attributeValue)

Where:

attributeName is the name of the attribute.
attributeValue is the value of the attribute.

hasAttribute

Determine whether a node has an attribute of a given name.

Return 1 if the node has the attribute, else return None.

Prototype:

    result = node.hasAttribute(attributeName)

Where:

attributeName is the name of the attribute.

removeAttribute

Remove an attribute of a given name from a node.

Return 1 if the node had the attribute, else return None.

Prototype:

    result = node.removeAttribute(attributeName)

Where:

attributeName is the name of the attribute.
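Since hasAttribute returns 1 or None rather than a true boolean, a small wrapper can make calling code read more naturally. The helpers below are hypothetical and rely only on hasAttribute and addAttribute as described above.

```python
def has_attribute(node, name):
    """Return True or False instead of 1 or None."""
    return bool(node.hasAttribute(name))


def set_default_attribute(node, name, value):
    """Add the attribute only if the node does not already have it.

    Returns True if the attribute was added, False if it was present.
    """
    if has_attribute(node, name):
        return False
    node.addAttribute(name, value)
    return True
```

The check is needed because addAttribute on its own would replace an existing attribute of the same name.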

Datatype domlib_ns (namespace)

domlib_ns is a Python surrogate object that holds a reference to an underlying libxml namespace object.

A domlib_ns object supports the following methods:

getNameSpaceType

Return the type of the namespace.

Prototype:

    namespacetype = namespace.getNameSpaceType()

getNext

Return the next namespace (an instance of the Python datatype domlib_ns).

Prototype:

    nextNamespace = namespace.getNext()

getNameSpaceHref

Return the URL of the namespace.

Prototype:

    url = namespace.getNameSpaceHref()

getNameSpacePrefix

Return the prefix of the namespace.

Prototype:

    prefix = namespace.getNameSpacePrefix()

Datatype domlib_attrs (attributes)

domlib_attrs is a Python surrogate object that holds a reference to an underlying libxml attribute object.

A domlib_attrs object supports the following methods:

getName

Return the name of the attribute (a string).

Prototype:

    name = attribute.getName()

getValue

Return the value of the attribute (a string).

Prototype:

    value = attribute.getValue()

getNext

Return the next attribute in the list or None.

Prototype:

    attr = attribute.getNext()

getPrevious

Return the previous attribute in the list or None.

Prototype:

    attr = attribute.getPrevious()
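The attributes of a node form a linked list, so collecting them all means following getNext until it returns None. The sketch below assumes that getAttributes on a node returns the first attribute in that list; `attribute_items` is a hypothetical helper name.

```python
def attribute_items(node):
    """Return the node's attributes as a list of (name, value) pairs."""
    items = []
    attr = node.getAttributes()  # assumed: first attribute, or None
    while attr is not None:
        items.append((attr.getName(), attr.getValue()))
        attr = attr.getNext()
    return items
```

Passing the result to `dict()` gives a lookup by attribute name, and copying the values into plain Python strings means they remain usable after the tree is destroyed.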

Examples

Here are a couple of simple examples:

    import sys
    import libxml_domlib as lib
    
    class MessageHandler:
        # Error handler: append each message to the file errors.log.
        def write(self, msg):
            logFile = open('errors.log', 'a')
            logFile.write(msg + '\n')
            logFile.close()
    
    # Parse a document.  Print the name of the root element.
    def test(docName):
        doc = lib.parse_file(docName, None)
        if doc:
            root = doc.getRootElement()
            name = root.getElementName()
            print 'root name:', name
            doc.destroyTree()
        else:
            print "(test) Can't parse document:", docName
    
    # Parse a document.  Print the name of the root element.
    #  Catch error messages and append them to a log file.
    def testHandler(docName):
        handler = MessageHandler()
        doc = lib.parse_file(docName, handler)
        if doc:
            root = doc.getRootElement()
            name = root.getElementName()
            print 'root name:', name
            doc.destroyTree()
        else:
            print "(testHandler) Can't parse document:", docName
    
    def usage():
        print 'Usage: python simple_test.py xml_file'
        sys.exit(-1)
    
    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            usage()
        test(args[0])
        print '============================'
        testHandler(args[0])
    
    if __name__ == '__main__':
        main()

See the files visit_file.py and visit_string.py in the distribution for slightly more extensive examples that use parse_file and parse_string, respectively.

Additional Information

You can learn more about libxml at http://www.xmlsoft.org/.

Last update: 10/19/01