libxml_domlib
Python Interface to
libxml DOM Interface
What is libxml_domlib?
libxml_domlib is a Python extension module that enables you to use
the DOM interface to libxml2 to parse XML documents from Python.
Currently, libxml_domlib gives Python access to the DOM tree and
limited ability to modify the tree. Those who need more of libxml's
DOM capabilities should look at the SWIG wrappers for libxml (also
available at this Web site).
libxml_domlib exposes two parsing functions to Python: one parses an
XML document in a file; the other parses an XML document in a
Python string.
These functions each create a DOM document tree. libxml_domlib
then provides methods that enable you to walk and inspect this
document tree.
The document tree is represented in C objects (created by libxml).
As you visit nodes in this tree, temporary Python objects are created
to serve as "surrogates" for these C objects. Therefore, an application
that visits relatively few nodes in a relatively large DOM document
tree will be "light-weight" in the sense that relatively few Python
objects will be created. This approach has the additional benefit that
the links between nodes (parent, child, sibling, etc.) are maintained
in the C objects and do not cause circular references in Python
objects. However, this approach has the disadvantage that if you visit
the same node repeatedly without holding a reference to it (so that
its reference count goes to zero between visits), then the Python
object will be recreated for each visit.
You can learn more about libxml at the libxml home page,
http://www.xmlsoft.org/.
Installation
See the README file. Basically, you will need to do the following:
- Build and install libxml2. You can find it at
  http://xmlsoft.org.
- Un-compress and un-tar libxml_domlib. Something like the
  following should work:
      tar xzvf libxml_domlib-1.0a.tar.gz
- Build libxml_domlib with something like:
      python setup.py build
      python setup.py install
Since libxml_domlib needs the libxml2 libraries, you will need to
make them "findable". On my Linux machine, I used the following:
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
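One quick way to confirm that Python can load the module after installation (and that the libxml2 shared libraries are findable) is simply to attempt the import. The following is a minimal sketch; the helper name can_import is hypothetical and is not part of libxml_domlib.

```python
def can_import(module_name):
    """Return True if module_name can be imported, else False."""
    try:
        __import__(module_name)
    except ImportError:
        # Raised when the module is missing, and also when the dynamic
        # linker cannot find a shared library that an extension needs.
        return False
    return True
```

Calling can_import('libxml_domlib') should return True once the build, the install, and the library path are all in order.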
Calling libxml_domlib
libxml_domlib exposes the following functions to Python:
parse_file
This function parses an XML document stored in a file.
Prototype:
document = parse_file(doc_file_name, messageHandler)
Where:
- doc_file_name is the name of the file containing the
XML document to be parsed.
- messageHandler is an instance of an error handler class
or None. An error handler class is a class that contains a
write method that takes a single argument (the text of the
error message). The error handling mechanism of the parser
will call this method for each warning, error, and fatal error
encountered, passing the text of the message. If this argument is
None, normal error processing will occur, i.e. the message will be
written to stdout.
Returns: A DOM document tree. This is a Python object of type
domlib_doc. See the section on domlib_doc.
Call the destroyTree method on the document object when you are
finished with the document tree in order to free up the memory used
by the underlying object.
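As an illustration of the error handler contract described above, here is a minimal sketch of a handler that keeps messages in memory instead of writing them anywhere. The class name CollectingHandler is hypothetical; the only thing the description above requires is a write method taking one argument.

```python
class CollectingHandler:
    """Error handler that keeps each message in a list."""

    def __init__(self):
        self.messages = []

    def write(self, msg):
        # Called once per warning, error, or fatal error,
        # with the text of the message.
        self.messages.append(msg)
```

An instance could be passed as the second argument to parse_file, and its messages attribute inspected after the call returns.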
parse_string
This function parses an XML document stored in a string.
Prototype:
document = parse_string(doc_string, doc_string_length, messageHandler)
Where:
- doc_string is a Python string containing the XML
document to be parsed.
- doc_string_length is the length of the (document)
string.
- messageHandler is an instance of an error handler class
or None. An error handler class is a class that contains a
write method that takes a single argument (the text of the
error message). The error handling mechanism of the parser
will call this method for each warning, error, and fatal error
encountered, passing the text of the message. If this argument is
None, normal error processing will occur, i.e. the message will be
written to stdout.
Returns: A DOM document tree. This is a Python object of type
domlib_doc.
See the section on domlib_doc.
Call the destroyTree method on the document object when you are
finished with the document tree in order to free up the memory used
by the underlying object.
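Because the tree must be freed explicitly, it can help to wrap parsing and cleanup in one place. The sketch below is only one possible arrangement: the function name process_string and its lib and callback parameters are hypothetical, and it assumes (as the examples in this document do) that a failed parse yields a false value.

```python
def process_string(lib, xml_text, callback, handler=None):
    """Parse xml_text, pass the document to callback, then free the tree.

    lib is the imported libxml_domlib module, passed in so that this
    sketch does not depend on how the module was imported.
    """
    doc = lib.parse_string(xml_text, len(xml_text), handler)
    if not doc:
        # Assumption: a false result means the parse failed.
        return None
    try:
        return callback(doc)
    finally:
        # Always release the underlying libxml tree, even on error.
        doc.destroyTree()
```

The callback should return plain Python values (strings, lists, and so on) rather than node objects, since surrogates obtained from the document must not be used after destroyTree has been called.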
new_node
Create a new node.
Prototype:
node = new_node(elementName)
Where:
- elementName is the name of the element, i.e. the tag
name.
Datatype domlib_doc
domlib_doc is a Python surrogate object that holds a
reference to an underlying libxml document object.
A domlib_doc object supports the following methods:
getElementName
Return the (document) element name (a string).
Prototype:
name = getElementName()
getRootElement
Return the root element of the document. This is a Python object
of type domlib_node.
(See the section on domlib_node.)
Prototype:
root = getRootElement()
destroyTree
Destroy the underlying document tree and free its memory. After
calling this method, you must not use this object or any of the
objects (nodes, attributes, etc.) that you obtained from it.
Prototype:
destroyTree()
Datatype domlib_node
domlib_node is a Python surrogate object that holds a
reference to an underlying libxml element/node object.
A domlib_node object supports the following methods:
getElementType
Return the type of the element (a Python string). May be one of:
"ELEMENT", "ATTRIBUTE", "TEXT", "CDATA", "ENTITY", "PI", "COMMENT",
"DOCUMENT", "FRAG", "NOTATION", "HTML", "DTD", "ELEMENT_DECL",
"ATTRIBUTE_DECL", "ENTITY_DECL", "NAMESPACE_DECL",
"XINCLUDE_START", "XINCLUDE_END", "DOCB_DOCUMENT".
Prototype:
elementType = node.getElementType()
getElementName
Return the name of the element (a Python string).
Prototype:
name = node.getElementName()
getElementContent
Return the (text) content of an element (a Python string).
Prototype:
content = node.getElementContent()
getAttributes
Return the attributes of the element (a Python domlib_attrs
object or None). See the section on
Datatype domlib_attrs (attributes).
Prototype:
attribute = node.getAttributes()
getFirstChild
Return the first (left-most) child of the element (a Python
domlib_node object).
Prototype:
child = node.getFirstChild()
getLastChild
Return the last (right-most) child of the element (a Python
domlib_node object).
Prototype:
child = node.getLastChild()
getNextSibling
Return the next sibling of the element (a Python domlib_node object
or None). The next sibling of this element is the same as the next
child of the parent of this element.
Prototype:
sibling = node.getNextSibling()
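Taken together, getFirstChild and getNextSibling are enough to visit all children of a node in order. The following generator is a minimal sketch; the name iter_children is hypothetical, and it assumes that getFirstChild returns None when a node has no children (the description above does not say so explicitly).

```python
def iter_children(node):
    """Yield each child of node, from first (left-most) to last."""
    # Assumption: getFirstChild() returns None for a childless node.
    child = node.getFirstChild()
    while child is not None:
        yield child
        child = child.getNextSibling()
```

For example, [c.getElementName() for c in iter_children(root)] would collect the names of the root element's children.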
getPreviousSibling
Return the previous sibling of the element (a Python domlib_node
object or None). The previous sibling of this element is the same as
the previous child of the parent of this element.
Prototype:
sibling = node.getPreviousSibling()
getParent
Return the parent of the element (a Python domlib_node
object or None).
Prototype:
parent = node.getParent()
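Repeated calls to getParent can be used to find where a node sits in the tree. The sketch below builds a slash-separated path of element names; the function name element_path is hypothetical. It checks getElementType so that non-element ancestors (for instance a document node, if getParent returns one for the root element) are skipped rather than assumed away.

```python
def element_path(node):
    """Return a '/'-separated path of element names ending at node."""
    names = []
    current = node
    while current is not None:
        # Only element nodes contribute a path step.
        if current.getElementType() == "ELEMENT":
            names.append(current.getElementName())
        current = current.getParent()
    names.reverse()
    return "/" + "/".join(names)
```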
getNameSpace
Return the namespace of the element (a Python domlib_ns
object or None). See the section on
Datatype domlib_ns (namespace).
Prototype:
namespace = node.getNameSpace()
getDocument
Get the document (an instance of the Python domlib_doc type) that
contains this element.
Prototype:
document = node.getDocument()
addChild
Add a new node as a child of the current node. If the current node
already has children, then the new node is added as the last
(right-most) child. Adjacent text nodes are merged, so the new node
may be freed.
Note that the libxml_domlib module has a function, new_node, which
can be used to create a new node.
Prototype:
currentNode.addChild(childNode)
Where:
- childNode is the child to be added to the current node.
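Combining new_node with addChild gives a simple way to append an element. This is a minimal sketch: the helper name append_element and its lib parameter (the imported libxml_domlib module) are hypothetical. Since new_node takes an element name, the text-merging caveat above is assumed not to apply to the node it creates.

```python
def append_element(lib, parent, element_name):
    """Create an element named element_name and add it as the last child."""
    child = lib.new_node(element_name)
    parent.addChild(child)
    return child
```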
unlinkNode
Unlink a node from its current context. The node is not freed.
Prototype:
node.unlinkNode()
freeNode
Free a node. This is recursive: all of the node's children are
freed, too. This doesn't unlink the node from its tree; call
unlinkNode() first. Do not use this node after freeing it.
Prototype:
node.freeNode()
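The unlink-then-free order described above can be captured in a small helper so that the two calls are never separated. The name remove_node is hypothetical.

```python
def remove_node(node):
    """Detach node from its tree, then free it along with its children."""
    node.unlinkNode()
    node.freeNode()
    # The surrogate object must not be used after this point.
```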
replaceNode
Unlink the old node from its current context and insert the new one
in its place. If the new node was already linked into the document,
it is first unlinked from its existing context.
Prototype:
currentNode.replaceNode(oldNode, newNode)
Where:
- oldNode is the node to be replaced.
- newNode is the node to be inserted.
addSibling
Add a new node to the list of siblings of the current node,
merging adjacent TEXT nodes. If the new node was already linked
into the document, it is first unlinked from its existing context.
As a result of text merging, the new node may be freed.
Prototype:
currentNode.addSibling(newNode)
Where:
- newNode is the node to be added.
addNextSibling
Add a new node as the next sibling of the current node. If the
new node was already linked into the document, it is first
unlinked from its existing context. As a result of text merging,
the new node may be freed.
Prototype:
currentNode.addNextSibling(newNode)
Where:
- newNode is the node to be added.
addPrevSibling
Add a new node as the previous sibling of the current node. If the
new node was already linked into the document, it is first
unlinked from its existing context. As a result of text merging,
the new node may be freed.
Prototype:
currentNode.addPrevSibling(newNode)
Where:
- newNode is the node to be added.
addAttribute
Add a new attribute to the node or replace an existing attribute of
the same name.
Prototype:
node.addAttribute(attributeName, attributeValue)
Where:
- attributeName is the name of the attribute.
- attributeValue is the value of the attribute.
hasAttribute
Determine whether a node has an attribute of a given name.
Return 1 if the node has the attribute, else return None.
Prototype:
result = node.hasAttribute(attributeName)
Where:
- attributeName is the name of the attribute.
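Because addAttribute replaces an existing attribute of the same name, hasAttribute is useful when a value should only be supplied as a default. A minimal sketch follows; the helper name ensure_attribute is hypothetical.

```python
def ensure_attribute(node, name, default):
    """Add attribute name=default unless the node already has it.

    Returns True if the attribute was added, False if it already existed.
    """
    # hasAttribute returns 1 or None, so plain truth testing is enough.
    if node.hasAttribute(name):
        return False
    node.addAttribute(name, default)
    return True
```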
removeAttribute
Remove an attribute of a given name from a node.
Return 1 if the node has the attribute, else return None.
Prototype:
result = node.removeAttribute(attributeName)
Where:
- attributeName is the name of the attribute.
Datatype domlib_ns (namespace)
domlib_ns is a Python surrogate object that holds a
reference to an underlying libxml namespace object.
A domlib_ns object supports the following methods:
getNameSpaceType
Return the type of the namespace (a Python string). May be one of:
"ELEMENT", "ATTRIBUTE", "TEXT", "CDATA", "ENTITY", "PI", "COMMENT",
"DOCUMENT", "FRAG", "NOTATION", "HTML", "DTD", "ELEMENT_DECL",
"ATTRIBUTE_DECL", "ENTITY_DECL", "NAMESPACE_DECL",
"XINCLUDE_START", "XINCLUDE_END", "DOCB_DOCUMENT".
Prototype:
namespacetype = getNameSpaceType()
getNext
Return the next namespace (an instance of the Python datatype
domlib_ns).
Prototype:
namespace = getNext()
getNameSpaceHref
Return the URL of the namespace.
Prototype:
url = getNameSpaceHref()
getNameSpacePrefix
Return the prefix of the namespace.
Prototype:
prefix = getNameSpacePrefix()
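The namespace methods above can be combined to list a node's namespace together with any namespaces chained after it. This is a sketch under one assumption that the descriptions above do not state: that getNext returns None at the end of the chain. The function name namespace_chain is hypothetical.

```python
def namespace_chain(node):
    """Return [(prefix, href), ...] starting from the node's namespace."""
    result = []
    ns = node.getNameSpace()
    # Assumption: getNext() returns None when no namespace follows.
    while ns is not None:
        result.append((ns.getNameSpacePrefix(), ns.getNameSpaceHref()))
        ns = ns.getNext()
    return result
```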
Datatype domlib_attrs (attributes)
domlib_attrs is a Python surrogate object that holds a
reference to an underlying libxml attribute object.
A domlib_attrs object supports the following methods:
getName
Return the name of the attribute (a string).
Prototype:
name = getName()
getValue
Return the value of the attribute (a string).
Prototype:
value = getValue()
getNext
Return the next attribute in the list or None.
Prototype:
attr = getNext()
getPrevious
Return the previous attribute in the list or None.
Prototype:
attr = getPrevious()
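Since the attributes of a node form a linked list, a common need is to gather them into a dictionary. The following is a minimal sketch that uses only the methods described in this document (getAttributes on the node, then getName, getValue, and getNext on each attribute); the function name attributes_as_dict is hypothetical.

```python
def attributes_as_dict(node):
    """Return the node's attributes as a {name: value} dictionary."""
    attrs = {}
    attr = node.getAttributes()  # first attribute, or None
    while attr is not None:
        attrs[attr.getName()] = attr.getValue()
        attr = attr.getNext()    # next attribute, or None at the end
    return attrs
```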
Examples
Here are a couple of simple examples:
    import sys
    import libxml_domlib as lib


    class MessageHandler:
        """Error handler that appends each message to errors.log."""

        def __init__(self):
            pass

        def write(self, msg):
            log = open('errors.log', 'a')
            log.write(msg + '\n')
            log.close()


    # Parse a document. Print the name of the root element.
    def test(docName):
        doc = lib.parse_file(docName, None)
        if doc:
            root = doc.getRootElement()
            name = root.getElementName()
            print('root name: %s' % name)
            # Free the underlying tree once we are done with it.
            doc.destroyTree()
        else:
            print("(test) Can't parse document: %s" % docName)


    # Parse a document. Print the name of the root element.
    # Catch error messages and append them to a log file.
    def testHandler(docName):
        handler = MessageHandler()
        doc = lib.parse_file(docName, handler)
        if doc:
            root = doc.getRootElement()
            name = root.getElementName()
            print('root name: %s' % name)
            doc.destroyTree()
        else:
            print("(testHandler) Can't parse document: %s" % docName)


    def usage():
        print('Usage: python simple_test.py xml_file')
        sys.exit(-1)


    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            usage()
        test(args[0])
        print('============================')
        testHandler(args[0])


    if __name__ == '__main__':
        main()
See the files visit_file.py and visit_string.py in
the distribution for slightly more extensive examples that use
parse_file and parse_string, respectively.
Additional Information
You can learn more about libxml at
http://www.xmlsoft.org/.
Last update: 10/19/01