PyXMLFAQ -- Python XML Frequently Asked Questions

Author:	Dave Kuhlman
Address:	dkuhlman@rexx.com http://www.rexx.com/~dkuhlman
Revision:	1.0a
Date:	May 23, 2006
Copyright:	Copyright (c) 2006 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT Open Source License http://www.opensource.org/licenses/mit-license.php.

Abstract

This document provides answers to common and frequent questions about processing XML with Python.

Questions

Where do I start?

Here are some things you can try (not necessarily in this order):

Get a book.
Try a high-level tool like minidom, Lxml, or ElementTree (see examples below).
Download and install PyXML (doing this is really quite easy), then try running a few examples in the demos directory. Note, however, that much of the support in PyXML has now been added to the Python standard library.
Extend one of the demos provided by PyXML. Add some Python logic of your own.

Definitions and concepts -- This document attempts to answer a few common questions, and by doing so, hopefully will enable you to do something useful quickly. This document does not attempt to explain XML from the ground up. For an explanation of XML concepts and abbreviations used in this document, here are several good places to look:

The "Cover Pages" at http://xml.coverpages.org/xml.html.
The XML page at Wikipedia: http://en.wikipedia.org/wiki/Xml

Where can I find Python support and software for XML?

And here is a list:

The Python XML SIG (special interest group): http://www.python.org/sigs/xml-sig
Cover Pages -- Extensible Markup Language (XML): http://xml.coverpages.org/xml.html
Cover Pages -- XML and Python: http://xml.coverpages.org/xmlPython.html
My own Web page contains guides and software. It is at http://www.rexx.com/~dkuhlman.
There is documentation on the XML support in the Python standard library at Structured Markup Processing Tools in the "Python Library Reference".
ElementTree, a DOM-like API, is available here: http://effbot.org/zone/element-index.htm
Lxml, which implements the ElementTree API and also provides XSLT and XPath and more, is available here: http://codespeak.net/lxml/.

How do I write a simple SAX application?

Here are some steps and hints:

Install PyXML. However, if you have a sufficiently recent version of Python, this step may not be necessary. Python now contains a good deal of XML support in its standard library. See: Structured Markup Processing Tools in the "Python Library Reference".
Run an example in the demo directory of PyXML. The demo/saxtrace.py is a reasonable one.
Copy saxtrace.py and make a few modifications. For example, modify the startElement, endElement, and characters methods in the document handler class so that they capture information from a specific element in a Python data structure.
The file doc/xml-howto.txt in the PyXML distribution contains quite a bit of how-to information on SAX.
The file doc/xml-ref.txt documents the Python SAX interface implemented by PyXML. Look there to find the definition of the handler call-back methods.

And, here are the steps you will need to make if you start from scratch:

Create a handler class. For details, see How do I write a SAX document handler?

Create the parser, set the document handler, and start the parse. For example, if your SAX handler class is MySaxDocumentHandler, this code should work:

import sys
from xml.sax import make_parser

def test(inFileName):
    outFile = sys.stdout
    # Create an instance of the handler.
    handler = MySaxDocumentHandler(outFile)
    # Create an instance of the parser.
    parser = make_parser()
    # Set the content handler.
    parser.setContentHandler(handler)
    inFile = open(inFileName, 'r')
    # Start the parse.
    parser.parse(inFile)
    inFile.close()

How do I write a SAX document handler?

A SAX handler is an instance of a Python class that implements the SAX handler interface. (Note that it can do other things as well. Whatever suits your needs.) When you start the SAX parse, the methods in the handler instance are called as the parser reaches specific events in the XML document. Since these methods respond to events, let's call them event handlers.

In this example, the events that we will be concerned with and that our example handler will respond to are:

startDocument -- Called before the parser starts processing the first (outer-most) element in the document.
endDocument -- Called when the parser reaches the end of the document.
startElement -- Called when the parser finds a start tag.
endElement -- Called when the parser finds an end tag.
characters -- Called when the parser finds data content within an element.

And, here is our example handler class and also a harness function test() that uses it:

import sys, string
from xml.sax import handler, make_parser

class MySaxDocumentHandler(handler.ContentHandler):             # [1]
    def __init__(self, outfile):                                # [2]
        self.outfile = outfile
        self.level = 0
        self.inInterest = 0
        self.interestData = []
        self.interestList = []
    def get_interestList(self):
        return self.interestList
    def set_interestList(self, interestList):
        self.interestList = interestList
    def startDocument(self):                                    # [3]
        print "--------  Document Start --------"
    def endDocument(self):                                      # [4]
        print "--------  Document End --------"
    def startElement(self, name, attrs):                        # [5]
        self.level += 1
        self.printLevel()
        self.outfile.write('Element: %s\n' % name)
        self.level += 2
        for attrName in attrs.keys():                           # [6]
            self.printLevel()
            self.outfile.write('Attribute -- Name: %s  Value: %s\n' % \
                (attrName, attrs.get(attrName)))
        self.level -= 2
        if name == 'interest':
            self.inInterest = 1
            self.interestData = []
    def endElement(self, name):                                 # [7]
        if name == 'interest':
            self.inInterest = 0
            interest = string.join(self.interestData, '')
            self.printLevel()
            self.outfile.write('Interest: ')
            self.outfile.write(interest)
            self.outfile.write('\n')
            self.interestList.append(interest)
        self.level -= 1
    def characters(self, chrs):                                 # [8]
        if self.inInterest:
            self.interestData.append(chrs)
    def printLevel(self):                                       # [9]
        for idx in range(self.level):
            self.outfile.write('  ')

def test(inFileName):
    outFile = sys.stdout
    # Create an instance of the Handler.
    handler = MySaxDocumentHandler(outFile)
    # Create an instance of the parser.
    parser = make_parser()
    # Set the content handler.
    parser.setContentHandler(handler)
    inFile = open(inFileName, 'r')
    # Start the parse.
    parser.parse(inFile)                                        # [10]
    # Alternatively, we could directly pass in the file name.
    #parser.parse(inFileName)
    inFile.close()
    # Print out a list of interests.
    interestList = handler.get_interestList()
    print 'Interests:'
    for interest in interestList:
        print '    %s' % (interest, )

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    test(args[0])

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Notes on the above code:

[1] We inherit from xml.sax.handler.ContentHandler, which provides default (do-nothing) implementations of the event handlers. This removes the need to implement all of the event handlers ourselves; we override only the ones we care about.
[2] A constructor that initializes several variables for our mini-application.
[3] The event handler startDocument is called when the parse starts, immediately before the first element.
[4] The event handler endDocument is called after all elements have been processed.
[5] The event handler startElement is called each time the opening tag of an element is reached. It is passed an element name (a string) and an object containing each of the attributes on the element (with the standard library parser, an instance of xml.sax.xmlreader.AttributesImpl).
[6] We list all the attributes on the element by iterating over the keys or names of the attributes and retrieving the value associated with each attribute name.
[7] The event handler endElement is called each time the closing tag of an element is reached. It receives the element name (a string). In our application we concatenate the collected substrings and write them out.
[8] The event handler characters is called for data content. Note that it may be called multiple times and you will have to concatenate the pieces. In our example, we collect the pieces in a list and later use string.join() to concatenate them together.
[9] The method printLevel is not part of the SAX interface, but is used by our mini-application.
[10] Parse the document. Events will be delivered to the event handler methods in the instance of our MySaxDocumentHandler class. Notice that we could have alternatively passed the parser the file name or a URL.
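
Note [8] is the one that most often trips people up, so here is a minimal sketch of just that part: collecting the pieces passed to characters and joining them when the element ends. It is written for Python 3 (unlike the listing above) and parses an in-memory string. The sample document, the class InterestCollector, and the function collect_interests are made up for illustration; the real people.xml may look different.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample input; the real people.xml may differ.
SAMPLE = b"""<people>
  <person id="1"><name>Alberta</name><interest>gardening &amp; reading</interest></person>
  <person id="2"><name>Bernardo</name><interest>sailing</interest></person>
</people>"""


class InterestCollector(ContentHandler):
    """Collect the text content of every <interest> element."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.in_interest = False
        self.chunks = []
        self.interests = []

    def startElement(self, name, attrs):
        if name == 'interest':
            self.in_interest = True
            self.chunks = []

    def characters(self, content):
        # May be called several times for one element, so only accumulate here.
        if self.in_interest:
            self.chunks.append(content)

    def endElement(self, name):
        if name == 'interest':
            self.in_interest = False
            # Join with an empty separator so no extra spaces are introduced.
            self.interests.append(''.join(self.chunks))


def collect_interests(xml_bytes):
    """Parse xml_bytes and return the list of interest strings."""
    handler = InterestCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.interests
```

Calling collect_interests(SAMPLE) returns one string per interest element, however many pieces the parser delivered for each.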

Here are some additional notes and tips concerning document handler classes:

You do not have to implement all of the event handlers defined in the SAX handler interface, and often you will not. In order not to have to do so, your handler class should inherit from a handler base class, for example xml.sax.handler.ContentHandler, which contains default implementations of all event handler methods.
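
To make that concrete, here is a small sketch (Python 3) of a handler that overrides a single event handler and inherits the rest from xml.sax.handler.ContentHandler. The names ElementCounter and count_elements are hypothetical and are not part of any library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Override only startElement; every other event uses the inherited no-op."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        # Count how many times each element name is opened.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_bytes):
    """Return a dict mapping element names to the number of occurrences."""
    handler = ElementCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts
```

Because endElement, characters, and the other event handlers are inherited, the parser can call them freely and nothing happens.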

How do I write a simple DOM application?

Here is a simple DOM application that uses minidom:

import sys
from xml.dom import minidom, Node

def showNode(node):
    if node.nodeType == Node.ELEMENT_NODE:
        print 'Element name: %s' % node.nodeName
        for (name, value) in node.attributes.items():
            print '    Attr -- Name: %s  Value: %s' % (name, value)
        if node.attributes.get('ID') is not None:
            print '    ID: %s' % node.attributes.get('ID').value

def main():
    doc = minidom.parse(sys.argv[1])
    node = doc.documentElement
    showNode(node)
    for child in node.childNodes:
        showNode(child)

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.
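
If you want to experiment without a file on disk, minidom can also parse a string. The sketch below (Python 3) assumes a small, made-up document in place of people.xml, and element_children is a hypothetical helper. It also shows why the nodeType test matters: the whitespace between tags appears in childNodes as text nodes.

```python
from xml.dom import minidom, Node

# Hypothetical sample input; the real people.xml may differ.
SAMPLE = """<people>
  <person id="1"><name>Alberta</name></person>
  <person id="2"><name>Bernardo</name></person>
</people>"""


def element_children(xml_text):
    """Return (tag name, attribute dict) for each element child of the root."""
    doc = minidom.parseString(xml_text)
    result = []
    for child in doc.documentElement.childNodes:
        # Whitespace between tags shows up as TEXT_NODE children; skip those.
        if child.nodeType == Node.ELEMENT_NODE:
            result.append((child.nodeName, dict(child.attributes.items())))
    return result
```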

How do I walk a DOM tree and access the information in it?

Here is a sample function with a few notes that follow:

import sys, string
from xml.dom import minidom, Node

def walk(parent, outFile, level):                               # [1]
    for node in parent.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            # Write out the element name.
            printLevel(outFile, level)
            outFile.write('Element: %s\n' % node.nodeName)
            # Write out the attributes.
            attrs = node.attributes                             # [2]
            for attrName in attrs.keys():
                attrNode = attrs.get(attrName)
                attrValue = attrNode.nodeValue
                printLevel(outFile, level + 2)
                outFile.write('Attribute -- Name: %s  Value: %s\n' % \
                    (attrName, attrValue))
            # Walk over any text nodes in the current node.
            content = []                                        # [3]
            for child in node.childNodes:
                if child.nodeType == Node.TEXT_NODE:
                    content.append(child.nodeValue)
            if content:
                strContent = string.join(content)
                printLevel(outFile, level)
                outFile.write('Content: "')
                outFile.write(strContent)
                outFile.write('"\n')
            # Walk the child nodes.
            walk(node, outFile, level+1)                        # [4]

def printLevel(outFile, level):
    for idx in range(level):
        outFile.write('    ')

def run(inFileName):                                            # [5]
    outFile = sys.stdout
    doc = minidom.parse(inFileName)
    rootNode = doc.documentElement
    level = 0
    walk(rootNode, outFile, level)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    run(args[0])


if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Here are some notes and explanations:

[1] Walk the tree.
[2] Iterate over each attribute name (the keys) in the node. For each attribute name, get the attribute node. Then get its value. A shorter form is: "attrs.get(attrName).nodeValue".
[3] Accumulate the data content from any text nodes that are immediately under the current node.
[4] Recursively walk over nested (child) nodes of the current node.
[5] Parse the document and walk the DOM tree.

Now, here is a version that uses generators, which are new to Python 2.2:

# from __future__ import generators                             # [1]

import sys, string
from xml.dom import minidom, Node

def walkTree(node):                                             # [2]
    if node.nodeType == Node.ELEMENT_NODE:
        yield node
        for child in node.childNodes:
            for n1 in walkTree(child):
                yield n1

def showNode(node, outFile):                                    # [3]
    outFile.write('=' * 50)
    outFile.write('\n')
    # Write out the element name.
    outFile.write('Element: %s\n' % node.nodeName)
    # Write out the attributes.
    attrs = node.attributes
    for attrName in attrs.keys():
        outFile.write('Attribute -- Name: %s  Value: %s\n' % \
            (attrName, attrs.get(attrName).nodeValue))
    # Walk over any text nodes in the current node.
    content = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            content.append(child.nodeValue)
    if content:
        strContent = string.join(content)
        outFile.write('Content: "')
        outFile.write(strContent)
        outFile.write('"\n')

def test(inFileName):                                           # [4]
    outFile = sys.stdout
    doc = minidom.parse(inFileName)
    rootNode = doc.documentElement
    level = 0
    for node in walkTree(rootNode):                             # [5]
        showNode(node, outFile)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    test(args[0])

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Some notes on the above code:

[1] Tell Python that you wish to use the generator feature. This is needed only in Python 2.2; in Python 2.3 and later, generators are always available, which is why the line is commented out here.
[2] Generate all and only the nodes in the DOM tree that are element nodes.
[3] Show a single element node.
[4] Parse the XML document.
[5] Iterate over all the nodes in the document tree and show each one.

A few more notes:

The version that uses a generator delivers all element nodes with no indication of the depth of nesting. If you need to know who is nested inside what, use the non-generator version.
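
One possible middle ground, sketched here under the assumption that a depth count is all you need, is a generator that yields each element together with its nesting level. This is a Python 3 variation on walkTree, not part of the original example; walk_with_depth and outline are hypothetical names.

```python
from xml.dom import minidom, Node


def walk_with_depth(node, depth=0):
    """Yield (element, depth) pairs, parents before children."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield node, depth
        for child in node.childNodes:
            for pair in walk_with_depth(child, depth + 1):
                yield pair


def outline(xml_text):
    """Return a list of (element name, depth) for the whole document."""
    doc = minidom.parseString(xml_text)
    return [(n.nodeName, d) for n, d in walk_with_depth(doc.documentElement)]
```

The caller can then use the depth for indentation, much as the non-generator version uses its level argument.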

The generator version shown here delivers parent nodes before child nodes. If you want child nodes before parent nodes, then use the following walkTree function instead:

def walkTree(node):
    if node.nodeType == Node.ELEMENT_NODE:
        for child in node.childNodes:
            for n1 in walkTree(child):
                yield n1
        yield node

Note how this function yields the current node after walking the child nodes.

Are there other DOM implementations for Python?

There are two other DOM-like implementations for Python, both of which seem quite good:

ElementTree -- See: ElementTree Overview
Lxml -- The lxml documentation describes it as follows:

"lxml is a Pythonic binding for the libxml2 and libxslt libraries. See the introduction for more information about background and goals.

"lxml follows the ElementTree API as much as possible, building it on top of the native libxml2 tree. See also the ElementTree compatibility overview.

"lxml also extends this API to expose libxml2 and libxslt specific functionality, such as XPath, Relax NG, XML Schema, XSLT, and c14n. Python code can be called from XPath expressions and XSLT stylesheets through the use of extension functions.

"In addition to the ElementTree API, lxml also features an API for implementing namespaces using tag specific element classes. This is a simple way to write arbitrary XML driven APIs on top of lxml.

"lxml also offers a SAX compliant API, that works with the SAX support in the standard library."

Learn about Lxml here: http://codespeak.net/lxml/

Here is an example using the Lxml DOM-like ElementTree API:

import sys
#from lxml import etree                                     # [1]
import elementtree.ElementTree as etree

def walk_tree(node, level):
    fill = show_level(level)
    print '%sElement name: %s' % (fill, node.tag, )
    for (name, value) in node.attrib.items():
        print '%s    Attr -- Name: %s  Value: %s' % (fill, name, value,)
    # attrib is a plain dictionary, so get() returns the value string itself.
    if node.attrib.get('ID') is not None:
        print '%s    ID: %s' % (fill, node.attrib.get('ID'), )
    children = node.getchildren()
    for child in children:
        walk_tree(child, level + 1)

def show_level(level):
    s1 = '    ' * level
    return s1

def test(inFileName):
    doc = etree.parse(inFileName)
    root = doc.getroot()
    walk_tree(root, 0)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    test(args[0])

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Notes:

[1] Since the DOM API of Lxml follows that of ElementTree, we can switch from using one to the other by changing our import statement. In addition, we can use code like the following to use Lxml if it is installed, and use ElementTree if Lxml is not installed:

try:
    from lxml import etree as ElementTree
    #print '*** using lxml'
except ImportError, e:
    try:
        from elementtree import ElementTree
        #print '*** using ElementTree'
    except ImportError, e:
        print '***'
        print '*** Error: Must install either ElementTree or lxml.'
        print '***'
        raise ImportError, 'must install either ElementTree or lxml'
doc = ElementTree.parse(specfilename)
root = doc.getroot()
# ... etc ...
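
Since Python 2.5, ElementTree has also been bundled with the standard library as xml.etree.ElementTree, so the fallback chain can end there instead of raising an error. Here is a minimal sketch of that idea, written for Python 3; root_tag is a hypothetical helper used only to show that the same call works with either module.

```python
try:
    # Prefer lxml when it is installed.
    from lxml import etree as ElementTree
except ImportError:
    # Otherwise fall back to the copy in the standard library.
    import xml.etree.ElementTree as ElementTree


def root_tag(xml_text):
    """Parse xml_text and return the tag name of the root element."""
    root = ElementTree.fromstring(xml_text)
    return root.tag
```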

How do I use DOM to add nodes to an existing XML document?

Note: This example uses lxml if it is installed, and, if lxml can't be found, uses ElementTree.

The following example reads and parses an XML document, creates a DOM tree, adds several nodes to that tree, then writes the modified tree back out to an XML document:

import sys
import os

Etree = None

def add_one_person(parent, id, name, interest):
    """Add a person element to the parent node.

    Give the new element an id attribute and name and interest
    sub-elements.
    """
    node = Etree.SubElement(parent, 'person')
    node.set('id', id)
    node1 = Etree.SubElement(node, 'name')
    node1.text = name
    node1 = Etree.SubElement(node, 'interest')
    node1.text = interest
    return node

def add_nodes(root):
    """Add several sub-elements to the root element.
    """
    add_one_person(root, '1005', 'Daniel', 'photography')
    add_one_person(root, '1006', 'Edward', 'gardening')

def test(inFileName, outFileName):
    global Etree
    try:
        from lxml import etree as ElementTree
        #print '*** using lxml'
    except ImportError, e:
        try:
            from elementtree import ElementTree
            #print '*** using ElementTree'
        except ImportError, e:
            print '***'
            print '*** Error: Must install either ElementTree or lxml.'
            print '***'
            raise ImportError, 'must install either ElementTree or lxml'
    Etree = ElementTree
    doc = Etree.parse(inFileName)
    root = doc.getroot()
    add_nodes(root)
    doc.write(outFileName)

def main():
    args = sys.argv[1:]
    if len(args) != 2:
        print 'usage: python test.py infile.xml outfile.xml'
        sys.exit(-1)
    inFileName = args[0]
    outFileName = args[1]
    if inFileName == outFileName:
        print 'error: in-file and out-file names must be different'
        sys.exit(-1)
    if os.path.exists(outFileName):
        print 'error: out-file already exists'
        sys.exit(-1)
    test(inFileName, outFileName)

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Notes:

The function add_one_person() adds a "person" element to the parent. It adds an "id" attribute to the new element and then adds "name" and "interest" sub-elements.
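
To see what add_one_person() produces without reading or writing any files, here is a small sketch that assumes Python 3 and the standard library's xml.etree.ElementTree in place of lxml or the separate ElementTree package. The build_demo function and the empty "people" root element are made up for illustration.

```python
import xml.etree.ElementTree as ET


def add_one_person(parent, person_id, name, interest):
    """Add a person element with an id attribute and two sub-elements."""
    node = ET.SubElement(parent, 'person')
    node.set('id', person_id)
    ET.SubElement(node, 'name').text = name
    ET.SubElement(node, 'interest').text = interest
    return node


def build_demo():
    """Start from an empty, hypothetical <people/> root and add one person."""
    root = ET.fromstring('<people/>')
    add_one_person(root, '1005', 'Daniel', 'photography')
    return root
```

Passing the result of build_demo() to ET.tostring() shows the serialized form, which is what doc.write() would put into the output file.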