PyXMLFAQ -- Python XML Frequently Asked Questions

Author: Dave Kuhlman
Address:
dkuhlman@rexx.com
http://www.rexx.com/~dkuhlman
Revision: 1.0a
Date: May 23, 2006
Copyright: Copyright (c) 2006 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT Open Source License http://www.opensource.org/licenses/mit-license.php.

Abstract

This document provides answers to common and frequent questions about processing XML with Python.

Questions

Where do I start?

Here are some things you can try (not necessarily in this order):

Definitions and concepts -- This document attempts to answer a few common questions, and by doing so, hopefully will enable you to do something useful quickly. This document does not attempt to explain XML from the ground up. For an explanation of XML concepts and abbreviations used in this document, here are several good places to look:

Where can I find Python support and software for XML?

And here is a list:

How do I write a simple SAX application?

Here are some steps and hints:

  1. Install PyXML. However, if you have a sufficiently recent version of Python, this step may not be necessary. Python now contains a good deal of XML support in it's standard library. See: Structured Markup Processing Tools in the "Python Library Reference".
  2. Run an example in the demo directory of PyXML. The demo/saxtrace.py is a reasonable one.
  3. Copy saxtrace.py and make a few modifications. For example, modify the startElement, endElement, and characters methods in the document handler class so that they capture information from a specific element in a Python structures.
  4. The file doc/xml-howto.txt in the PyXML distribution contains quite a bit of how-to information on SAX.
  5. The file doc/xml-ref.txt documents the Python SAX interface implemented by PyXML. Look there to find the definition of the handler call-back methods.

And, here are the steps you will need to make if you start from scratch:

  1. Create a handler class. For details, see How do I write a SAX document handler?

  2. Create the parser, set the document handler, and start the parse. For example, if your SAX handler class is MySaxDocHandler, this code should work:

    def test(inFileName):
        outFile = sys.stdout
        # Create an instance of the Handler.
        handler = MySaxDocumentHandler(outFile)
        # Create an instance of the parser.
        parser = saxexts.make_parser()
        # Set the document handler.
        parser.setDocumentHandler(handler)
        inFile = open(inFileName, 'r')
        # Start the parse.
        parser.parseFile(inFile)
        inFile.close()
    

How do I write a SAX document handler?

A SAX handler is an instance of a Python class that implements the SAX handler interface. (Note that it can do other things as well. Whatever suits your needs.) When you start the SAX parse, the methods in the handler instance are called as the parser reaches specific events in the XML document. Since these methods respond to events, let's call them event handlers.

In this example, the events that we will be concerned with and that our example handler will respond to are:

And, here is our example handler class and also a harness function test() that uses it:

import sys, string
from xml.sax import handler, make_parser

class MySaxDocumentHandler(handler.ContentHandler):             # [1]
    def __init__(self, outfile):                                # [2]
        self.outfile = outfile
        self.level = 0
        self.inInterest = 0
        self.interestData = []
        self.interestList = []
    def get_interestList(self):
        return self.interestList
    def set_interestList(self, interestList):
        self.interestList = interestList
    def startDocument(self):                                    # [3]
        print "--------  Document Start --------"
    def endDocument(self):                                      # [4]
        print "--------  Document End --------"
    def startElement(self, name, attrs):                        # [5]
        self.level += 1
        self.printLevel()
        self.outfile.write('Element: %s\n' % name)
        self.level += 2
        for attrName in attrs.keys():                           # [6]
            self.printLevel()
            self.outfile.write('Attribute -- Name: %s  Value: %s\n' % \
                (attrName, attrs.get(attrName)))
        self.level -= 2
        if name == 'interest':
            self.inInterest = 1
            self.interestData = []
    def endElement(self, name):                                 # [7]
        if name == 'interest':
            self.inInterest = 0
            interest = string.join(self.interestData)
            self.printLevel()
            self.outfile.write('Interest: ')
            self.outfile.write(interest)
            self.outfile.write('\n')
            self.interestList.append(interest)
        self.level -= 1
    def characters(self, chrs):                                 # [8]
        if self.inInterest:
            self.interestData.append(chrs)
    def printLevel(self):                                       # [9]
        for idx in range(self.level):
            self.outfile.write('  ')

def test(inFileName):
    outFile = sys.stdout
    # Create an instance of the Handler.
    handler = MySaxDocumentHandler(outFile)
    # Create an instance of the parser.
    parser = make_parser()
    # Set the content handler.
    parser.setContentHandler(handler)
    inFile = open(inFileName, 'r')
    # Start the parse.
    parser.parse(inFile)                                        # [10]
    # Alternatively, we could directly pass in the file name.
    #parser.parse(inFileName)
    inFile.close()
    # Print out a list of interests.
    interestList = handler.get_interestList()
    print 'Interests:'
    for interest in interestList:
        print '    %s' % (interest, )

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    test(args[0])

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Notes on the above code:

Here are some additional notes and tips concerning document handler classes:

How do I write a simple DOM application?

Here are some steps and hints:

Here is a simple DOM application that uses minidom:

import sys
from xml.dom import minidom, Node

def showNode(node):
    if node.nodeType == Node.ELEMENT_NODE:
        print 'Element name: %s' % node.nodeName
        for (name, value) in node.attributes.items():
            print '    Attr -- Name: %s  Value: %s' % (name, value)
        if node.attributes.get('ID') is not None:
            print '    ID: %s' % node.attributes.get('ID').value

def main():
    doc = minidom.parse(sys.argv[1])
    node = doc.documentElement
    showNode(node)
    for child in node.childNodes:
        showNode(child)

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

How do I walk a DOM tree and access the information in it?

Here is a sample function with a few notes that follow:

import sys, string
from xml.dom import minidom, Node

def walk(parent, outFile, level):                               # [1]
    for node in parent.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            # Write out the element name.
            printLevel(outFile, level)
            outFile.write('Element: %s\n' % node.nodeName)
            # Write out the attributes.
            attrs = node.attributes                             # [2]
            for attrName in attrs.keys():
                attrNode = attrs.get(attrName)
                attrValue = attrNode.nodeValue
                printLevel(outFile, level + 2)
                outFile.write('Attribute -- Name: %s  Value: %s\n' % \
                    (attrName, attrValue))
            # Walk over any text nodes in the current node.
            content = []                                        # [3]
            for child in node.childNodes:
                if child.nodeType == Node.TEXT_NODE:
                    content.append(child.nodeValue)
            if content:
                strContent = string.join(content)
                printLevel(outFile, level)
                outFile.write('Content: "')
                outFile.write(strContent)
                outFile.write('"\n')
            # Walk the child nodes.
            walk(node, outFile, level+1)

def printLevel(outFile, level):
    for idx in range(level):
        outFile.write('    ')

def run(inFileName):                                            # [5]
    outFile = sys.stdout
    doc = minidom.parse(inFileName)
    rootNode = doc.documentElement
    level = 0
    walk(rootNode, outFile, level)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    run(args[0])


if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Here is some notes and explanation:

Now, here is a version that uses generators, which are new to Python 2.2:

# from __future__ import generators                             # [1]

import sys, string
from xml.dom import minidom, Node

def walkTree(node):                                             # [2]
    if node.nodeType == Node.ELEMENT_NODE:
        yield node
        for child in node.childNodes:
            for n1 in walkTree(child):
                yield n1

def showNode(node, outFile):                                    # [3]
    outFile.write('=' * 50)
    outFile.write('\n')
    # Write out the element name.
    outFile.write('Element: %s\n' % node.nodeName)
    # Write out the attributes.
    attrs = node.attributes
    for attrName in attrs.keys():
        outFile.write('Attribute -- Name: %s  Value: %s\n' % \
            (attrName, attrs.get(attrName).nodeValue))
    # Walk over any text nodes in the current node.
    content = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            content.append(child.nodeValue)
    if content:
        strContent = string.join(content)
        outFile.write('Content: "')
        outFile.write(strContent)
        outFile.write('"\n')

def test(inFileName):                                           # [4]
    outFile = sys.stdout
    doc = minidom.parse(inFileName)
    rootNode = doc.documentElement
    level = 0
    for node in walkTree(rootNode):                             # [5]
        showNode(node, outFile)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    test(args[0])

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Some notes on the above code:

A few more notes:

Are there other DOM implementations for Python?

There are several other DOM-like implementations for Python, both of which seem quite good:

Here is an example using the Lxml DOM-like ElementTree API:

import sys
#from lxml import etree                                     # [1]
import elementtree.ElementTree as etree

def walk_tree(node, level):
    fill = show_level(level)
    print '%sElement name: %s' % (fill, node.tag, )
    for (name, value) in node.attrib.items():
        print '%s    Attr -- Name: %s  Value: %s' % (fill, name, value,)
    if node.attrib.get('ID') is not None:
        print '%s    ID: %s' % (fill, node.attrib.get('ID').value, )
    children = node.getchildren()
    for child in children:
        walk_tree(child, level + 1)

def show_level(level):
    s1 = '    ' * level
    return s1

def test(inFileName):
    doc = etree.parse(inFileName)
    root = doc.getroot()
    walk_tree(root, 0)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        print 'usage: python test.py infile.xml'
        sys.exit(-1)
    test(args[0])

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Notes:

How do I use DOM to add nodes to an existing XML document?

Note: This example uses lxml if it is installed, and, if lxml can't be found, uses ElementTree.

The following example reads and parses an XML document, creates a DOM tree, adds several nodes to that tree, then writes the modified tree back out to an XML document:

import sys
import os

Etree = None

def add_one_person(parent, id, name, interest):
    """Add a person element to the parent node.

    Give the new element an id attribute and name and interest
    sub-elements.
    """
    node = Etree.SubElement(parent, 'person')
    node.set('id', id)
    node1 = Etree.SubElement(node, 'name')
    node1.text = name
    node1 = Etree.SubElement(node, 'interest')
    node1.text = interest
    return node

def add_nodes(root):
    """Add several sub-elements to the root element.
    """
    add_one_person(root, '1005', 'Daniel', 'photography')
    add_one_person(root, '1006', 'Edward', 'gardening')

def test(inFileName, outFileName):
    global Etree
    try:
        from lxml import etree as ElementTree
        #print '*** using lxml'
    except ImportError, e:
        try:
            from elementtree import ElementTree
            #print '*** using ElementTree'
        except ImportError, e:
            print '***'
            print '*** Error: Must install either ElementTree or lxml.'
            print '***'
            raise ImportError, 'must install either ElementTree or lxml'
    Etree = ElementTree
    doc = Etree.parse(inFileName)
    root = doc.getroot()
    add_nodes(root)
    doc.write(outFileName)

def main():
    args = sys.argv[1:]
    if len(args) != 2:
        print 'usage: python test.py infile.xml outfile.xml'
        sys.exit(-1)
    inFileName = args[0]
    outFileName = args[1]
    if inFileName == outFileName:
        print 'error: in-file and out-file names must be different'
        sys.exit(-1)
    if os.path.exists(outFileName):
        print 'error: out-file already exists'
        sys.exit(-1)
    test(inFileName, outFileName)

if __name__ == '__main__':
    main()

A sample XML input file is here: people.xml.

Notes: