===================================================
 PyXMLFAQ -- Python XML Frequently Asked Questions
===================================================

:Author: Dave Kuhlman
:address: dkuhlman@rexx.com
    http://www.rexx.com/~dkuhlman

:revision: 1.0a
:date: May 23, 2006

:copyright: Copyright (c) 2006 Dave Kuhlman.  All Rights Reserved.
    This software is subject to the provisions of the MIT Open Source
    License http://www.opensource.org/licenses/mit-license.php.

:abstract: This document provides answers to common and frequent
    questions about processing XML with Python.


.. contents:: Questions


Where do I start?
=================

Here are some things you can try (not necessarily in this order):

- Get a book.

- Try a high-level tool like minidom, Lxml, or ElementTree (see
  examples below).

- Download PyXML and install it. (Doing this is really quite easy.) Try
  running a few examples in the demos directory.  Note, however, that
  much of the support in PyXML has now been added to the Python
  standard library.

- Extend one of the demos provided by PyXML. Add some Python logic of
  your own.

Definitions and concepts -- This document answers a few common
questions so that you can do something useful quickly.  It does not
attempt to explain XML from the ground up.  For an explanation of the
XML concepts and abbreviations used here, these are good places to
look:

- The "Cover Pages" at http://xml.coverpages.org/xml.html.

- The XML page at Wikipedia: http://en.wikipedia.org/wiki/Xml


Where can I find Python support and software for XML?
=====================================================

Here are some places to look:

- The Python XML SIG (special interest group):
  http://www.python.org/sigs/xml-sig

- Cover Pages -- Extensible Markup Language (XML):
  http://xml.coverpages.org/xml.html

- Cover Pages -- XML and Python:
  http://xml.coverpages.org/xmlPython.html

- My own Web page contains guides and software. It is at
  http://www.rexx.com/~dkuhlman.

- There is documentation on the XML support in the Python standard
  library at `Structured Markup Processing Tools in the "Python
  Library Reference"`_.

- ElementTree, a DOM-like API, is available here:
  http://effbot.org/zone/element-index.htm

- Lxml, which implements the ElementTree API and also provides XSLT
  and XPath and more, is available here:
  http://codespeak.net/lxml/.

.. _`Structured Markup Processing Tools in the "Python Library Reference"`:
    http://docs.python.org/lib/markup.html


How do I write a simple SAX application?
========================================

Here are some steps and hints:

1. Install PyXML. However, if you have a sufficiently recent version
   of Python, this step may not be necessary. Python now contains a
   good deal of XML support in its standard library. See:
   `Structured Markup Processing Tools in the "Python
   Library Reference"`_.

2. Run an example in the demo directory of PyXML. The script
   ``demo/saxtrace.py`` is a reasonable choice.

3. Copy ``saxtrace.py`` and make a few modifications. For example,
   modify the startElement, endElement, and characters methods in the
   document handler class so that they capture information from a
   specific element in Python data structures.

4. The file ``doc/xml-howto.txt`` in the PyXML distribution contains
   quite a bit of how-to information on SAX.

5. The file ``doc/xml-ref.txt`` documents the Python SAX interface
   implemented by PyXML. Look there to find the definition of the
   handler call-back methods.

And here are the steps you will need to take if you start from scratch:

1. Create a handler class. For details, see
   `How do I write a SAX document handler?`_

2. Create the parser, set the content handler, and start the
   parse. For example, if your SAX handler class is
   ``MySaxDocumentHandler``, this code should work::

       import sys
       from xml.sax import make_parser

       def test(inFileName):
           outFile = sys.stdout
           # Create an instance of the handler.
           handler = MySaxDocumentHandler(outFile)
           # Create an instance of the parser.
           parser = make_parser()
           # Set the content handler.
           parser.setContentHandler(handler)
           inFile = open(inFileName, 'r')
           # Start the parse.
           parser.parse(inFile)
           inFile.close()


How do I write a SAX document handler?
======================================

A SAX handler is an instance of a Python class that implements the SAX
handler interface. (The class can do other things as well, whatever
suits your needs.) When you start the SAX parse, the methods in the
handler instance are called as the parser reaches specific events in
the XML document. Since these methods respond to events, let's call
them event handlers.

In this example, the events that we will be concerned with and that
our example handler will respond to are:

- startDocument -- Called before the parser starts processing the first
  (outer-most) element in the document.

- endDocument -- Called when the parser reaches the end of the document.

- startElement -- Called when the parser finds a start tag.

- endElement -- Called when the parser finds an end tag.

- characters -- Called when the parser finds data content within an
  element.

Here is our example handler class, along with a harness function
``test()`` that uses it::

    import sys, string
    from xml.sax import handler, make_parser

    class MySaxDocumentHandler(handler.ContentHandler):             # [1]
        def __init__(self, outfile):                                # [2]
            self.outfile = outfile
            self.level = 0
            self.inInterest = 0
            self.interestData = []
            self.interestList = []
        def get_interestList(self):
            return self.interestList
        def set_interestList(self, interestList):
            self.interestList = interestList
        def startDocument(self):                                    # [3]
            print "--------  Document Start --------"
        def endDocument(self):                                      # [4]
            print "--------  Document End --------"
        def startElement(self, name, attrs):                        # [5]
            self.level += 1
            self.printLevel()
            self.outfile.write('Element: %s\n' % name)
            self.level += 2
            for attrName in attrs.keys():                           # [6]
                self.printLevel()
                self.outfile.write('Attribute -- Name: %s  Value: %s\n' % \
                    (attrName, attrs.get(attrName)))
            self.level -= 2
            if name == 'interest':
                self.inInterest = 1
                self.interestData = []
        def endElement(self, name):                                 # [7]
            if name == 'interest':
                self.inInterest = 0
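                # Join with an empty separator so that no extra spaces are
                # inserted between the chunks delivered to characters().
                interest = string.join(self.interestData, '')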
                self.printLevel()
                self.outfile.write('Interest: ')
                self.outfile.write(interest)
                self.outfile.write('\n')
                self.interestList.append(interest)
            self.level -= 1
        def characters(self, chrs):                                 # [8]
            if self.inInterest:
                self.interestData.append(chrs)
        def printLevel(self):                                       # [9]
            for idx in range(self.level):
                self.outfile.write('  ')

    def test(inFileName):
        outFile = sys.stdout
        # Create an instance of the Handler.
        handler = MySaxDocumentHandler(outFile)
        # Create an instance of the parser.
        parser = make_parser()
        # Set the content handler.
        parser.setContentHandler(handler)
        inFile = open(inFileName, 'r')
        # Start the parse.
        parser.parse(inFile)                                        # [10]
        # Alternatively, we could directly pass in the file name.
        #parser.parse(inFileName)
        inFile.close()
        # Print out a list of interests.
        interestList = handler.get_interestList()
        print 'Interests:'
        for interest in interestList:
            print '    %s' % (interest, )

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        test(args[0])

    if __name__ == '__main__':
        main()

A sample XML input file is here: `people.xml`_.

Notes on the above code:

- [1] We inherit from ``xml.sax.handler.ContentHandler``, which supplies
  default (do-nothing) implementations of every event handler. This
  removes the need to implement the handlers we do not care about.

- [2] A constructor that initializes several variables for our
  mini-application.

- [3] The event handler startDocument is called when the parse starts
  immediately before the first element.

- [4] The event handler endDocument is called after all elements have
  been processed.

- [5] The event handler startElement is called each time the opening tag
  of an element is reached. It is passed an element name (a string) and
  an object containing each of the attributes on the element (an
  instance of xml.sax.xmlreader.AttributesImpl).

- [6] We list all the attributes on the element by iterating over the
  keys or names of the attributes and retrieving the value associated
  with each attribute name.

- [7] The event handler endElement is called each time the closing tag
  of an element is reached. It receives the element name (a string). In
  our application we concatenate the collected substrings and write them
  out.

- [8] The event handler characters is called for data content. Note that
  it may be called multiple times and you will have to concatenate the
  pieces. In our example, we collect the pieces in a list and later use
  string.join() to concatenate them together.

- [9] The method printLevel is not part of the SAX interface, but is
  used by our mini-application.

- [10] Parse the document.  Events will be delivered to the event
  handler methods in the instance of our `MySaxDocumentHandler`
  class.  Notice that we could have alternatively passed the parser
  the file name or a URL.

Here are some additional notes and tips concerning document handler
classes:

- You do not have to implement all of the event handlers defined in
  the SAX handler interface, and often you will not. In order not to
  have to do so, your handler class should inherit from a handler base
  class, for example `xml.sax.handler.ContentHandler`, which contains
  default implementations of all event handler methods.
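
As a minimal sketch of this point, the handler below overrides only
``startElement`` and relies on ``ContentHandler`` for every other event.
The class name ``ElementCounter`` and the inline sample document are
assumptions made for illustration, and the code is written to run on
Python 3 rather than the Python 2 used elsewhere in this document:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Count start tags by element name; all other events use the defaults."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        # Only this event is overridden. endElement, characters, and the
        # rest fall through to the do-nothing methods of ContentHandler.
        self.counts[name] = self.counts.get(name, 0) + 1


# Hypothetical sample document, used only for this illustration.
SAMPLE = b"""<people>
  <person id="1"><name>Alberta</name></person>
  <person id="2"><name>Bernardo</name></person>
</people>"""

counter = ElementCounter()
xml.sax.parseString(SAMPLE, counter)
for name in sorted(counter.counts):
    print('%s: %d' % (name, counter.counts[name]))
```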


How do I write a simple DOM application?
========================================

Here is a simple DOM application that uses minidom::

    import sys
    from xml.dom import minidom, Node

    def showNode(node):
        if node.nodeType == Node.ELEMENT_NODE:
            print 'Element name: %s' % node.nodeName
            for (name, value) in node.attributes.items():
                print '    Attr -- Name: %s  Value: %s' % (name, value)
            if node.attributes.get('ID') is not None:
                print '    ID: %s' % node.attributes.get('ID').value

    def main():
        doc = minidom.parse(sys.argv[1])
        node = doc.documentElement
        showNode(node)
        for child in node.childNodes:
            showNode(child)

    if __name__ == '__main__':
        main()

A sample XML input file is here: `people.xml`_.
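
For quick experiments it can help to parse a document held in a string
instead of a file. The following is a sketch only: the sample document
and the helper ``element_children()`` are hypothetical, and the code
targets Python 3. It also shows why the application above checks
``nodeType``: minidom keeps the whitespace between elements as text
nodes:

```python
from xml.dom import minidom, Node

# Hypothetical sample document; the real people.xml may differ.
SAMPLE = """<people>
  <person id="1"><name>Alberta</name></person>
  <person id="2"><name>Bernardo</name></person>
</people>"""


def element_children(node):
    """Return only the element children, skipping whitespace text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


doc = minidom.parseString(SAMPLE)
root = doc.documentElement
people = element_children(root)

# Read an attribute and the text of a nested element for each person.
ids = [person.getAttribute('id') for person in people]
names = [person.getElementsByTagName('name')[0].firstChild.nodeValue
         for person in people]
print('%s: %s' % (root.nodeName, ', '.join(names)))
```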


How do I walk a DOM tree and access the information in it?
==========================================================

Here is a sample function with a few notes that follow::

    import sys, string
    from xml.dom import minidom, Node

    def walk(parent, outFile, level):                               # [1]
        for node in parent.childNodes:
            if node.nodeType == Node.ELEMENT_NODE:
                # Write out the element name.
                printLevel(outFile, level)
                outFile.write('Element: %s\n' % node.nodeName)
                # Write out the attributes.
                attrs = node.attributes                             # [2]
                for attrName in attrs.keys():
                    attrNode = attrs.get(attrName)
                    attrValue = attrNode.nodeValue
                    printLevel(outFile, level + 2)
                    outFile.write('Attribute -- Name: %s  Value: %s\n' % \
                        (attrName, attrValue))
                # Walk over any text nodes in the current node.
                content = []                                        # [3]
                for child in node.childNodes:
                    if child.nodeType == Node.TEXT_NODE:
                        content.append(child.nodeValue)
                if content:
                    strContent = string.join(content)
                    printLevel(outFile, level)
                    outFile.write('Content: "')
                    outFile.write(strContent)
                    outFile.write('"\n')
                # Walk the child nodes.
                walk(node, outFile, level+1)                        # [4]

    def printLevel(outFile, level):
        for idx in range(level):
            outFile.write('    ')

    def run(inFileName):                                            # [5]
        outFile = sys.stdout
        doc = minidom.parse(inFileName)
        rootNode = doc.documentElement
        level = 0
        walk(rootNode, outFile, level)

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        run(args[0])


    if __name__ == '__main__':
        main()

A sample XML input file is here: `people.xml`_.

Here are some notes and explanations:

- [1] Walk the tree.

- [2] Iterate over each attribute name (the keys) in the node. For each
  attribute name, get the attribute node. Then get its value. A
  shorter form is: "attrs.get(attrName).nodeValue".

- [3] Accumulate the data content from any text nodes that are
  immediately under the current node.

- [4] Recursively walk over nested (child) nodes of the current node.

- [5] Parse the document and walk the DOM tree.

Now, here is a version that uses generators, which were introduced in
Python 2.2::

    # from __future__ import generators                             # [1]

    import sys, string
    from xml.dom import minidom, Node

    def walkTree(node):                                             # [2]
        if node.nodeType == Node.ELEMENT_NODE:
            yield node
            for child in node.childNodes:
                for n1 in walkTree(child):
                    yield n1

    def showNode(node, outFile):                                    # [3]
        outFile.write('=' * 50)
        outFile.write('\n')
        # Write out the element name.
        outFile.write('Element: %s\n' % node.nodeName)
        # Write out the attributes.
        attrs = node.attributes
        for attrName in attrs.keys():
            outFile.write('Attribute -- Name: %s  Value: %s\n' % \
                (attrName, attrs.get(attrName).nodeValue))
        # Walk over any text nodes in the current node.
        content = []
        for child in node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                content.append(child.nodeValue)
        if content:
            strContent = string.join(content)
            outFile.write('Content: "')
            outFile.write(strContent)
            outFile.write('"\n')

    def test(inFileName):                                           # [4]
        outFile = sys.stdout
        doc = minidom.parse(inFileName)
        rootNode = doc.documentElement
        level = 0
        for node in walkTree(rootNode):                             # [5]
            showNode(node, outFile)

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        test(args[0])

    if __name__ == '__main__':
        main()

A sample XML input file is here: `people.xml`_.

Some notes on the above code:

- [1] In Python 2.2, generators must be enabled with this ``__future__``
  import. The line is commented out because Python 2.3 and later enable
  generators by default.

- [2] Generate all and only the nodes in the DOM tree that are element
  nodes.

- [3] Show a single element node.

- [4] Parse the XML document.

- [5] Iterate over all the nodes in the document tree and show each
  one.

A few more notes:

- The version that uses a generator delivers all element nodes with no
  indication of the depth of nesting. If you need to know who is
  nested inside what, use the non-generator version.

- The generator version shown here delivers parent nodes before child
  nodes. If you want child nodes before parent nodes, then use the
  following walkTree function instead::

    def walkTree(node):
        if node.nodeType == Node.ELEMENT_NODE:
            for child in node.childNodes:
                for n1 in walkTree(child):
                    yield n1
            yield node

  Note how this function yields the current node after walking the child
  nodes.
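
If you want the convenience of a generator and still need the nesting
depth, one possible variation is to yield the depth along with each
node. This is a sketch under assumptions: the name ``walk_with_depth``
and the inline sample document are made up for illustration, and the
code is written for Python 3:

```python
from xml.dom import minidom, Node


def walk_with_depth(node, depth=0):
    """Yield (element, depth) pairs, parents before children."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield node, depth
        for child in node.childNodes:
            for pair in walk_with_depth(child, depth + 1):
                yield pair


# Hypothetical sample document, used only for this illustration.
SAMPLE = "<people><person><name>Alberta</name></person></people>"

doc = minidom.parseString(SAMPLE)
outline = [(node.nodeName, depth)
           for node, depth in walk_with_depth(doc.documentElement)]

# Print an indented outline of the element names.
for name, depth in outline:
    print('%s%s' % ('    ' * depth, name))
```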


Are there other DOM implementations for Python?
===============================================

There are at least two other DOM-like implementations for Python, both
of which seem quite good:

- ElementTree -- See: `ElementTree Overview
  <http://effbot.org/zone/element-index.htm>`_

- Lxml -- 

      "lxml is a Pythonic binding for the libxml2 and libxslt
      libraries. See the introduction for more information about
      background and goals.

      "lxml follows the ElementTree API as much as possible, building
      it on top of the native libxml2 tree. See also the ElementTree
      compatibility overview.

      "lxml also extends this API to expose libxml2 and libxslt
      specific functionality, such as XPath, Relax NG, XML Schema,
      XSLT, and c14n. Python code can be called from XPath expressions
      and XSLT stylesheets through the use of extension functions.

      "In addition to the ElementTree API, lxml also features an API
      for implementing namespaces using tag specific element
      classes. This is a simple way to write arbitrary XML driven APIs
      on top of lxml.

      "lxml also offers a SAX compliant API, that works with the SAX
      support in the standard library."

  Learn about Lxml here: http://codespeak.net/lxml/

Here is an example using the Lxml DOM-like ElementTree API::

    import sys
    #from lxml import etree                                     # [1]
    import elementtree.ElementTree as etree

    def walk_tree(node, level):
        fill = show_level(level)
        print '%sElement name: %s' % (fill, node.tag, )
        for (name, value) in node.attrib.items():
            print '%s    Attr -- Name: %s  Value: %s' % (fill, name, value,)
        if node.attrib.get('ID') is not None:
            # In ElementTree, attribute values are plain strings.
            print '%s    ID: %s' % (fill, node.attrib.get('ID'), )
        children = node.getchildren()
        for child in children:
            walk_tree(child, level + 1)

    def show_level(level):
        s1 = '    ' * level
        return s1

    def test(inFileName):
        doc = etree.parse(inFileName)
        root = doc.getroot()
        walk_tree(root, 0)

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        test(args[0])

    if __name__ == '__main__':
        main()

A sample XML input file is here: `people.xml`_.

Notes:

- [1] Since the DOM API of Lxml follows that of ElementTree, we can
  switch from using one to the other by changing our import statement.
  In addition, we can use code like the following to use Lxml if it is
  installed, and use ElementTree if Lxml is not installed::

      try:
          from lxml import etree as ElementTree
          #print '*** using lxml'
      except ImportError, e:
          try:
              from elementtree import ElementTree
              #print '*** using ElementTree'
          except ImportError, e:
              print '***'
              print '*** Error: Must install either ElementTree or lxml.'
              print '***'
              raise ImportError, 'must install either ElementTree or lxml'
      doc = ElementTree.parse(specfilename)
      root = doc.getroot()
      # ... etc ...
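
Since Python 2.5 the ElementTree API has also shipped in the standard
library as ``xml.etree.ElementTree``, so the fallback no longer needs
the separately installed ``elementtree`` package. The following sketch
(written for Python 3, with a hypothetical inline document standing in
for a file) tries lxml first and falls back to the standard library:

```python
try:
    # Preferred when installed: lxml's ElementTree-compatible API.
    from lxml import etree as ElementTree
except ImportError:
    # Standard-library fallback (available since Python 2.5).
    import xml.etree.ElementTree as ElementTree

# Hypothetical inline document standing in for a file on disk.
root = ElementTree.fromstring('<people><person id="1"/></people>')

# The same calls work with either implementation.
ids = [person.get('id') for person in root.findall('person')]
print(ids)
```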


How do I use DOM to add nodes to an existing XML document?
==========================================================

Note: This example uses lxml if it is installed, and, if lxml can't be
found, uses ElementTree.

The following example reads and parses an XML document, creates a DOM
tree, adds several nodes to that tree, then writes the modified tree
back out to an XML document::

    import sys
    import os

    Etree = None

    def add_one_person(parent, id, name, interest):
        """Add a person element to the parent node.

        Give the new element an id attribute and name and interest
        sub-elements.
        """
        node = Etree.SubElement(parent, 'person')
        node.set('id', id)
        node1 = Etree.SubElement(node, 'name')
        node1.text = name
        node1 = Etree.SubElement(node, 'interest')
        node1.text = interest
        return node

    def add_nodes(root):
        """Add several sub-elements to the root element.
        """
        add_one_person(root, '1005', 'Daniel', 'photography')
        add_one_person(root, '1006', 'Edward', 'gardening')

    def test(inFileName, outFileName):
        global Etree
        try:
            from lxml import etree as ElementTree
            #print '*** using lxml'
        except ImportError, e:
            try:
                from elementtree import ElementTree
                #print '*** using ElementTree'
            except ImportError, e:
                print '***'
                print '*** Error: Must install either ElementTree or lxml.'
                print '***'
                raise ImportError, 'must install either ElementTree or lxml'
        Etree = ElementTree
        doc = Etree.parse(inFileName)
        root = doc.getroot()
        add_nodes(root)
        doc.write(outFileName)

    def main():
        args = sys.argv[1:]
        if len(args) != 2:
            print 'usage: python test.py infile.xml outfile.xml'
            sys.exit(-1)
        inFileName = args[0]
        outFileName = args[1]
        if inFileName == outFileName:
            print 'error: in-file and out-file names must be different'
            sys.exit(-1)
        if os.path.exists(outFileName):
            print 'error: out-file already exists'
            sys.exit(-1)
        test(inFileName, outFileName)

    if __name__ == '__main__':
        main()

A sample XML input file is here: `people.xml`_.


Notes:

- The function ``add_one_person()`` adds a "person" element to the
  parent.  It adds an "id" attribute to the new element and then adds
  "name" and "interest" sub-elements.
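
To see the effect of ``add_one_person()`` without touching any files,
the same idea can be tried on a document held in memory. This is a
sketch only: it assumes the standard-library ``xml.etree.ElementTree``
module and Python 3, and the starting document is hypothetical:

```python
import xml.etree.ElementTree as ET


def add_one_person(parent, person_id, name, interest):
    """Append a person element with name and interest sub-elements."""
    person = ET.SubElement(parent, 'person')
    person.set('id', person_id)
    ET.SubElement(person, 'name').text = name
    ET.SubElement(person, 'interest').text = interest
    return person


# Hypothetical starting document; the real people.xml may differ.
root = ET.fromstring(
    '<people><person id="1004"><name>Carla</name></person></people>')
add_one_person(root, '1005', 'Daniel', 'photography')

# Serialize to a string so the result can be inspected without a file.
result = ET.tostring(root).decode('ascii')
print(result)
```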


.. _`people.xml`:
    people.xml