Author: | Dave Kuhlman |
---|---|
Address: | dkuhlman@rexx.com http://www.rexx.com/~dkuhlman |
Revision: | 1.0a |
Date: | May 23, 2006 |
Copyright: | Copyright (c) 2006 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT Open Source License http://www.opensource.org/licenses/mit-license.php. |
Abstract
This document answers frequently asked questions about processing XML with Python.
Here are some things you can try (not necessarily in this order):
Definitions and concepts -- This document attempts to answer a few common questions, and by doing so, hopefully will enable you to do something useful quickly. This document does not attempt to explain XML from the ground up. For an explanation of XML concepts and abbreviations used in this document, here are several good places to look:
And here is a list:
Here are some steps and hints:
And here are the steps you will need to take if you start from scratch:
Create a handler class. For details, see How do I write a SAX document handler?
Create the parser, set the content handler, and start the parse. For example, if your SAX handler class is MySaxDocumentHandler, this code should work:

    import sys
    from xml.sax import make_parser

    def test(inFileName):
        outFile = sys.stdout
        # Create an instance of the handler.
        handler = MySaxDocumentHandler(outFile)
        # Create an instance of the parser.
        parser = make_parser()
        # Set the content handler.
        parser.setContentHandler(handler)
        inFile = open(inFileName, 'r')
        # Start the parse.
        parser.parse(inFile)
        inFile.close()
A SAX handler is an instance of a Python class that implements the SAX handler interface. (Note that it can do other things as well. Whatever suits your needs.) When you start the SAX parse, the methods in the handler instance are called as the parser reaches specific events in the XML document. Since these methods respond to events, let's call them event handlers.
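To make the event-driven idea concrete, here is a minimal sketch, separate from the example in this document. It is written for Python 3 (an assumption; the rest of this document uses Python 2 syntax), and the names EventRecorder and record_events are hypothetical. The handler does nothing but record each event the parser reports:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventRecorder(ContentHandler):
    """Hypothetical handler that records each SAX event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called when the parser reaches an opening tag.
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        # Called when the parser reaches a closing tag.
        self.events.append(('end', name))

    def characters(self, content):
        # May be called more than once for a single run of text.
        text = content.strip()
        if text:
            self.events.append(('chars', text))


def record_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = EventRecorder()
    xml.sax.parseString(xml_text.encode('utf-8'), handler)
    return handler.events
```

Calling record_events('<person id="1"><name>Alice</name></person>') returns the events in document order, which shows that the handler methods are driven by the parser rather than called by your own code.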
In the example below, the events that concern us, and that our handler responds to, are:
And, here is our example handler class and also a harness function test() that uses it:
    import sys, string
    from xml.sax import handler, make_parser

    class MySaxDocumentHandler(handler.ContentHandler):            # [1]
        def __init__(self, outfile):                               # [2]
            self.outfile = outfile
            self.level = 0
            self.inInterest = 0
            self.interestData = []
            self.interestList = []
        def get_interestList(self):
            return self.interestList
        def set_interestList(self, interestList):
            self.interestList = interestList
        def startDocument(self):                                   # [3]
            print "-------- Document Start --------"
        def endDocument(self):                                     # [4]
            print "-------- Document End --------"
        def startElement(self, name, attrs):                       # [5]
            self.level += 1
            self.printLevel()
            self.outfile.write('Element: %s\n' % name)
            self.level += 2
            for attrName in attrs.keys():                          # [6]
                self.printLevel()
                self.outfile.write('Attribute -- Name: %s Value: %s\n' % \
                    (attrName, attrs.get(attrName)))
            self.level -= 2
            if name == 'interest':
                self.inInterest = 1
                self.interestData = []
        def endElement(self, name):                                # [7]
            if name == 'interest':
                self.inInterest = 0
                interest = string.join(self.interestData)
                self.printLevel()
                self.outfile.write('Interest: ')
                self.outfile.write(interest)
                self.outfile.write('\n')
                self.interestList.append(interest)
            self.level -= 1
        def characters(self, chrs):                                # [8]
            if self.inInterest:
                self.interestData.append(chrs)
        def printLevel(self):                                      # [9]
            for idx in range(self.level):
                self.outfile.write(' ')

    def test(inFileName):
        outFile = sys.stdout
        # Create an instance of the Handler.
        handler = MySaxDocumentHandler(outFile)
        # Create an instance of the parser.
        parser = make_parser()
        # Set the content handler.
        parser.setContentHandler(handler)
        inFile = open(inFileName, 'r')
        # Start the parse.
        parser.parse(inFile)                                       # [10]
        # Alternatively, we could directly pass in the file name.
        #parser.parse(inFileName)
        inFile.close()
        # Print out a list of interests.
        interestList = handler.get_interestList()
        print 'Interests:'
        for interest in interestList:
            print ' %s' % (interest, )

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        test(args[0])

    if __name__ == '__main__':
        main()
A sample XML input file is here: people.xml.
Notes on the above code:
Here are some additional notes and tips concerning document handler classes:
Here are some steps and hints:
Here is a simple DOM application that uses minidom:
    import sys
    from xml.dom import minidom, Node

    def showNode(node):
        if node.nodeType == Node.ELEMENT_NODE:
            print 'Element name: %s' % node.nodeName
            for (name, value) in node.attributes.items():
                print ' Attr -- Name: %s Value: %s' % (name, value)
            if node.attributes.get('ID') is not None:
                print ' ID: %s' % node.attributes.get('ID').value

    def main():
        doc = minidom.parse(sys.argv[1])
        node = doc.documentElement
        showNode(node)
        for child in node.childNodes:
            showNode(child)

    if __name__ == '__main__':
        main()
A sample XML input file is here: people.xml.
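The same idea can be tried without an input file by parsing a string. The following is a minimal sketch in Python 3 syntax (an assumption; the listing above is Python 2). The function name describe_elements is hypothetical. It collects the name and attributes of the root element and of each element directly beneath it:

```python
from xml.dom import minidom, Node


def describe_elements(xml_text):
    """Return (name, attributes) for the root and its element children."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    described = []
    for node in [root] + list(root.childNodes):
        # Skip text nodes, comments, and other non-element nodes.
        if node.nodeType == Node.ELEMENT_NODE:
            attrs = dict(node.attributes.items())
            described.append((node.nodeName, attrs))
    return described
```

Note that whitespace between elements shows up as text nodes in childNodes, which is why the nodeType check is needed.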
Here is a sample function with a few notes that follow:
    import sys, string
    from xml.dom import minidom, Node

    def walk(parent, outFile, level):                              # [1]
        for node in parent.childNodes:
            if node.nodeType == Node.ELEMENT_NODE:
                # Write out the element name.
                printLevel(outFile, level)
                outFile.write('Element: %s\n' % node.nodeName)
                # Write out the attributes.
                attrs = node.attributes                            # [2]
                for attrName in attrs.keys():
                    attrNode = attrs.get(attrName)
                    attrValue = attrNode.nodeValue
                    printLevel(outFile, level + 2)
                    outFile.write('Attribute -- Name: %s Value: %s\n' % \
                        (attrName, attrValue))
                # Walk over any text nodes in the current node.
                content = []                                       # [3]
                for child in node.childNodes:
                    if child.nodeType == Node.TEXT_NODE:
                        content.append(child.nodeValue)
                if content:
                    strContent = string.join(content)
                    printLevel(outFile, level)
                    outFile.write('Content: "')
                    outFile.write(strContent)
                    outFile.write('"\n')
                # Walk the child nodes.
                walk(node, outFile, level+1)

    def printLevel(outFile, level):
        for idx in range(level):
            outFile.write(' ')

    def run(inFileName):                                           # [5]
        outFile = sys.stdout
        doc = minidom.parse(inFileName)
        rootNode = doc.documentElement
        level = 0
        walk(rootNode, outFile, level)

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        run(args[0])

    if __name__ == '__main__':
        main()
A sample XML input file is here: people.xml.
Here are some notes and explanations:
Now, here is a version that uses generators, which were introduced in Python 2.2:
    # from __future__ import generators                            # [1]
    import sys, string
    from xml.dom import minidom, Node

    def walkTree(node):                                            # [2]
        if node.nodeType == Node.ELEMENT_NODE:
            yield node
            for child in node.childNodes:
                for n1 in walkTree(child):
                    yield n1

    def showNode(node, outFile):                                   # [3]
        outFile.write('=' * 50)
        outFile.write('\n')
        # Write out the element name.
        outFile.write('Element: %s\n' % node.nodeName)
        # Write out the attributes.
        attrs = node.attributes
        for attrName in attrs.keys():
            outFile.write('Attribute -- Name: %s Value: %s\n' % \
                (attrName, attrs.get(attrName).nodeValue))
        # Walk over any text nodes in the current node.
        content = []
        for child in node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                content.append(child.nodeValue)
        if content:
            strContent = string.join(content)
            outFile.write('Content: "')
            outFile.write(strContent)
            outFile.write('"\n')

    def test(inFileName):                                          # [4]
        outFile = sys.stdout
        doc = minidom.parse(inFileName)
        rootNode = doc.documentElement
        level = 0
        for node in walkTree(rootNode):                            # [5]
            showNode(node, outFile)

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        test(args[0])

    if __name__ == '__main__':
        main()
A sample XML input file is here: people.xml.
Some notes on the above code:
A few more notes:
The version that uses a generator delivers all element nodes with no indication of the depth of nesting. If you need to know who is nested inside what, use the non-generator version.
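If you want a generator and the nesting depth as well, one possible approach is to yield the depth along with each node. The following is a minimal sketch in Python 3 syntax (an assumption; it relies on "yield from", which Python 2 lacks). The names walk_tree_with_depth and element_depths are hypothetical:

```python
from xml.dom import minidom, Node


def walk_tree_with_depth(node, depth=0):
    """Yield (element, depth) pairs, parents before children."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield node, depth
        for child in node.childNodes:
            # Each level of recursion adds one to the depth.
            yield from walk_tree_with_depth(child, depth + 1)


def element_depths(xml_text):
    """Return (element name, depth) for every element in xml_text."""
    doc = minidom.parseString(xml_text)
    return [(node.nodeName, depth)
            for node, depth in walk_tree_with_depth(doc.documentElement)]
```

The caller can use the depth to indent output, much as the non-generator version does with its level argument.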
The generator version shown here delivers parent nodes before child nodes. If you want child nodes before parent nodes, then use the following walkTree function instead:
    def walkTree(node):
        if node.nodeType == Node.ELEMENT_NODE:
            for child in node.childNodes:
                for n1 in walkTree(child):
                    yield n1
            yield node

Note how this function yields the current node after walking the child nodes.
There are two other DOM-like implementations for Python, both of which seem quite good:
ElementTree -- See: ElementTree Overview
Lxml --
"lxml is a Pythonic binding for the libxml2 and libxslt libraries. See the introduction for more information about background and goals."
"lxml follows the ElementTree API as much as possible, building it on top of the native libxml2 tree. See also the ElementTree compatibility overview.
"lxml also extends this API to expose libxml2 and libxslt specific functionality, such as XPath, Relax NG, XML Schema, XSLT, and c14n. Python code can be called from XPath expressions and XSLT stylesheets through the use of extension functions.
"In addition to the ElementTree API, lxml also features an API for implementing namespaces using tag specific element classes. This is a simple way to write arbitrary XML driven APIs on top of lxml.
"lxml also offers a SAX compliant API, that works with the SAX support in the standard library."
Learn about Lxml here: http://codespeak.net/lxml/
Here is an example using the Lxml DOM-like ElementTree API:
    import sys
    #from lxml import etree                                        # [1]
    import elementtree.ElementTree as etree

    def walk_tree(node, level):
        fill = show_level(level)
        print '%sElement name: %s' % (fill, node.tag, )
        for (name, value) in node.attrib.items():
            print '%s Attr -- Name: %s Value: %s' % (fill, name, value,)
        # attrib.get() returns the attribute value as a plain string.
        if node.attrib.get('ID') is not None:
            print '%s ID: %s' % (fill, node.attrib.get('ID'), )
        children = node.getchildren()
        for child in children:
            walk_tree(child, level + 1)

    def show_level(level):
        s1 = ' ' * level
        return s1

    def test(inFileName):
        doc = etree.parse(inFileName)
        root = doc.getroot()
        walk_tree(root, 0)

    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            print 'usage: python test.py infile.xml'
            sys.exit(-1)
        test(args[0])

    if __name__ == '__main__':
        main()
A sample XML input file is here: people.xml.
Notes:
[1] Since the DOM API of Lxml follows that of ElementTree, we can switch from using one to the other by changing our import statement. In addition, we can use code like the following to use Lxml if it is installed, and use ElementTree if Lxml is not installed:
    try:
        from lxml import etree as ElementTree
        #print '*** using lxml'
    except ImportError, e:
        try:
            from elementtree import ElementTree
            #print '*** using ElementTree'
        except ImportError, e:
            print '***'
            print '*** Error: Must install either ElementTree or lxml.'
            print '***'
            raise ImportError, 'must install either ElementTree or lxml'

    doc = ElementTree.parse(specfilename)
    root = doc.getroot()
    # ... etc ...
Note: This example uses lxml if it is installed, and, if lxml can't be found, uses ElementTree.
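On Python 2.5 and later, ElementTree is part of the standard library as xml.etree.ElementTree, so the fallback no longer needs a separate install. The following is a minimal sketch of the same pattern in Python 3 syntax (an assumption; the listing above is Python 2). The function name root_tag is hypothetical:

```python
try:
    # Preferred when available; lxml is a third-party package.
    from lxml import etree
except ImportError:
    # Standard-library fallback with a largely compatible API.
    import xml.etree.ElementTree as etree


def root_tag(xml_text):
    """Parse xml_text with whichever library was imported."""
    root = etree.fromstring(xml_text)
    return root.tag
```

Code written against the shared part of the API, such as fromstring(), parse(), and getroot(), should behave the same with either import. Features specific to lxml will not be available under the fallback.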
The following example reads and parses an XML document, creates a DOM tree, adds several nodes to that tree, then writes the modified tree back out to an XML document:
    import sys
    import os

    Etree = None

    def add_one_person(parent, id, name, interest):
        """Add a person element to the parent node.
        Give the new element an id attribute and name and
        interest sub-elements.
        """
        node = Etree.SubElement(parent, 'person')
        node.set('id', id)
        node1 = Etree.SubElement(node, 'name')
        node1.text = name
        node1 = Etree.SubElement(node, 'interest')
        node1.text = interest
        return node

    def add_nodes(root):
        """Add several sub-elements to the root element.
        """
        add_one_person(root, '1005', 'Daniel', 'photography')
        add_one_person(root, '1006', 'Edward', 'gardening')

    def test(inFileName, outFileName):
        global Etree
        try:
            from lxml import etree as ElementTree
            #print '*** using lxml'
        except ImportError, e:
            try:
                from elementtree import ElementTree
                #print '*** using ElementTree'
            except ImportError, e:
                print '***'
                print '*** Error: Must install either ElementTree or lxml.'
                print '***'
                raise ImportError, 'must install either ElementTree or lxml'
        Etree = ElementTree
        doc = Etree.parse(inFileName)
        root = doc.getroot()
        add_nodes(root)
        doc.write(outFileName)

    def main():
        args = sys.argv[1:]
        if len(args) != 2:
            print 'usage: python test.py infile.xml outfile.xml'
            sys.exit(-1)
        inFileName = args[0]
        outFileName = args[1]
        if inFileName == outFileName:
            print 'error: in-file and out-file names must be different'
            sys.exit(-1)
        if os.path.exists(outFileName):
            print 'error: out-file already exists'
            sys.exit(-1)
        test(inFileName, outFileName)

    if __name__ == '__main__':
        main()
A sample XML input file is here: people.xml.
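The parse, modify, and serialize steps can also be tried entirely in memory. The following is a minimal sketch in Python 3 syntax (an assumption) using the standard-library xml.etree.ElementTree. The function name add_people is hypothetical, and the sample values are the ones used in the listing above:

```python
import xml.etree.ElementTree as ET


def add_one_person(parent, person_id, name, interest):
    """Append a person element with name and interest sub-elements."""
    person = ET.SubElement(parent, 'person')
    person.set('id', person_id)
    ET.SubElement(person, 'name').text = name
    ET.SubElement(person, 'interest').text = interest
    return person


def add_people(xml_text):
    """Parse xml_text, add two person elements, and return the new XML."""
    root = ET.fromstring(xml_text)
    add_one_person(root, '1005', 'Daniel', 'photography')
    add_one_person(root, '1006', 'Edward', 'gardening')
    # encoding='unicode' makes tostring() return str rather than bytes.
    return ET.tostring(root, encoding='unicode')
```

Working on strings avoids the need for the output-file checks in the full program, which makes this form convenient for experiments and tests.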
Notes: