HTML Screen Scraping: A How-To Document

Author:	Dave Kuhlman
Address:	dkuhlman@rexx.com http://www.rexx.com/~dkuhlman
Revision:	1.0a
Date:	Jan. 9, 2004
Copyright:	Copyright (c) 2004 Dave Kuhlman. This documentation is covered by The MIT License: http://www.opensource.org/licenses/mit-license.

Abstract

This document explains how to do HTML screen scraping. In effect it shows how to treat the Web as a resource by enabling you to retrieve and extract data from HTML Web pages.

Contents

1ï¿½ï¿½ï¿½Introduction

The Web contains a huge amount of information. This document shows how to use the Web as a back-end resource behind your Quixote Web applications.

This document explains how to do this behind your Quixote application (although similar techniques could be used in other environments as well). In particular, we will describe:

The use of urllib to retrieve Web pages.
How to use sgrep, a structure grep, to search and extract chunks from HTML Web pages.
The use of regular expressions and the Python re module to extract data from the chunks of text returned by sgrep.
And, where the chunks returned by sgrep contain sufficiently complex HTML mark-up, we consider the use of the Python HTMLParser module to analyze and extract data from chunks of HTML mark-up.

There is a distribution file containing the source code from which the samples in this document were selected. You can find it here: http://www.rexx.com/~dkuhlman/quixote_htmlscraping.zip.

The Web as a back-end resource for your Quixote applications is just one of a number of resources that you can put behind your Quixote application. Other possible back-end resources are, for example relational databases, XML-RPC, SOAP, and Python extension modules. For more information on how to access several other back-end resources under Quixote, see Special Tasks -- Back-end Resources in A Quixote Application: Getting Started.

2ï¿½ï¿½ï¿½Command-line Use

It is often useful to be able to make quick tests of your patterns by running them from the command-line. Here is a simple function that does so. It uses the urllib module to retrieve the Web page and the popen2 module to run sgrep:

import urllib
import popen2

#
# Scan the contents of a URL.
# Retrieve the URL, then feed the contents (a string) to sgrep.
#
def search_url(pattern, url, addOptions):
    if not addOptions:
        addOptions = ''
    options = "-g html -o '%r:::' -T " + addOptions
    cmd = "sgrep %s '%s' -" % (options, pattern)
    print 'cmd:', cmd
    try:
        instream = urllib.urlopen(url)
    except IOError:
        print '*** bad url: %s' % url
        return
    content = instream.read()
    instream.close()
    print 'len(content): %d' % len(content)
    # Feed the content through sgrep and collect the results.
    outfile, infile = popen2.popen2(cmd)
    infile.write(content)
    infile.close()
    results = outfile.read()
    outfile.close()
    print 'results:\n========\n'
    resultlist = results.split(':::')
    for result in resultlist:
        if result.strip():
            print result
            print '---------------'

Explanation:

Module popen2 is the mechanism that we use to feed content to sgrep. We write the Web page content to the input file returned by popen2. And, we read from the output file returned by popen2 to retrieve the output of the sgrep command.
Notice that we use sgrep's output formatting option ("-o") to insert a special character sequence (in the above example ":::") after each of the results returned by sgrep. Doing so enables us to split the returned result into a list of individual values. This is especially handy when the individual values themselves contain new-line characters.

3ï¿½ï¿½ï¿½From Within Quixote

OK, I'll admit it. Doing this inside Quixote is basically the same as doing it outside of Quixote. Again, you are going to use urllib to retrieve a Web page, then use sgrep to extract chunks of text from the page, and finally use the Python regular expression module re (or some other Python parsing technique) to extract data items from the results returned by sgrep.

One concern that we might have is latency. In particular, we might worry that:

Our server might block while our Quixote application is waiting for a request (e.g. through urllib) to be completed.
The client might notice an extra delay while our Quixote application is retrieving a Web page.

Here is an attempt to ease your worries:

The possibility of blocking during a request is eliminated by using the SCGI server for Quixote. This server creates a new process for each request if an existing process is not available to handle the request. Because these request handlers are running in separate processes, they do not block each other. This should reassure even those of you who worry about the Python global interpreter lock (GIL). And, the maximum number of processes can be increased at server start-up time.
A noticable delay for the client can't be helped. Some things take time. In some applications you may be able to cache results.

4ï¿½ï¿½ï¿½Sgrep Patterns

This section gives help with writing sgrep patterns that select data within an HTML document. It contains examples of sgrep patterns for typical data extraction tasks that you are likely to want to perform.

A few examples:

Extract elements directly containing an attribute "HREF" whose value contains "python":
```
elements parenting (attribute("HREF") containing "python")
```
Extract attribute values for attribute "HREF" whose values contains "python":
```
attribute("HREF") containing attvalue("*")
```
Extract data content of an element containing an attribute "HREF" whose value contains "python":
```
stag("A") containing (attribute("HREF") containing "python") __ etag("A")
```
Note that the "__" (double underscore) operator selects a non-inclusive region. In this case, it selects a region that does not include the start and end tags. The ".." (double dot) operator, by contrast, selects an inclusive region. Replacing "__" with ".." would have returned the data content as well as the "A" tags that surround it.
Extract the attribute value for all "HREF" attributes:
```
attvalue("*") in attribute("HREF")
```
Extract all "TR" elements which contain a "TD" element that immediately contains an "A" element.

(stag("TR") .. etag("TR")) containing (stag("TD") .. etag("TD")) parenting (stag("A") .. etag("A"))

But, see note about problems with using more that one "parenting" operator, below.

A few additional comments and notes:

The README file in the sgrep distribution has more sample patterns as well as documentation on the latest features. And, documentation on the sgrep query language is at: http://www.cs.helsinki.fi/u/jjaakkol/sgrepman.html.
Use the "-g html" option to tell sgrep that it should process HTML content.
Using more that one "parenting" operator seems to cause problems. You will have to try it yourself. Feature or bug?
Indexing -- If caching Web pages on your local machine works for your application and you decide to do so, look at the indexing operations and options supported by sgrep, which can speed up searches.

5ï¿½ï¿½ï¿½Running sgrep

There are two techniques for running sgrep:

Use pysgrep -- pysgrep is a Python extension module that enables structured searches from within Python.
Use module popen2 to run sgrep at the command-line.

The use of pysgrep is described in PySgrep - Python Wrappers for Sgrep, so I will not repeat that here.

Using popen2 is simple and will be used in the examples below.

Comparison -- PySgrep vs. popen2:

I'm concerned that, since sgrep itself was designed for command-line use and exits after each query, there may be memory leaks that I haven't discovered and that may impact pysgrep. If anyone gains experience that confirms or disproves this worry, please send me an account of it.
Since using popen2 starts a separate process, I'm guessing that it might be slower. If speed is a special worry for you, then investigate pysgrep. On the other hand, in comparison with the latency caused by retrieving a Web page (e.g. with urllib), the time consumed by popen2 might be insignificant.

6ï¿½ï¿½ï¿½Sgrep plus regular expressions

sgrep does not (yet) have the capability to use regular expressions. You can get some of the effect of regular expressions by using Python's re module and applying it to results produced by sgrep.

Basically we are going to use a regular expression to extract pieces of data from each of the chunks returned by sgrep.

An example -- Suppose that we want to extract the server and domain name from URLs. For example, suppose sgrep returns something like the following:

http://www.python.org/doc/current/tut/tut.html

And, we want to extract:

www.python.org

Here is how we might do that:

#
# Scan files.
# Read the files, then feed the file contents (a string) to sgrep.
#
def search_files(pattern, filenames, addOptions, regex):
    if not addOptions:
        addOptions = ''
    options = "-g html -o '%r:::' " + addOptions
    #
    # Compile the regular expression if there is one.
    expr = None
    if regex:
        expr = re.compile(regex)
    cmd = "sgrep %s '%s' -" % (options, pattern)
    for filename in filenames:
        inputfile = file(filename, 'r')
        outfile, infile = popen2.popen2(cmd)
        infile.write(inputfile.read())
        infile.close()
        results = outfile.read()
        outfile.close()
        print '=' * 50
        s1 = 'file: %s' % filename
        print s1
        print '=' * len(s1)
        resultlist = results.split(':::')
        for result in resultlist:
            if result.strip():
                print result
                #
                # If there is a regular expression, use it to
                #   search the result.
                if expr:
                    matchobject = expr.search(result)
                    if matchobject:
                        print 'match: %s' % matchobject.group(1)
                    else:
                        print 'no match'
                print '---------------'

Explanation:

If this function is passed a regular expression, then it compiles the regular expression and uses it to search each result returned by sgrep.
If the regular expression search succeeds (i.e. if "expr.search(result)" returns a match object, then we retrieve the first group from that match object. We are assuming that the regular expression contains at least one group (in particular that it has at least one set of parentheses).
In order to extract the server and domain from a URL, we might pass this function the following regular expression:
```
'https?://([^/]*)'
```
For more on the Python regular expressions, see re -- Regular expression operations.

7ï¿½ï¿½ï¿½Sgrep plus HTMLParser

In some cases the chunks of text returned by sgrep may contain HTML mark-up that is sufficiently complex so that it becomes very awkward to use regular expressions to analyze it. In other cases, it just may be easier to ask sgrep to return such chunks of HTML mark-up. This section describes how to use the Python HTMLParser module to analyze these chunks of text.

This technique is limited to some extent by the need to give the HTMLParser.feed() chunks of mark-up that are "complete", that is the contain balanced tags. However, this requirement is easy to satisfy with sgrep, because any query of the form "(stag("tag") .. etag("tag")) parenting ..." will return a chunk of HTML mark-up that we can give to the HTMLParser.feed(data) method. And, note that there is no requirement to feed a "complete" chunk of mark-up to HTMLParser.feed(data) in a single call; we can call feed multiple times in order to do so.

Here is a reasonably simple example. This example searches http://jobs.com with a query, then extracts (1) a brief job description, (2) a URL, (3) the company name, and (4) the job location, then formats a Web page with this extracted information.

Here is the code that does the query and data extraction:

class JobService:

    #
    # Process a single query.
    # Return a tuple: (urlList, descriptionList, companyList, locationList).
    # The query is a sequence of words separated by spaces.
    #
    def job_search(self, query):
        if query:
            q1 = query.replace(' ', '.')
            q2 = query.replace(' ', '%26')
            q3 = query.replace(' ', '+')
        else:
            return [], [], [], []
        req = 'http://%s.jobs.com/jobsearch.asp?re=9&vw=b&pg=1&cy=US&sq=%s&aj=%s' \
            % (q1, q2, q3)
        f = urllib.urlopen(req)
        content = f.read()
        f.close()
        resultTuple = self.job_parse(content)
        return resultTuple

    def job_parse(self, content):
        # Extract the URLs.
        cmd = "sgrep -g html -o '%r:::' 'attribute(\"HREF\") in " \
              "((stag(\"A\") .. etag(\"A\")) childrening " \
              "(stag(\"TD\") .. etag(\"TD\")) containing " \
              "attribute(\"HREF\") containing \"getjob\")' -"
        urlList = self.extract(cmd, content)
        # Extract the descriptions.
        cmd = "sgrep -g html -o '%r:::' '(stag(\"A\") __ etag(\"A\")) in " \
              "((stag(\"A\") .. etag(\"A\")) childrening " \
              "(stag(\"TD\") .. etag(\"TD\")) containing " \
              "attribute(\"HREF\") containing \"getjob\")' -"
        descriptionList = self.extract(cmd, content)
        # Extract the company names and locations.
        cmd = "sgrep -g html -o '%r:::' '(stag(\"TR\") .. etag(\"TR\")) containing " \
              "(stag(\"TD\") .. etag(\"TD\")) parenting " \
              "stag(\"A\") containing " \
              "attribute(\"HREF\") containing \"getjob\"'"
        companyList, locationList = self.extract_with_htmlparser(cmd, content)
        return urlList, descriptionList, companyList, locationList

    def extract(self, cmd, content):
        outfile, infile = popen2.popen2(cmd)
        infile.write(content)
        infile.close()
        results = outfile.read()
        outfile.close()
        resultList = results.split(':::')
        return resultList

    def extract_with_htmlparser(self, cmd, content):
        parser = LocationHTMLParser()
        outfile, infile = popen2.popen2(cmd)
        infile.write(content)
        infile.close()
        results = outfile.read()
        outfile.close()
        resultList = results.split(':::')
        companyList = []
        locationList = []
        for result in resultList:
            parser.clear()
            parser.feed(result)
            companyList.append(parser.getCompany())
            locationList.append(parser.getLocation())
        return companyList, locationList


class LocationHTMLParser(HTMLParser.HTMLParser):

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.count = 0
        self.company = ''
        self.location = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.count += 1

##    def handle_endtag(self, tag):
##        pass

    def handle_data(self, data):
        if self.count == 4:
            self.company += data
        elif self.count == 5:
            self.location += data

    #
    # Note to self:  Do not use name "reset".  HTMLParser
    #   defines and uses that.
    def clear(self):
        self.count = 0
        self.company = ''
        self.location = ''

    def getCompany(self):
        return self.company

    def getLocation(self):
        return self.location

Explanation:

We use urllib to retrieve a Web page from Jobs.com. The Web page is a "brief view" of a list of jobs. I learned how to form the URL for this request by using my Web browser, doing a search, then clicking on "Brief" view, and then copying the URL from the browser's address field.
Now look at method extract_with_parser. It uses sgrep to extract chunks of HTML mark-up which have a "tr" element at the outer-most level and contain a sequence of "td" elements. Each of these chunks is separated by three colons (":::", see the output formatting option "-o").
Then, the line containing``results.split(':::')`` separates these chunks of mark-up into a list for further processing ...
Next, method extract_with_parser applies an instance of LocationHTMLParser to each chunk, after which it retrieves the company name and job location from that instance.
Class LocationHTMLParser is a sub-class of HTMLParser. LocationHTMLParser counts the occurances of the "td" start tag. It saves the data from the 4th "td" element as the company. It saves the data from the 5th "td" element as the location.
Notice how I've separated the code that retrieves the Web page and the code that does data extraction into separate methods. That gives us a bit more flexibility to do things like calling this code from a Quixote user interface (e.g. a PTL file), from a command-line harness, or from a unit test harness.

And, here is the code that provides the Quixote Web user interface and that generates the Web page:

class ServicesUI:
    o
    o
    o

    def do_job_search [html] (self, request):
        queryString = widget.StringWidget('query_string')
        submit = widget.SubmitButtonWidget(value='Search')
        if request.form:
            queryStringValue = queryString.parse(request)
        else:
            queryStringValue = ''
        jobservice = services.JobService()
        urlList, descriptionList, companyList, locationList = jobservice.job_search(
            str(queryStringValue))
        header('Jobs')
        '<form method="POST" action="job_search">\n'
        '<p>Query string:'
        queryString.render(request)
        '</p>\n'
        '<p>'
        submit.render(request)
        '</p>\n'
        '</form>\n'
        '<hr/>\n'
        '<ul>\n'
        re1 = re.compile(str('href="([^"]*)"'))
        q1 = queryStringValue.replace(str(' '), str('.'))
        for idx in range(len(urlList)):
            result = urlList[idx]
            description = descriptionList[idx]
            company = companyList[idx]
            location = locationList[idx]
            if result.strip():
                mo = re1.search(result)
                if mo:
                    url = mo.group(1)
                    '<li><a href="http://%s.jobs.com%s">%s</a> at %s in %s</li>' % \
                        (q1, url, description, company, location)
        '</ul>'
        footer()

Explanation:

We use the Quixote widget module to create a simple user interface that enables the user to enter a job query.
We call the job_search method in the JobService class to extract the needed data items: URL, job description, company name, and job location.
And then we use standard Quixote PTL to format the data items in the Web page.

8ï¿½ï¿½ï¿½An HTML Scraping Development Methodology

This section is part summary/review and part suggestion for a sequence of steps to follow for this kind of work.

For each HTML scraping operation, do the following:

Determine and capture the URL -- Use your Web browser to visit the page containing the data you want. Then copy the contents of the address field. Note that in you Python code, you may have to replace arguments in the URL.
Capture a sample of the Web page in a file. Here is a simple script that retrieves a Web page and writes it to stdout, which you can pipe to the file of your choice:
```
import urllib
 
def get_page(url):
    f = urllib.urlopen(url)
    content = f.read()
    f.close()
    print content
```

Off-line (i.e. outside of Quixote), write and test a script that extracts the data items you want from this sample page you have captured in a file. Here is a harness that you can use to test your data extraction scripts. It is actually taken from a unit test, and it tests the code in the example above:

from test2.services import jobsearchservices

class TestDataExtraction(unittest.TestCase):
    o
    o
    o
    def test_retrieve_and_parse(self):
        jobservice = jobsearchservices.JobService()
        urlList, descriptionList, companyList, locationList = \
            jobservice.job_search('python internet')
        self.assert_(len(urlList) > 0)
        self.assert_(len(urlList) == len(descriptionList))
        self.assert_(len(urlList) == len(companyList))
        self.assert_(len(urlList) == len(locationList))

Copy and paste the data extraction function that you have just tested into your Quixote application. Or, with a little prior planning, you could put these data extraction functions into a Python module where they can be used both during off-line testing and from within your Quixote application.
Use the Python unittest framework to set up tests for your data extraction functions.

8.1ï¿½ï¿½ï¿½Unit tests

Good development methodology is -- First, documentation. Next, unit tests. Then code.

Here is a sample unit test that you can use to get started:

#!/usr/bin/env python

import unittest
from test2.services import jobsearchservices

class TestDataExtraction(unittest.TestCase):
    
    def setUp(self):
        pass

    def test_retrieve_and_parse(self):
        jobservice = jobsearchservices.JobService()
        urlList, descriptionList, companyList, locationList = \
            jobservice.job_search('python internet')
        self.assert_(len(urlList) > 0)
        self.assert_(len(urlList) == len(descriptionList))
        self.assert_(len(urlList) == len(companyList))
        self.assert_(len(urlList) == len(locationList))

if __name__ == '__main__':
    unittest.main()

Explanation:

This sample runs one simple test and verifies that we extract at least one data item and that the lengths of the returned lists of extracted data items are equal.
Because our HTML data retrieval and extraction code is in a separate module (from the rest of our Quixote application), we can import and test it outside of the Quixote application.
In order to add additional tests, add additional methods whose names begin with the letters "test".
For more information on the Python unit testing framework, see: http://www.python.org/doc/current/lib/module-unittest.html.

9ï¿½ï¿½ï¿½Summary

9.1ï¿½ï¿½ï¿½Hints and suggestions

Here are several suggestions for the implementation of your HTML data access:

Separate the user interface from the model -- Try to put your back-end access code in a separate module that contains no Quixote code. There are several reasons for doing this: (1) if you do so, it is good evidence of a separation of your user interface from your model (the application logic) and (2) doing so will allow you to run the code from outside of Quixote (e.g. from unit tests written with the Python unit test framework).
Define an API -- Try to write your code in a way that makes the access to a set of resources well-defined. One way to accomplish this is (1) to implement a separate function or method for each Web resource access and (2) to specify, for each function/method both the URL and the data returned. Specifying the URL may require describing arguments that are filled into the URL, for example CGI variables. Specifying the data returned may require describing a complex structure, for example a list of lists.

9.2ï¿½ï¿½ï¿½Motivation and tools

The Web is a huge resource. All we lack is the motivation and tools to make use of it. Or do we ...

10ï¿½ï¿½ï¿½See Also

http://www.mems-exchange.org/software/quixote/: The Quixote support Web site.

Sgrep home page

sgrep - search a file for a structured pattern: The sgrep man page.

re -- Regular expression operations: Documentation on Python's regular expression module.

PySgrep - Python Wrappers for Sgrep: Detailed information on the Python extension module for sgrep.

unittest: The Python Unit testing framework