========================================
HTML Screen Scraping: A How-To Document
========================================

:author: Dave Kuhlman
:address: dkuhlman@rexx.com
    http://www.rexx.com/~dkuhlman

:revision: 1.0a
:date: Jan. 9, 2004

:copyright: Copyright (c) 2004 Dave Kuhlman.  This documentation is
    covered by The MIT License:
    http://www.opensource.org/licenses/mit-license.

:abstract: This document explains how to do HTML screen scraping.
    In effect it shows how to treat the Web as a resource by
    enabling you to retrieve and extract data from HTML Web pages.

.. sectnum::
    :depth: 4

.. contents::
    :depth: 4


Introduction
============

The Web contains a huge amount of information.  This document shows
how to use the Web as a back-end resource behind your Quixote Web
applications (although similar techniques can be used in other
environments as well).  In particular, we describe:

- The use of ``urllib`` to retrieve Web pages.

- How to use ``sgrep``, a structured grep, to search and extract
  chunks from HTML Web pages.

- The use of regular expressions and the Python ``re`` module to
  extract data from the chunks of text returned by ``sgrep``.

- And, where the chunks returned by ``sgrep`` contain sufficiently
  complex HTML mark-up, the use of the Python ``HTMLParser`` module
  to analyze the mark-up and extract data from it.

There is a distribution file containing the source code from which
the samples in this document were selected.  You can find it here:
http://www.rexx.com/~dkuhlman/quixote_htmlscraping.zip.

The Web is just one of a number of back-end resources that you can
put behind your Quixote application.  Other possible back-end
resources are, for example, relational databases, XML-RPC, SOAP, and
Python extension modules.  For more information on how to access
several other back-end resources under Quixote, see `Special Tasks
-- Back-end Resources`_ in `A Quixote Application: Getting
Started`_.

.. _`Special Tasks -- Back-end Resources`:
    http://www.rexx.com/~dkuhlman/quixote_appgetgo.html#special-tasks-back-end-resources

.. _`A Quixote Application: Getting Started`:
    http://www.rexx.com/~dkuhlman/quixote_appgetgo.html


Command-line Use
================

It is often useful to be able to make quick tests of your patterns
by running them from the command-line.  Here is a simple function
that does so.  It uses the ``urllib`` module to retrieve the Web
page and the ``popen2`` module to run ``sgrep``::

    import urllib
    import popen2

    #
    # Scan the contents of a URL.
    # Retrieve the URL, then feed the contents (a string) to sgrep.
    #
    def search_url(pattern, url, addOptions):
        if not addOptions:
            addOptions = ''
        options = "-g html -o '%r:::' -T " + addOptions
        cmd = "sgrep %s '%s' -" % (options, pattern)
        print 'cmd:', cmd
        try:
            instream = urllib.urlopen(url)
        except IOError:
            print '*** bad url: %s' % url
            return
        content = instream.read()
        instream.close()
        print 'len(content): %d' % len(content)
        # Feed the content through sgrep and collect the results.
        outfile, infile = popen2.popen2(cmd)
        infile.write(content)
        infile.close()
        results = outfile.read()
        outfile.close()
        print 'results:\n========\n'
        resultlist = results.split(':::')
        for result in resultlist:
            if result.strip():
                print result
                print '---------------'

Explanation:

- Module ``popen2`` is the mechanism that we use to feed content to
  ``sgrep``: we write the Web page content to the input file
  returned by ``popen2``, and we read from the output file returned
  by ``popen2`` to retrieve the output of the ``sgrep`` command.

- Notice that we use ``sgrep``'s output formatting option ("-o") to
  insert a special character sequence (in the above example ":::")
  after each of the results returned by ``sgrep``.  Doing so enables
  us to split the returned result into a list of individual values.
  This is especially handy when the individual values themselves
  contain new-line characters.
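For quick experiments you might wrap ``search_url`` in a small
command-line driver.  Here is a minimal sketch of such a driver; the
module name ``search_url_cli.py`` and the argument handling are only
illustrative assumptions, not part of the distribution file::

    import sys

    # Hypothetical driver -- run as:
    #   python search_url_cli.py '<sgrep pattern>' <url> [extra sgrep options]
    # It simply forwards the command-line arguments to search_url() above.
    if __name__ == '__main__':
        if len(sys.argv) < 3:
            print 'usage: python search_url_cli.py <pattern> <url> [options]'
            sys.exit(1)
        addOptions = ' '.join(sys.argv[3:])
        search_url(sys.argv[1], sys.argv[2], addOptions)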
From Within Quixote
===================

OK, I'll admit it.  Doing this *inside* Quixote is basically the
same as doing it *outside* of Quixote.  Again, you are going to use
``urllib`` to retrieve a Web page, then use ``sgrep`` to extract
chunks of text from the page, and finally use the Python regular
expression module ``re`` (or some other Python parsing technique) to
extract data items from the results returned by ``sgrep``.

One concern that we might have is latency.  In particular, we might
worry that:

1. Our server might block while our Quixote application is waiting
   for a request (e.g. through ``urllib``) to be completed.

2. The client might notice an extra delay while our Quixote
   application is retrieving a Web page.

Here is an attempt to ease your worries:

1. The possibility of blocking during a request is eliminated by
   using the SCGI server for Quixote.  This server creates a new
   process for each request if an existing process is not available
   to handle the request.  Because these request handlers are
   running in separate processes, they do not block each other.
   This should reassure even those of you who worry about the Python
   global interpreter lock (GIL).  And, the maximum number of
   processes can be increased at server start-up time.

2. A noticeable delay for the client can't be helped.  Some things
   take time.  In some applications you may be able to cache
   results.


Sgrep Patterns
==============

This section gives help with writing ``sgrep`` patterns that select
data within an HTML document.  It contains examples of ``sgrep``
patterns for typical data extraction tasks that you are likely to
want to perform.

A few examples:

- Extract elements directly containing an attribute "HREF" whose
  value contains "python"::

      elements parenting (attribute("HREF") containing "python")

- Extract attribute values for attribute "HREF"::

      attribute("HREF") containing attvalue("*")

- Extract the data content of an element containing an attribute
  "HREF" whose value contains "python"::

      stag("A") containing (attribute("HREF") containing "python")
          __ etag("A")

  Note that the "__" (double underscore) operator selects a
  non-inclusive region.  In this case, it selects a region that does
  *not* include the start and end tags.  The ".." (double dot)
  operator, by contrast, selects an *inclusive* region.  Replacing
  "__" with ".." would have returned the data content *as well as*
  the "A" tags that surround it.

- Extract the attribute value for all "HREF" attributes::

      attvalue("*") in attribute("HREF")

- Extract all "TR" elements which contain a "TD" element that
  immediately contains an "A" element::

      (stag("TR") .. etag("TR")) containing
          (stag("TD") .. etag("TD")) parenting
          (stag("A") .. etag("A"))

  But, see the note below about problems with using more than one
  "parenting" operator.
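Any of these patterns can be tried quickly with the ``search_url``
function from the "Command-line Use" section above.  For example
(the target URL here is only an illustration)::

    # Illustrative only: list the value of every HREF attribute on the
    # Python home page, using search_url() from the section above.
    search_url('attvalue("*") in attribute("HREF")',
               'http://www.python.org/', None)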
A few additional comments and notes:

- The README file in the ``sgrep`` distribution has more sample
  patterns as well as documentation on the latest features.  And,
  documentation on the ``sgrep`` query language is at:
  http://www.cs.helsinki.fi/u/jjaakkol/sgrepman.html.

- Use the "-g html" option to tell ``sgrep`` that it should process
  HTML content.

- Using more than one "parenting" operator seems to cause problems.
  You will have to try it yourself.  Feature or bug?

- Indexing -- If caching Web pages on your local machine works for
  your application and you decide to do so, look at the indexing
  operations and options supported by ``sgrep``, which can speed up
  searches.


Running sgrep
=============

There are two techniques for running ``sgrep``:

- Use ``pysgrep`` -- ``pysgrep`` is a Python extension module that
  enables structured searches from within Python.

- Use module ``popen2`` to run ``sgrep`` at the command-line.

The use of ``pysgrep`` is described in `PySgrep - Python Wrappers
for Sgrep`_, so I will not repeat that here.

.. _`PySgrep - Python Wrappers for Sgrep`:
    http://www.rexx.com/~dkuhlman/pysgrep.html

Using ``popen2`` is simple and will be used in the examples below.

Comparison -- ``pysgrep`` vs. ``popen2``:

- I'm concerned that, since ``sgrep`` itself was designed for
  command-line use and exits after each query, there may be memory
  leaks that I haven't discovered and that may impact ``pysgrep``.
  If anyone gains experience that confirms or disproves this worry,
  please send me an account of it.

- Since using ``popen2`` starts a separate process, I'm guessing
  that it might be slower.  If speed is a special worry for you,
  then investigate ``pysgrep``.  On the other hand, in comparison
  with the latency caused by retrieving a Web page (e.g. with
  ``urllib``), the time consumed by ``popen2`` might be
  insignificant.


Sgrep plus regular expressions
==============================

``sgrep`` does not (yet) have the capability to use regular
expressions.  You can get some of the effect of regular expressions
by using Python's ``re`` module and applying it to results produced
by ``sgrep``.  Basically, we are going to use a regular expression
to extract pieces of data from each of the chunks returned by
``sgrep``.

An example -- Suppose that we want to extract the server and domain
name from URLs.  For example, suppose ``sgrep`` returns something
like the following::

    http://www.python.org/doc/current/tut/tut.html

And, we want to extract::

    www.python.org

Here is how we might do that::

    import re
    import popen2

    #
    # Scan files.
    # Read the files, then feed the file contents (a string) to sgrep.
    #
    def search_files(pattern, filenames, addOptions, regex):
        if not addOptions:
            addOptions = ''
        options = "-g html -o '%r:::' " + addOptions
        #
        # Compile the regular expression if there is one.
        expr = None
        if regex:
            expr = re.compile(regex)
        cmd = "sgrep %s '%s' -" % (options, pattern)
        for filename in filenames:
            inputfile = file(filename, 'r')
            outfile, infile = popen2.popen2(cmd)
            infile.write(inputfile.read())
            infile.close()
            results = outfile.read()
            outfile.close()
            print '=' * 50
            s1 = 'file: %s' % filename
            print s1
            print '=' * len(s1)
            resultlist = results.split(':::')
            for result in resultlist:
                if result.strip():
                    print result
                    #
                    # If there is a regular expression, use it to
                    # search the result.
                    if expr:
                        matchobject = expr.search(result)
                        if matchobject:
                            print 'match: %s' % matchobject.group(1)
                        else:
                            print 'no match'
                    print '---------------'

Explanation:

- If this function is passed a regular expression, then it compiles
  the regular expression and uses it to search each result returned
  by ``sgrep``.

- If the regular expression search succeeds (i.e. if
  ``expr.search(result)`` returns a match object), then we retrieve
  the first group from that match object.  We are assuming that the
  regular expression contains at least one group (in particular,
  that it has at least one set of parentheses).

- In order to extract the server and domain from a URL, we might
  pass this function the following regular expression (see the short
  stand-alone check below)::

      'https?://([^/]*)'

- For more on Python regular expressions, see `re -- Regular
  expression operations`_.

.. _`re -- Regular expression operations`:
    http://www.python.org/doc/current/lib/module-re.html
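To make that concrete, here is a small stand-alone check of the
suggested expression against the sample URL shown above, independent
of ``sgrep``::

    import re

    # The expression suggested above, applied to the sample URL.
    expr = re.compile(r'https?://([^/]*)')
    match = expr.search('http://www.python.org/doc/current/tut/tut.html')
    if match:
        print match.group(1)        # prints: www.python.org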
Sgrep plus HTMLParser
=====================

In some cases the chunks of text returned by ``sgrep`` may contain
HTML mark-up that is sufficiently complex that it becomes very
awkward to use regular expressions to analyze it.  In other cases,
it may just be easier to ask ``sgrep`` to return such chunks of HTML
mark-up.  This section describes how to use the Python
``HTMLParser`` module to analyze these chunks of text.

This technique is limited to some extent by the need to give
``HTMLParser.feed()`` chunks of mark-up that are "complete", that
is, that contain balanced tags.  However, this requirement is easy
to satisfy with ``sgrep``, because any query of the form
``(stag("tag") .. etag("tag")) parenting ...`` will return a chunk
of HTML mark-up that we can give to the ``HTMLParser.feed(data)``
method.  And, note that there is no requirement to feed a "complete"
chunk of mark-up to ``HTMLParser.feed(data)`` in a single call; we
can call ``feed`` multiple times in order to do so.
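Before looking at a full example, here is a bare-bones illustration
of the ``HTMLParser`` protocol: a sub-class overrides handlers such
as ``handle_starttag`` and ``handle_data``, and we push mark-up
through it with ``feed()``.  The class name and the sample chunk
below are invented for illustration only::

    import HTMLParser

    class TagDumper(HTMLParser.HTMLParser):
        # Print each start tag and each piece of character data.
        def handle_starttag(self, tag, attrs):
            print 'start tag: %s  attrs: %s' % (tag, attrs)
        def handle_data(self, data):
            if data.strip():
                print 'data: %s' % data.strip()

    parser = TagDumper()
    # A balanced chunk, such as sgrep returns for (stag("TD") .. etag("TD")).
    parser.feed('<td><a href="getjob.asp?id=1">Python developer</a></td>')
    parser.close()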
Here is a reasonably simple example.  This example searches
http://jobs.com with a query, then extracts (1) a brief job
description, (2) a URL, (3) the company name, and (4) the job
location, then formats a Web page with this extracted information.

Here is the code that does the query and data extraction::

    import urllib
    import popen2
    import HTMLParser

    class JobService:
        #
        # Process a single query.
        # Return a tuple: (urlList, descriptionList, companyList, locationList).
        # The query is a sequence of words separated by spaces.
        #
        def job_search(self, query):
            if query:
                q1 = query.replace(' ', '.')
                q2 = query.replace(' ', '%26')
                q3 = query.replace(' ', '+')
            else:
                return [], [], [], []
            req = 'http://%s.jobs.com/jobsearch.asp?re=9&vw=b&pg=1&cy=US&sq=%s&aj=%s' \
                % (q1, q2, q3)
            f = urllib.urlopen(req)
            content = f.read()
            f.close()
            resultTuple = self.job_parse(content)
            return resultTuple

        def job_parse(self, content):
            # Extract the URLs.
            cmd = "sgrep -g html -o '%r:::' 'attribute(\"HREF\") in " \
                "((stag(\"A\") .. etag(\"A\")) childrening " \
                "(stag(\"TD\") .. etag(\"TD\")) containing " \
                "attribute(\"HREF\") containing \"getjob\")' -"
            urlList = self.extract(cmd, content)
            # Extract the descriptions.
            cmd = "sgrep -g html -o '%r:::' '(stag(\"A\") __ etag(\"A\")) in " \
                "((stag(\"A\") .. etag(\"A\")) childrening " \
                "(stag(\"TD\") .. etag(\"TD\")) containing " \
                "attribute(\"HREF\") containing \"getjob\")' -"
            descriptionList = self.extract(cmd, content)
            # Extract the company names and locations.
            cmd = "sgrep -g html -o '%r:::' '(stag(\"TR\") .. etag(\"TR\")) containing " \
                "(stag(\"TD\") .. etag(\"TD\")) parenting " \
                "stag(\"A\") containing " \
                "attribute(\"HREF\") containing \"getjob\"' -"
            companyList, locationList = self.extract_with_htmlparser(cmd, content)
            return urlList, descriptionList, companyList, locationList

        def extract(self, cmd, content):
            outfile, infile = popen2.popen2(cmd)
            infile.write(content)
            infile.close()
            results = outfile.read()
            outfile.close()
            resultList = results.split(':::')
            return resultList

        def extract_with_htmlparser(self, cmd, content):
            parser = LocationHTMLParser()
            outfile, infile = popen2.popen2(cmd)
            infile.write(content)
            infile.close()
            results = outfile.read()
            outfile.close()
            resultList = results.split(':::')
            companyList = []
            locationList = []
            for result in resultList:
                parser.clear()
                parser.feed(result)
                companyList.append(parser.getCompany())
                locationList.append(parser.getLocation())
            return companyList, locationList


    class LocationHTMLParser(HTMLParser.HTMLParser):
        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            self.count = 0
            self.company = ''
            self.location = ''
        def handle_starttag(self, tag, attrs):
            if tag == 'td':
                self.count += 1
        ## def handle_endtag(self, tag):
        ##     pass
        def handle_data(self, data):
            if self.count == 4:
                self.company += data
            elif self.count == 5:
                self.location += data
        #
        # Note to self: Do not use name "reset".  HTMLParser
        # defines and uses that.
        def clear(self):
            self.count = 0
            self.company = ''
            self.location = ''
        def getCompany(self):
            return self.company
        def getLocation(self):
            return self.location

Explanation:

- We use ``urllib`` to retrieve a Web page from ``Jobs.com``.  The
  Web page is a "brief view" of a list of jobs.  I learned how to
  form the URL for this request by using my Web browser, doing a
  search, then clicking on "Brief" view, and then copying the URL
  from the browser's address field.

- Now look at method ``extract_with_htmlparser``.  It uses ``sgrep``
  to extract chunks of HTML mark-up which have a "tr" element at the
  outer-most level and contain a sequence of "td" elements.  Each of
  these chunks is separated by three colons (":::"; see the output
  formatting option "-o").

- Then, the line containing ``results.split(':::')`` separates these
  chunks of mark-up into a list for further processing.

- Next, method ``extract_with_htmlparser`` applies an instance of
  ``LocationHTMLParser`` to each chunk, after which it retrieves the
  company name and job location from that instance.

- Class ``LocationHTMLParser`` is a sub-class of
  ``HTMLParser.HTMLParser``.  ``LocationHTMLParser`` counts the
  occurrences of the "td" start tag.  It saves the data from the 4th
  "td" element as the company and the data from the 5th "td" element
  as the location.

- Notice how I've separated the code that retrieves the Web page and
  the code that does data extraction into separate methods.  That
  gives us a bit more flexibility to do things like calling this
  code from a Quixote user interface (e.g. a PTL file), from a
  command-line harness, or from a unit test harness.
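Because the retrieval code and the extraction code are separated, it
is also easy to exercise the extraction off-line.  For example, a
throw-away check of ``job_parse`` against a results page saved in a
local file might look like the following; the file name is only an
example::

    # Quick off-line check: run job_parse() on a previously saved page.
    jobservice = JobService()
    f = file('jobs_sample.html', 'r')
    content = f.read()
    f.close()
    urlList, descriptionList, companyList, locationList = \
        jobservice.job_parse(content)
    for url, description in zip(urlList, descriptionList):
        print description.strip(), '--', url.strip()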
And, here is the code that provides the Quixote Web user interface
and that generates the Web page::

    class ServicesUI:

        o
        o
        o

        def do_job_search [html] (self, request):
            queryString = widget.StringWidget('query_string')
            submit = widget.SubmitButtonWidget(value='Search')
            if request.form:
                queryStringValue = queryString.parse(request)
            else:
                queryStringValue = ''
            jobservice = services.JobService()
            urlList, descriptionList, companyList, locationList = jobservice.job_search(
                str(queryStringValue))
            header('Jobs')
            '<form method="POST">\n'
            '<table>\n'
            '<tr><td>Query string:'
            queryString.render(request)
            '</td></tr>\n'
            '<tr><td>'
            submit.render(request)
            '</td></tr>\n'
            '</table>\n'
            '</form>\n'
            ''
            footer()
Explanation:

- We use the Quixote ``widget`` module to create a simple user
  interface that enables the user to enter a job query.

- We call the ``job_search`` method in the ``JobService`` class to
  extract the needed data items: URL, job description, company name,
  and job location.

- And then we use standard Quixote PTL to format the data items in
  the Web page.


An HTML Scraping Development Methodology
========================================

This section is part summary/review and part suggestion for a
sequence of steps to follow for this kind of work.  For each HTML
scraping operation, do the following:

1. Determine and capture the URL -- Use your Web browser to visit
   the page containing the data you want.  Then copy the contents of
   the address field.  Note that in your Python code, you may have
   to replace arguments in the URL (see the short sketch after this
   list).

2. Capture a sample of the Web page in a file.  Here is a simple
   script that retrieves a Web page and writes it to stdout, which
   you can pipe to the file of your choice::

       import sys
       import urllib

       def get_page(url):
           f = urllib.urlopen(url)
           content = f.read()
           f.close()
           print content

       if __name__ == '__main__':
           get_page(sys.argv[1])

3. Off-line (i.e. outside of Quixote), write and test a script that
   extracts the data items you want from this sample page you have
   captured in a file.  Here is a harness that you can use to test
   your data extraction scripts.  It is actually taken from a unit
   test, and it tests the code in the example above::

       from test2.services import jobsearchservices

       class TestDataExtraction(unittest.TestCase):

           o
           o
           o

           def test_retrieve_and_parse(self):
               jobservice = jobsearchservices.JobService()
               urlList, descriptionList, companyList, locationList = \
                   jobservice.job_search('python internet')
               self.assert_(len(urlList) > 0)
               self.assert_(len(urlList) == len(descriptionList))
               self.assert_(len(urlList) == len(companyList))
               self.assert_(len(urlList) == len(locationList))

4. Copy and paste the data extraction function that you have just
   tested into your Quixote application.  Or, with a little prior
   planning, you could put these data extraction functions into a
   Python module where they can be used both during off-line testing
   and from within your Quixote application.

5. Use the Python ``unittest`` framework to set up tests for your
   data extraction functions.
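Regarding step 1, here is a minimal sketch of filling an argument
into a captured URL, using ``urllib.quote_plus`` to escape the
user-supplied value; the URL below is only a stand-in::

    import urllib

    # Illustrative only: fill a user-supplied value into a captured URL.
    query = 'python internet'
    url = 'http://www.example.com/search?q=%s' % urllib.quote_plus(query)
    print url        # prints: http://www.example.com/search?q=python+internet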
Unit tests
----------

Good development methodology is -- First, documentation.  Next, unit
tests.  Then code.  Here is a sample unit test that you can use to
get started::

    #!/usr/bin/env python

    import unittest

    from test2.services import jobsearchservices

    class TestDataExtraction(unittest.TestCase):

        def setUp(self):
            pass

        def test_retrieve_and_parse(self):
            jobservice = jobsearchservices.JobService()
            urlList, descriptionList, companyList, locationList = \
                jobservice.job_search('python internet')
            self.assert_(len(urlList) > 0)
            self.assert_(len(urlList) == len(descriptionList))
            self.assert_(len(urlList) == len(companyList))
            self.assert_(len(urlList) == len(locationList))

    if __name__ == '__main__':
        unittest.main()

Explanation:

- This sample runs one simple test and verifies that we extract at
  least one data item and that the lengths of the returned lists of
  extracted data items are equal.

- Because our HTML data retrieval and extraction code is in a
  separate module (from the rest of our Quixote application), we can
  import and test it outside of the Quixote application.

- In order to add additional tests, add additional methods whose
  names begin with the letters "test".

- For more information on the Python unit testing framework, see:
  http://www.python.org/doc/current/lib/module-unittest.html.


Summary
=======

Hints and suggestions
---------------------

Here are several suggestions for the implementation of your HTML
data access:

- Separate the user interface from the model -- Try to put your
  back-end access code in a separate module that contains no Quixote
  code.  There are several reasons for doing this: (1) if you do so,
  it is good evidence of a separation of your user interface from
  your model (the application logic) and (2) doing so will allow you
  to run the code from outside of Quixote (e.g. from unit tests
  written with the Python unit test framework).

- Define an API -- Try to write your code in a way that makes the
  access to a set of resources well-defined.  One way to accomplish
  this is (1) to implement a separate function or method for each
  Web resource access and (2) to specify, for each function/method,
  both the URL and the data returned.  Specifying the URL may
  require describing arguments that are filled into the URL, for
  example CGI variables.  Specifying the data returned may require
  describing a complex structure, for example a list of lists.

Motivation and tools
--------------------

The Web is a huge resource.  All we lack is the motivation and tools
to make use of it.  Or do we ...


See Also
========

`http://www.mems-exchange.org/software/quixote/`_:
    The Quixote support Web site.

.. _`http://www.mems-exchange.org/software/quixote/`:
    http://www.mems-exchange.org/software/quixote/

`Sgrep home page`_

.. _`Sgrep home page`: http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html

`sgrep - search a file for a structured pattern`_:
    The ``sgrep`` man page.

.. _`sgrep - search a file for a structured pattern`:
    http://www.cs.helsinki.fi/u/jjaakkol/sgrepman.html

`re -- Regular expression operations`_:
    Documentation on Python's regular expression module.

.. _`re -- Regular expression operations`:
    http://www.python.org/doc/current/lib/module-re.html

`PySgrep - Python Wrappers for Sgrep`_:
    Detailed information on the Python extension module for
    ``sgrep``.

.. _`PySgrep - Python Wrappers for Sgrep`:
    http://www.rexx.com/~dkuhlman/pysgrep.html

`unittest`_:
    The Python unit testing framework.

.. _`unittest`: http://www.python.org/doc/current/lib/module-unittest.html