tagSoup
John Cowan's TagSoup Parser - converts 'HTML' to clean XML

Module

urn:org:ten60:netkernel:ext:xhtml

The tagSoup accessor is exported by the urn:org:ten60:netkernel:ext:xhtml module. Import this module to gain access to the accessor.

Syntax

URI
active:tagSoup

Argument   Rules       Description
operand    Mandatory   The Tag Soup Resource

Example Usage

DPML

<instr>
  <type>tagSoup</type>
  <operand>legacy.html</operand>
  <target>this:response</target>
</instr>

NetKernel Foundation API

req=context.createSubRequest("active:tagSoup");
req.addArgument("operand", [resource representation, aspect, or URI] );
result=context.issueSubRequest(req);

Purpose

The tagSoup accessor converts malformed HTML to clean, well-formed XML. John Cowan's description of how the TagSoup parser works is reproduced below.

See also XHTML Tidy.

		
                        TagSoup - Just Keep On Truckin'

  Introduction

   This is the home page of TagSoup, a SAX-compliant parser written in Java
   that, instead of parsing well-formed or valid XML, parses HTML as it is
   found in the wild: nasty and brutish, though quite often far from short.
   TagSoup is designed for people who have to process this stuff using some
   semblance of a rational application design. By providing a SAX interface, it
   allows standard XML tools to be applied to even the worst HTML.
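
   As a minimal sketch of what "standard XML tools" means in practice, the
   handler below is ordinary SAX code that counts element names. The names
   ElementCounter, count and jdkReader are hypothetical and exist only for this
   illustration. The example drives the handler with the JDK's own XML parser so
   that it runs without extra libraries; with TagSoup on the classpath, the same
   handler should work unchanged when given TagSoup's XMLReader instead (assumed
   here to be org.ccil.cowan.tagsoup.Parser).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler: counts how often each element name starts.
final class ElementCounter extends DefaultHandler {
    final Map<String, Integer> counts = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        counts.merge(qName, 1, Integer::sum);
    }

    // Runs any SAX XMLReader over the markup and returns the element counts.
    // Assumption: a TagSoup parser instance could be passed here in place of
    // the JDK reader, since it is described as SAX-compliant.
    static Map<String, Integer> count(XMLReader reader, String markup) {
        try {
            ElementCounter handler = new ElementCounter();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return handler.counts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The JDK's built-in XML parser; it accepts only well-formed XML.
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

   The point of the design is that the handler never needs to know which parser
   produced the events, so code written for XML can be reused for HTML.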

   TagSoup is free and Open Source software, licensed under the [1]Academic
   Free License, a cleaned-up and patent-safe BSD-style license which allows
   proprietary re-use. It's also licensed under the [2]GNU GPL, since
   unfortunately the GPL and the AFL are incompatible.
     _________________________________________________________________

  TagSoup 1.0 Release Candidate 1 released!

   TagSoup 1.0rc1 is the first release candidate for TagSoup 1.0. All
   development work planned for 1.0 is complete. Bug reports are urgently
   solicited so that I can get 1.0 released.

   Improvements for this release include better JavaDoc and an extension to
   TSSL, support for proper attribute normalization, namespace prefixes on
   elements, and an expanded public API for schema components.
     _________________________________________________________________

  What TagSoup does

   TagSoup is designed as a parser, not a whole application; it isn't intended
   to permanently clean up bad HTML, as [3]HTML Tidy does, only to parse it on
   the fly. Therefore, it does not convert presentation HTML to CSS or anything
   similar. It does guarantee well-structured results: tags will wind up
   properly nested, default attributes will appear appropriately, and so on.

   The semantics of TagSoup are as far as practical those of actual HTML
   browsers. In particular, never, never will it throw any sort of syntax
   error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much
   more. For example, if the first tag is LI, it will supply the application
   with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers
   assume in this situation. For the same reason, overlapping tags are
   correctly restarted whenever possible: text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text

   gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text
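
   The restart behaviour in the example above can be sketched with a stack of
   open elements. This is an illustration of the idea only, not TagSoup's actual
   code: the class name OverlapRestartSketch and its rewrite method are
   hypothetical, attributes are ignored, and every element is treated as
   restartable, which is a simplification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: closes and restarts overlapping tags using a stack.
final class OverlapRestartSketch {
    // Matches only bare tags such as <b> or </I>; attributes are out of scope.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String rewrite(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // innermost open element on top
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());      // copy text between tags verbatim
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {            // start tag: remember and emit it
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {      // end tag for an element that is open
                List<String> restart = new ArrayList<>();
                String top;
                // Close everything nested inside the element being ended.
                while (!(top = open.pop()).equals(name)) {
                    out.append("</").append(top).append('>');
                    restart.add(0, top);           // keep outermost-first order
                }
                out.append("</").append(name).append('>');
                // Reopen the elements that were closed early.
                for (String r : restart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // End tags with no matching open element are silently dropped.
        }
        out.append(html, pos, html.length());
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

   Calling OverlapRestartSketch.rewrite on the input line shown earlier yields
   the rewritten line shown earlier. Note that nothing here can fail on bad
   input, which is in the spirit of "Just Keep On Truckin'".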

   By intention, TagSoup is small and fast. Eventually, I will spend some time
   making it faster if it turns out to be too slow. It does not depend on the
   existence of any framework other than SAX, and should be able to work with
   any framework that can accept SAX parsers. In particular, [4]XOM works.

   You can replace the low-level HTML scanner with one based on Sean McGrath's
   [5]PYX format (very close to James Clark's ESIS format). You can also supply
   an AutoDetector that peeks at the incoming byte stream and guesses a
   character encoding for it. (Otherwise, the platform default is used. If
   someone supplies a good AutoDetector I may package it with later releases --
   the Mozilla one is too big.)
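
   To illustrate what "peeking at the incoming byte stream" can look like, here
   is a minimal sketch that checks only for a Unicode byte order mark and
   otherwise falls back to a caller-supplied charset. It is a standalone class
   that does not implement TagSoup's AutoDetector interface; the names
   BomSniffingDetector, autoDetectingReader and decode are hypothetical, and a
   real detector would need far more heuristics (META tags, byte statistics).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: guesses an encoding from a byte order mark (BOM) only.
final class BomSniffingDetector {

    // Peeks at up to three bytes, picks a charset, and returns a Reader
    // positioned just after the BOM (if one was found).
    static Reader autoDetectingReader(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {                  // read may return fewer bytes than asked
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        Charset cs = fallback;
        int skip = 0;                              // number of BOM bytes to discard
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            skip = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            skip = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            skip = 2;
        }
        if (n > skip) {
            pin.unread(head, skip, n - skip);      // give back the non-BOM bytes
        }
        return new InputStreamReader(pin, cs);
    }

    // Convenience helper for experiments: decodes a whole byte array.
    static String decode(byte[] bytes, Charset fallback) {
        try (Reader r = autoDetectingReader(new ByteArrayInputStream(bytes), fallback)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

   The fallback parameter plays the role that the platform default plays in the
   description above: it is used whenever the peeked bytes give no hint.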
     _________________________________________________________________

  The TSaxon XSLT-for-HTML processor

   [6]I am also distributing [7]TSaxon, a repackaging of version 6.5.3 of
   Michael Kay's Saxon XSLT implementation that includes TagSoup. TSaxon is a
   drop-in replacement for Saxon, and can be used to process either HTML or XML
   documents with XSLT stylesheets.
     _________________________________________________________________

  TagSoup as a stand-alone program

   It is possible to run TagSoup as a program by saying:

   java -Dswitch=true ... -jar tagsoup-0.10.2 file ...

   Files mentioned on the command line will be parsed individually. If no
   files are specified, the standard input is read.

   The following switches are understood:

   files
      Output into individual files, with html extensions changed to xhtml.
      Otherwise, all output is sent to the standard output.

   html
      Output is in clean HTML: the XML declaration is suppressed, as are
      end-tags for the known empty elements.

   pyx
      Output is in PYX format.

   nons
      Namespaces are suppressed. Normally, all elements are in the XHTML 1.x
      namespace, and all attributes are in no namespace.

   nobogons
      Bogons (unknown elements) are suppressed. Normally, they are treated as
      empty.

   any
      Bogons are given a content model of ANY rather than EMPTY.

   lexical
      Pass through HTML comments. Has no effect when output is in PYX format.

   reuse
      Reuse a single instance of TagSoup parser throughout. Normally, a new
      one is instantiated for each input file.

   nocdata
      Change the content models of the script and style elements to treat
      them as ordinary #PCDATA (text-only) elements, as in XHTML, rather than
      with the special CDATA content model.
     _________________________________________________________________

  More information

   I gave a presentation (a nocturne, so it's not on the schedule) at Extreme
   Markup Languages 2004 about TagSoup, updated from the one I presented in
   2002 at the New York City XML SIG and at XML 2002. This is the main
   high-level documentation about how TagSoup works. Formats: [8]OpenOffice.org
   [9]Powerpoint [10]PDF.

   [11]Download the TagSoup 1.0rc1 jar file here. It's about 39K long.

   [12]Download the full TagSoup 1.0rc1 source here. If you don't have zip, you
   can use jar to unpack it.

   There is a [13]tagsoup-friends mailing list hosted at [14]Yahoo Groups. You
   can [15]join via the Web, or by sending a blank email to
   [16]tagsoup-friends-subscribe@yahoogroups.com. The [17]archives are open to
   all.
     _________________________________________________________________


References

   1. http://www.opensource.org/licenses/afl-2.1.php
   2. http://www.opensource.org/licenses/gpl-license.php
   3. http://tidy.sf.net/
   4. http://www.cafeconleche.org/XOM
   5. http://gnosis.cx/publish/programming/xml_matters_17.html
   6. http://www.ccil.org/~cowan
   7. http://mercury.ccil.org/~cowan/XML/tagsoup/tsaxon
   8. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup.sxi
   9. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
  10. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
  11. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc1.jar
  12. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc1-src.zip
  13. http://groups.yahoo.com/group/tagsoup-friends
  14. http://groups.yahoo.com/
  15. http://groups.yahoo.com/group/tagsoup-friends/join
  16. mailto:tagsoup-friends-subscribe@yahoogroups.com
  17. http://groups.yahoo.com/group/tagsoup-friends/messages
  18. http://www.topxml.com/
© 2003-2007, 1060 Research Limited. 1060 registered trademark, NetKernel trademark of 1060 Research Limited.