Module: urn:org:ten60:netkernel:ext:xhtml
The tagSoup accessor is exported by the urn:org:ten60:netkernel:ext:xhtml module.
Import this module to gain access to the accessor.
Syntax

Argument | Rules     | Description
operand  | Mandatory | The Tag Soup Resource
Example Usage

DPML

    <instr>
      <type>tagSoup</type>
      <operand>legacy.html</operand>
      <target>this:response</target>
    </instr>

NetKernel Foundation API

    req=context.createSubRequest("active:tagSoup");
    req.addArgument("operand", [resource representation, aspect, or URI] );
    result=context.issueSubRequest(req);

Purpose
The tagSoup accessor converts badly formed HTML to clean XML.
Below is John Cowan's description of how the Tag Soup parser works...
See also XHTML Tidy.
TagSoup - Just Keep On Truckin'
Introduction
This is the home page of TagSoup, a SAX-compliant parser written in Java
that, instead of parsing well-formed or valid XML, parses HTML as it is
found in the wild: nasty and brutish, though quite often far from short.
TagSoup is designed for people who have to process this stuff using some
semblance of a rational application design. By providing a SAX interface, it
allows standard XML tools to be applied to even the worst HTML.
TagSoup is free and Open Source software, licensed under the [1]Academic
Free License, a cleaned-up and patent-safe BSD-style license which allows
proprietary re-use. It's also licensed under the [2]GNU GPL, since
unfortunately the GPL and the AFL are incompatible.
_________________________________________________________________
TagSoup 1.0 Release Candidate 1 released!
TagSoup 1.0rc1 is the first release candidate for TagSoup 1.0. All
development work planned for 1.0 is complete. Bug reports are urgently
solicited so that I can get 1.0 released.
Improvements for this release include better JavaDoc and an extension to
TSSL, support for proper attribute normalization, namespace prefixes on
elements, and an expanded public API for schema components.
_________________________________________________________________
What TagSoup does
TagSoup is designed as a parser, not a whole application; it isn't intended
to permanently clean up bad HTML, as [3]HTML Tidy does, only to parse it on
the fly. Therefore, it does not convert presentation HTML to CSS or anything
similar. It does guarantee well-structured results: tags will wind up
properly nested, default attributes will appear appropriately, and so on.
The semantics of TagSoup are as far as practical those of actual HTML
browsers. In particular, never, never will it throw any sort of syntax
error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much
more. For example, if the first tag is LI, it will supply the application
with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers
assume in this situation. For the same reason, overlapping tags are
correctly restarted whenever possible: text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
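The restart behaviour shown above can be illustrated with a small stack-based sketch. It is a toy, not TagSoup's actual implementation: it handles only attribute-free tags, treats every element as restartable, and the names RestartSketch and rewrite are hypothetical. When an end tag arrives for an element that is open but not innermost, the sketch closes the elements nested inside it, closes the element itself, and then reopens the ones it had to close.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration of restarting overlapping tags; not TagSoup's real code. */
final class RestartSketch {
    // Matches simple start and end tags without attributes, e.g. <B> or </i>.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String rewrite(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();    // innermost open element is on top
        Matcher m = TAG.matcher(input);
        int pos = 0;
        while (m.find()) {
            out.append(input, pos, m.start());      // copy the text before this tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                // Start tag: remember it and emit it in lower case.
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // End tag for an open element: close everything nested inside it.
                List<String> restart = new ArrayList<>();
                String top;
                while (!(top = open.pop()).equals(name)) {
                    out.append("</").append(top).append('>');
                    restart.add(top);
                }
                out.append("</").append(name).append('>');
                // Reopen the interrupted elements, outermost first.
                for (int i = restart.size() - 1; i >= 0; i--) {
                    open.push(restart.get(i));
                    out.append('<').append(restart.get(i)).append('>');
                }
            }
            // An end tag with no matching open element is silently dropped.
        }
        out.append(input.substring(pos));
        // Close anything still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Given the overlapping input quoted above, RestartSketch.rewrite produces the rewritten markup quoted above (without the sentence's closing period).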
By intention, TagSoup is small and fast. Eventually, I will spend some time
making it faster if it turns out to be too slow. It does not depend on the
existence of any framework other than SAX, and should be able to work with
any framework that can accept SAX parsers. In particular, [4]XOM works.
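Because the parser is driven entirely through SAX, application code can be written against the standard org.xml.sax.XMLReader interface and remain unaware of which parser sits behind it. The following is a minimal sketch under that assumption; ElementLister and its methods are hypothetical names. So that it runs with nothing but the JDK, the helper jdkReader supplies the JDK's built-in XML parser, which accepts only well-formed input. With the TagSoup jar on the classpath, an instance of its parser class (org.ccil.cowan.tagsoup.Parser in the TagSoup distribution) could presumably be passed to list instead; that substitution is not exercised here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Collects element names, in document order, from any SAX XMLReader. */
final class ElementLister extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.add(qName);
    }

    /** Parses the markup with the given reader and returns the element names seen. */
    static List<String> list(XMLReader reader, String markup) {
        ElementLister handler = new ElementLister();
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.names;
    }

    /** The JDK's own XML parser, used here only so the sketch is self-contained. */
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }
}
```

Calling ElementLister.list(ElementLister.jdkReader(), "<ul><li>one</li><li>two</li></ul>") yields the names ul, li, li. The handler itself contains nothing parser-specific, which is the point of the design.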
You can replace the low-level HTML scanner with one based on Sean McGrath's
[5]PYX format (very close to James Clark's ESIS format). You can also supply
an AutoDetector that peeks at the incoming byte stream and guesses a
character encoding for it. (Otherwise, the platform default is used. If
someone supplies a good AutoDetector I may package it with later releases --
the Mozilla one is too big.)
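To make the idea of peeking at the byte stream concrete, here is a minimal sketch of one possible detection strategy: checking for a Unicode byte order mark and falling back to the platform default otherwise. It is an assumption-laden illustration, not TagSoup code. The text above does not give the AutoDetector interface, so the sketch is a standalone class with hypothetical names (BomSniffer, open, readAll), and a real detector would need to look at far more than a byte order mark.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Guesses an encoding from a leading byte order mark (BOM), if any. */
final class BomSniffer {
    static Reader open(InputStream in) {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        try {
            // Peek at up to three bytes; short streams may supply fewer.
            int n = 0;
            while (n < 3) {
                int r = pb.read(head, n, 3 - n);
                if (r < 0) break;
                n += r;
            }
            Charset cs = Charset.defaultCharset();   // fallback: platform default
            int skip = 0;                            // number of BOM bytes to discard
            if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF) {
                cs = StandardCharsets.UTF_8;
                skip = 3;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                cs = StandardCharsets.UTF_16BE;
                skip = 2;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                cs = StandardCharsets.UTF_16LE;
                skip = 2;
            }
            // Push back everything that was not part of a BOM.
            pb.unread(head, skip, n - skip);
            return new InputStreamReader(pb, cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Small helper for demonstrations: drains a Reader into a String. */
    static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        try {
            for (int c; (c = r.read()) >= 0; ) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The Reader returned by open could then be wrapped in a SAX InputSource and handed to the parser in place of the raw byte stream.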
_________________________________________________________________
The TSaxon XSLT-for-HTML processor
[6]I am also distributing [7]TSaxon, a repackaging of version 6.5.3 of
Michael Kay's Saxon XSLT implementation that includes TagSoup. TSaxon is a
drop-in replacement for Saxon, and can be used to process either HTML or XML
documents with XSLT stylesheets.
_________________________________________________________________
TagSoup as a stand-alone program
It is possible to run TagSoup as a program by saying java -Dswitch=true ...
-jar tagsoup-0.10.2 file .... Files mentioned on the command line will be
parsed individually. If no files are specified, the standard input is read.
The following switches are understood:
files
    Output into individual files, with html extensions changed to xhtml.
    Otherwise, all output is sent to the standard output.
html
    Output is in clean HTML: the XML declaration is suppressed, as are end-tags
    for the known empty elements.
pyx
    Output is in PYX format.
nons
    Namespaces are suppressed. Normally, all elements are in the XHTML 1.x
    namespace, and all attributes are in no namespace.
nobogons
    Bogons (unknown elements) are suppressed. Normally, they are treated as
    empty.
any
    Bogons are given a content model of ANY rather than EMPTY.
lexical
    Pass through HTML comments. Has no effect when output is in PYX format.
reuse
    Reuse a single instance of TagSoup parser throughout. Normally, a new one is
    instantiated for each input file.
nocdata
    Change the content models of the script and style elements to treat them as
    ordinary #PCDATA (text-only) elements, as in XHTML, rather than with the
    special CDATA content model.
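Switches given as -Dname=true reach a Java program as system properties. As a rough illustration of that mechanism, and not a copy of TagSoup's command-line code, the sketch below reads a switch with the standard Boolean.getBoolean call and derives the output name described for the files switch. The class and method names are hypothetical, and leaving names without an html extension unchanged is an assumption of this sketch.

```java
/** Illustrates reading -Dname=true switches and the "files" renaming rule. */
final class Switches {
    /** True when the JVM was started with -Dname=true (case-insensitive "true"). */
    static boolean isOn(String name) {
        return Boolean.getBoolean(name);
    }

    /** Maps legacy.html to legacy.xhtml; other names are returned unchanged (assumption). */
    static String outputName(String inputName) {
        String ext = ".html";
        if (inputName.endsWith(ext)) {
            return inputName.substring(0, inputName.length() - ext.length()) + ".xhtml";
        }
        return inputName;
    }
}
```

For example, when the JVM is started with -Dnons=true, Switches.isOn("nons") is true, and Switches.outputName("legacy.html") gives legacy.xhtml.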
_________________________________________________________________
More information
I gave a presentation (a nocturne, so it's not on the schedule) at Extreme
Markup Languages 2004 about TagSoup, updated from the one I presented in
2002 at the New York City XML SIG and at XML 2002. This is the main
high-level documentation about how TagSoup works. Formats: [8]OpenOffice.org
[9]Powerpoint [10]PDF.
[11]Download the TagSoup 1.0rc1 jar file here. It's about 39K long.
[12]Download the full TagSoup 1.0rc1 source here. If you don't have zip, you
can use jar to unpack it.
There is a [13]tagsoup-friends mailing list hosted at [14]Yahoo Groups. You
can [15]join via the Web, or by sending a blank email to
[16]tagsoup-friends-subscribe@yahoogroups.com. The [17]archives are open to
all.
_________________________________________________________________
References
1. http://www.opensource.org/licenses/afl-2.1.php
2. http://www.opensource.org/licenses/gpl-license.php
3. http://tidy.sf.net/
4. http://www.cafeconleche.org/XOM
5. http://gnosis.cx/publish/programming/xml_matters_17.html
6. http://www.ccil.org/~cowan
7. http://mercury.ccil.org/~cowan/XML/tagsoup/tsaxon
8. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup.sxi
9. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
10. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
11. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc1.jar
12. http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc1-src.zip
13. http://groups.yahoo.com/group/tagsoup-friends
14. http://groups.yahoo.com/
15. http://groups.yahoo.com/group/tagsoup-friends/join
16. mailto:tagsoup-friends-subscribe@yahoogroups.com
17. http://groups.yahoo.com/group/tagsoup-friends/messages
18. http://www.topxml.com/