Regardless of what the encoding is for your documents, an XSL engine can convert the output to a different encoding if you need it. When the document is loaded into memory, XML applications such as XSLT engines convert it to Unicode. The XSL engine then uses the stylesheet templates to create a transformed version of the content in memory structures. When it is done, it serializes the internal content into a stream of bytes that it feeds to the outside world. During the serialization process, it can convert the internal Unicode to some other encoding for the output.
An XSL stylesheet usually sets the output encoding in an xsl:output element at the top of the stylesheet file. The following shows that element for the html/docbook.xsl stylesheet:
<xsl:output method="html" encoding="ISO-8859-1" indent="no"/>
The encoding="ISO-8859-1" attribute means all documents processed with that stylesheet are to be output with the ISO-8859-1 encoding. If a stylesheet's xsl:output element does not have an encoding attribute, then the default output encoding is UTF-8. That is what the fo/docbook.xsl stylesheet for print output does.
When the output method="html", the XSLT processor also adds an HTML META tag that identifies the HTML file's encoding:
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
When a browser opens the HTML file, it reads this tag and knows that the bytes it finds in the file map to the ISO-8859-1 character set for display. What if the document contains characters that are not available in the specified output encoding? As with input, the characters are expressed as numerical character references such as &#8482;. It is up to the browser to figure out how to display such characters. Most browsers cover a wide range of character references, but there are so many characters that sometimes a browser does not have a way to display a given one.
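The fallback to character references can be observed with a small test stylesheet. The following is only a sketch: the template and its sample text are hypothetical and not part of the DocBook stylesheets. It writes one character that ISO-8859-1 can represent and one that it cannot.

```xml
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Same output settings as the stock html/docbook.xsl element -->
  <xsl:output method="html" encoding="ISO-8859-1" indent="no"/>

  <!-- Hypothetical template: ignores the input and emits one paragraph -->
  <xsl:template match="/">
    <html>
      <body>
        <!-- U+00E9 (e-acute) exists in ISO-8859-1; U+2122 (trademark) does not -->
        <p>Caf&#233; Example&#8482;</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Because the trademark sign has no ISO-8859-1 byte, the serializer has to write it as a reference of some kind. Whether that reference is named, decimal, or hexadecimal depends on the processor, and some processors also write the e-acute as a reference in HTML output.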
Most modern graphical browsers can display HTML files encoded with UTF-8, which covers a much wider set of characters than ISO-8859-1. To change the output encoding for the non-chunking docbook.xsl stylesheet, you have to use a stylesheet customization layer. That is because the XSLT 1.0 specification does not permit the encoding attribute of xsl:output to be a variable or parameter value. Your stylesheet customization must provide a new <xsl:output> element such as the following:
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:import href="/path/to/html/docbook.xsl"/>
  <xsl:output method="html"
              encoding="UTF-8"
              indent="no"/>
</xsl:stylesheet>
This is a complete stylesheet customization that you can save in a file such as docbook-utf8.xsl and use in place of the stock html/docbook.xsl stylesheet. All it does is import the stock stylesheet and set a new output encoding, in this instance to UTF-8. Any HTML files generated with this stylesheet will have their characters encoded as UTF-8, and each file will include a meta tag like the following:
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
Changing the output encoding of the chunking stylesheet is much easier. It can be done with the chunker.output.encoding parameter, either on the command line or in a customization layer. That's because the chunking stylesheet uses EXSLT extensions to generate HTML files. See the section “Output encoding for chunk HTML” for more information.
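As a minimal sketch of the customization-layer approach, the following assumes the chunking stylesheet is the html/chunk.xsl file of the DocBook XSL distribution and reuses the /path/to/ placeholder from the earlier example. The file name you save it under is up to you.

```xml
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Import the stock chunking stylesheet; /path/to/ is a placeholder -->
  <xsl:import href="/path/to/html/chunk.xsl"/>

  <!-- Encoding the chunker uses when it writes each HTML file -->
  <xsl:param name="chunker.output.encoding" select="'UTF-8'"/>

</xsl:stylesheet>
```

No xsl:output element is needed here, because the parameter alone controls the encoding of the chunked files. On the command line the same value would be passed as a string parameter, for example with xsltproc's --stringparam chunker.output.encoding UTF-8 option.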
If you are using the Saxon processor with the chunking stylesheet for non-English HTML output, then you may want to set the stylesheet parameter saxon.character.representation to a value of 'native;decimal'. By default, this parameter (which is defined in html/chunker.xsl) is set to 'entity;decimal'. The default value of entity before the semicolon means that any non-ASCII characters within the encoding are converted to named entity references such as &aacute; instead of being written as the character code for that encoding. For example, when using the iso-8859-1 output encoding, this means one native character is replaced by the 8 ASCII characters that form the named entity reference, which makes your files considerably larger. When entity is replaced with native, the single character code of the encoding is output. Note that when the output encoding is UTF-8 and the parameter value uses native, then no entity references will be output because there are no XML characters outside of UTF-8.
The value after the semicolon controls how characters that are not in the encoding are output by Saxon. They must be converted to some kind of entity reference, and the value can be entity (named entity reference such as &aacute; if one exists), decimal (decimal numerical character reference such as &#225;), or hex (hexadecimal numerical character reference such as &#xE1;). Saxon outputs named entity references only for characters in ISO-8859-1, not for all DocBook named character entities.
If you are using the chunking stylesheet, then you can use this parameter to set the Saxon output character representation. If you are using the non-chunking stylesheet, then your customization of xsl:output as described above needs to be enhanced as follows:
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:saxon="http://icl.com/saxon"
                extension-element-prefixes="saxon">
  <xsl:import href="file:///c:/docbook/xsl/html/docbook.xsl"/>
  <!-- saxon:character-representation is a Saxon extension attribute -->
  <xsl:output method="html"
              encoding="UTF-8"
              indent="no"
              saxon:character-representation="native;decimal"/>
</xsl:stylesheet>
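For comparison, the chunking stylesheet needs only parameters. The following sketch assumes the chunking stylesheet is html/chunk.xsl and uses a /path/to/ placeholder for its location. The UTF-8 setting is optional and is shown only to mirror the example above.

```xml
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Import the stock chunking stylesheet; /path/to/ is a placeholder -->
  <xsl:import href="/path/to/html/chunk.xsl"/>

  <!-- Optional: encoding of the generated chunk files -->
  <xsl:param name="chunker.output.encoding" select="'UTF-8'"/>

  <!-- Overrides the 'entity;decimal' default set in html/chunker.xsl -->
  <xsl:param name="saxon.character.representation"
             select="'native;decimal'"/>

</xsl:stylesheet>
```

The saxon.character.representation parameter takes effect only when Saxon is the processor, so this customization should be harmless with other processors.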
DocBook XSL: The Complete Guide - 4th Edition | PDF version available | Copyright © 2002-2007 Sagehill Enterprises |