Special characters

	Chapter 20. Languages, characters and encoding

XML is based on Unicode, which contains thousands of characters and symbols. A given XML document specifies its encoding, which is a mapping of characters to bytes. But not all encodings include all Unicode characters. Also, your keyboard may not enable you to directly enter all characters in the encoding. Any characters you cannot enter directly are entered as entities, which consist of an ampersand, followed by a name, followed by a semicolon.

There are two kinds of character entities:

numerical character references: An entity name that consists of a # followed by the Unicode number for the character, such as `&#225;`. The number can be expressed as a decimal number such as `&#34;`, or a hexadecimal number (which is indicated using x as a prefix), such as `&#x22;`. Both of these examples denote the quotation mark.
named character entities: A readable name such as `&trade;` can be assigned to represent any Unicode character.

The following table shows examples of some characters expressed in both kinds of entities.

Table 20.2. Examples of character references

Character	Decimal character reference	Hexadecimal character reference	Named entity
á	`&#225;`	`&#x00E1;`	`&aacute;`
ß	`&#223;`	`&#x00DF;`	`&szlig;`
©	`&#169;`	`&#x00A9;`	`&copy;`
¥	`&#165;`	`&#x00A5;`	`&yen;`
±	`&#177;`	`&#x00B1;`	`&plusmn;`
✓	`&#10003;`	`&#x2713;`	`&check;`

Note

Leading zeros can be omitted in numerical references. So `&#x00E1;` is the same as `&#xE1;`.
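As a small illustration of the note above, the following fragment writes the same character four ways. The `para` wrapper is only an assumed context; any element that allows text would do. A conforming XML parser should hand the application four identical á characters.

```xml
<!-- Four spellings of U+00E1 (LATIN SMALL LETTER A WITH ACUTE):
     decimal, decimal with a leading zero, hexadecimal, hexadecimal with leading zeros. -->
<para>&#225; &#0225; &#xE1; &#x00E1;</para>
```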

The set of “numerical entities” (character references) is defined as part of the Unicode standard adopted by XML, but the names used for named entities are not standardized. There are several standard named character entity sets defined by ISO, however, and these are incorporated into the DocBook DTD. Among the collection of files in the DocBook XML DTD distribution, there is an ent subdirectory that contains a set of files that declare named entities. Each declaration looks like the following:

<!ENTITY  plusmn  "&#x00B1;">    <!-- PLUS-MINUS SIGN -->

This declaration assigns the numerical character reference `&#x00B1;` to the plusmn entity name. Either `&#x00B1;` or `&plusmn;` can be used in your DocBook document to represent the plus-minus sign.

Note

If you use DocBook named character entities in your document, you must also make sure your document's DOCTYPE properly identifies the DocBook DTD, and the processor must be able to find and load the DTD. If not, then the entities will be considered unresolved and the processing will fail.
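A minimal sketch of a document that meets this requirement is shown below. It assumes DocBook XML V4.5 and the public OASIS location of the DTD; your own documents may use a different version, a local path, or a catalog that maps the identifiers to local files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DOCTYPE names the DocBook DTD, which in turn declares plusmn. -->
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<article>
  <title>Tolerances</title>
  <!-- The named entity and the numerical reference produce the same character. -->
  <para>The measured value was 5&plusmn;0.1 mm, or equivalently 5&#x00B1;0.1 mm.</para>
</article>
```

If the processor cannot reach the system identifier and no catalog is configured, the `&plusmn;` reference is the part that fails; the numerical reference does not depend on the DTD.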

Use this reference to look up DocBook entities that you need in your documents:

DocBook Character Entity Reference

If you use the Emacs text editor, then you might want to check out Norm Walsh's Emacs extensions for DocBook, which include a selector for special characters. See http://nwalsh.com/emacs/xmlchars/.

A Unicode reference such as Unicode Code Charts can be used to look up numerical character references. Most online references use PDF to display the characters because most browsers cannot display all of Unicode.

Special characters in output

When an XSLT processor reads an entity from the input stream, it tries to resolve the entity. If it is a numerical character reference, it converts it to a Unicode character in memory. If it is a named entity, it checks the list of entity names that were loaded from the DTD. For the DocBook named character entities, these resolve to numerical character references. Then these numerical character references are also converted to Unicode characters in memory. All the characters in the document are handled as Unicode characters in memory during processing.

See the section “Output encoding” for a description of how special characters in memory are converted to output characters.

Space characters

The Unicode standard includes several characters that represent spaces that serve different purposes. Below is a table that summarizes most of them.

Table 20.3. Space characters

Unicode name	Character reference	DocBook entity	Description
SPACE	`&#x0020;`		Ordinary space.
NO-BREAK SPACE	`&#x00A0;`	`&nbsp;`	Space that may not be broken at the end of a line.
NARROW NO-BREAK SPACE	`&#x202F;`		Thinner than NO-BREAK SPACE.
EN QUAD	`&#x2000;`		Same as EN SPACE.
EM QUAD	`&#x2001;`		Same as EM SPACE.
EN SPACE	`&#x2002;`	`&ensp;`	Half an EM SPACE.
EM SPACE	`&#x2003;`	`&emsp;`	Usually a space equal to the type size in points.
THREE-PER-EM SPACE	`&#x2004;`	`&emsp13;`	One-third of an EM SPACE, called a thick space.
FOUR-PER-EM SPACE	`&#x2005;`	`&emsp14;`	One-fourth of an EM SPACE, called a mid space.
SIX-PER-EM SPACE	`&#x2006;`		One-sixth of an EM SPACE, similar to thin space.
FIGURE SPACE	`&#x2007;`	`&numsp;`	Width of a digit in some fonts.
PUNCTUATION SPACE	`&#x2008;`	`&puncsp;`	Space equal to the narrow punctuation of a font.
THIN SPACE	`&#x2009;`	`&thinsp;`	One-fifth of an EM SPACE, usually.
HAIR SPACE	`&#x200A;`	`&hairsp;`	Thinner than a thin space.
ZERO WIDTH SPACE	`&#x200B;`		No visible space. Used to allow a line break in a word without generating a hyphen.
ZERO WIDTH NO-BREAK SPACE	`&#xFEFF;`		No visible space. Used to prevent a line break, as for example, after the hyphen in a word that contains its own hyphen.

Not all of these characters may work in the various output forms. For example, the characters in the Unicode range `&#x2000;` to `&#x200A;` represent spaces of different widths. However, many print fonts do not support all of the characters in that range. So the print stylesheet uses a set of templates in the stylesheet module fo/spaces.xsl to convert them to fo:leader elements of different lengths. The length unit used for the leaders is em, which scales to the current font size. If you need to customize the width of any of those special space characters, you can change the stylesheet parameters that are defined in that module.
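To make the conversion concrete, the fragment below sketches what an en space between "10" and "kg" might look like after it has been turned into a leader. It is an illustration only: the attributes the stylesheet actually emits may differ, and the 0.5em length is simply taken from the table above, where an en space is half an em.

```xml
<!-- Hypothetical FO output for the source text 10&#x2002;kg.
     The space character is replaced by a blank leader of fixed length. -->
<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format">10<fo:leader leader-pattern="space" leader-length="0.5em"/>kg</fo:block>
```

Because the length is expressed in em, the gap grows and shrinks with the font size of the surrounding text.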

In HTML output, these characters will pass through the stylesheet and be rendered in the output encoding for the HTML file. However, some browsers do not support all of these space characters, and may display a placeholder character instead of the intended space.

Missing characters

When a DocBook document is processed with an XSLT processor, you may find that some or all special characters are missing in the output. There are several possible causes for this problem.

If only one entity is not resolving, it may simply be entered wrong. If you misspell a named entity, it will not resolve. If you enter a numerical character reference wrong, the number may resolve to a range in Unicode that does not have printable characters. Also, if you intend to enter a hexadecimal value, be sure to include the x prefix or it will be interpreted as a decimal value.
If you are using named character entities, the DocBook XML DTD must be available to the processor. That's because the named entities are defined in the DTD, and the processor will not know what the names mean unless it can load the DTD. Most XSLT processors do not do full validation, but they do load the entities defined in the DTD. If the DTD is not available, the processor may continue processing the document as a well-formed document, but it will not be able to resolve the named character entities.
If the output encoding does not include the character, the XSLT processor should convert it to a numerical character reference. Then the downstream viewer or processor must be able to handle such references. For example, old browsers may not recognize the full range of numerical character references in Unicode.
The output medium may not have a font loaded that can display a special character. For example, a PDF file that does not contain embedded fonts relies on the system to supply requested fonts. If the font in use does not have a given character, it may not show up in the display or may display as #. If you are using Windows, you can use the Character Map application to investigate whether a given font has a certain Unicode character. With HTML, when a system viewing an HTML file does not have a screen font installed for a given encoding, it may not display all characters.
If you are doing XSL-FO processing, then the font currently being used may not contain all special characters. See the section “FO font-family list” to configure your output to use the font-selection-strategy property to search multiple fonts for a character. If you are using FOP, then this scheme does not work (as of version 0.93) because it does not support the font-selection-strategy property. See the section “Switching to Symbol font” for a workaround for this problem.
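For the case where the DTD cannot be loaded, one possible workaround (not part of the stylesheets, and offered here only as a sketch) is to declare the few named entities you need in the document's internal DTD subset. The entity names and values below match the ISO declarations discussed earlier; the article content is made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article [
  <!-- Declared locally, so no external DTD lookup is needed for these names. -->
  <!ENTITY plusmn "&#x00B1;">  <!-- PLUS-MINUS SIGN -->
  <!ENTITY check  "&#x2713;">  <!-- CHECK MARK -->
]>
<article>
  <title>Entity test</title>
  <para>Tolerance: 5&plusmn;0.1 mm &check;</para>
</article>
```

The trade-off is that a DOCTYPE without an external identifier no longer points a validating tool at the DocBook DTD, so this form suits quick tests better than production documents.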

FO font-family list

A given font has a fixed character set that may not include all special characters. The XSL-FO standard has a font-selection-strategy property that lets the processor search through a list of font-families to find a given character. By default, the DocBook XSL stylesheets set font-selection-strategy="character-by-character" which enables that feature.

To specify the fonts for the processor to search, a font-family attribute in the FO output must contain a comma-separated list of font family names to search. The DocBook FO stylesheet automatically generates a list of fonts wherever the body font is called for. That list is made up of the value of the body.font.family parameter and the symbol.font.family parameter. So the default list is serif,Symbol,ZapfDingbats, and it is stored in the internal body.fontset parameter. A similar list is created using the title.font.family parameter and stored in the internal title.fontset parameter. If a special character is not in the body font, then the Symbol font is searched, and then the ZapfDingbats font. You can expand that list by adding more font names to the symbol.font.family parameter, assuming those extra fonts are configured into your processor.
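The fragment below sketches how the default list could appear in the generated FO. Which element the stylesheet puts these attributes on is an assumption made for the example; the attribute values are the defaults described above.

```xml
<!-- Illustrative FO fragment: the processor tries serif first, then Symbol,
     then ZapfDingbats, choosing a font separately for each character. -->
<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format"
          font-family="serif,Symbol,ZapfDingbats"
          font-selection-strategy="character-by-character">Tolerance: 5&#x00B1;0.1 mm &#x2713;</fo:block>
```

In a setup like this, a character that is missing from the serif font, such as the check mark, might be picked up from one of the later fonts in the list, provided the FO processor supports the property and has those fonts configured.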

Since XML supports any Unicode character, you may want to add a Unicode font to the list to catch any characters not in any of the others. On Windows, the Lucida Sans Unicode font is available, and the Arial Unicode MS font can be downloaded for free. You can add the font by modifying the symbol.font.family parameter as follows:

<xsl:param name="symbol.font.family" select="'Symbol,ZapfDingbats,Lucida Sans Unicode'"/>

HTML encoding

For a browser to know what encoding an HTML file is written in, it must be told. If a browser is not told, then it may guess wrong and render many characters wrong. The encoding can be communicated to the browser in two ways for HTML files:

A META tag embedded in the HTML HEAD element of the file:

<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

An encoding instruction in the HTTP header that accompanies the HTML file. The HTTP header is not in the HTML, but is sent by the HTTP server before the file and is hidden from the viewer. It provides information to the browser about the file.
```
Content-type: text/html; charset=UTF-8
```

It is best if both of these methods are used, and of course they must both agree with the actual file encoding. Often you do not have control of the HTTP header, so using the META element is the only option.
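When the HTML is produced by an XSLT processor, the META element usually does not need to be written by hand. The customization layer below is a minimal sketch for single-file HTML output: it assumes the stock stylesheet is reachable at the SourceForge URL shown, and it relies on the XSLT 1.0 rule that the html output method should add a META element naming the encoding actually used. Chunked output is generally controlled through the chunker.output.encoding parameter rather than through xsl:output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Assumed location of the stock single-file HTML stylesheet;
       adjust the href to match your installation. -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl"/>

  <!-- Ask the processor to serialize the result as UTF-8. With method="html",
       the processor should also write a matching META element into HEAD. -->
  <xsl:output method="html" encoding="UTF-8" indent="no"/>

</xsl:stylesheet>
```

Setting the encoding in one place this way keeps the bytes in the file and the META declaration from drifting apart.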

For XHTML output, there is a third avenue to convey the encoding. Because the output is XML, it should have an XML declaration, and the declaration should also contain the encoding:

<?xml version="1.0" encoding="UTF-8"?>
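Put together, a small XHTML file that states its encoding both ways might look like the sketch below. The XHTML 1.0 Transitional DOCTYPE and the page content are assumptions chosen for the example; what matters is that the XML declaration and the meta element name the same encoding, and that the file is actually saved in it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <!-- Must name the same encoding as the XML declaration above. -->
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <title>Encoding test</title>
  </head>
  <body>
    <p>5&#x00B1;0.1 mm</p>
  </body>
</html>
```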

Odd characters in HTML output

If you are seeing odd accented characters when you browse your HTML output, then you probably have an encoding problem. You are seeing special characters that are encoded one way being misinterpreted by the browser as a different encoding. For example, if a file is encoded as UTF-8 and the browser thinks it is ISO-8859-1, then a special character such as an em-dash will appear as the three characters “â€”” in the browser. More commonly, a nonbreaking space character in UTF-8 will appear as “Â ” when viewed as ISO-8859-1.

The previous section describes how to set the encoding in the HTML. But if the HTTP server delivering the HTML gets the encoding wrong, then the browser may use the HTTP header instead of the hints within the HTML. If the odd characters become normal when the HTML file is browsed as a local file, then it is likely an HTTP server issue. For example, an Apache server might have an AddDefaultCharset directive that sets the default encoding for all files to iso-8859-1. If you cannot fix the Apache server configuration, then you could try adding a .htaccess file to your HTML directory and putting your own AddDefaultCharset directive in it. See the Apache documentation for more details.

Switching to Symbol font

If you are generating PDF output and you find certain special characters are missing, then the problem might be that the body font does not contain the character you need. Many special characters such as math symbols are in the Symbol font, and are not in Times or Helvetica. Your FO processor may not be able to switch to the Symbol font for a single character.

The following is a customization that you can use to coerce a special character into the Symbol font.

<xsl:template match="symbol[@role = 'symbolfont']">
  <fo:inline font-family="Symbol">
    <xsl:call-template name="inline.charseq"/>
  </fo:inline>
</xsl:template>

With this added to your customization layer, you can mark up a special character such as ≤ (`&le;`, less than or equal to) to be coerced into the Symbol font with `<symbol role="symbolfont">&le;</symbol>`. The customization wraps an fo:inline element around the character to explicitly switch in and out of Symbol font for that character.
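For readers starting from scratch, the sketch below places the same template in a complete customization layer. It assumes non-namespaced DocBook 4 markup and the SourceForge location of the stock FO stylesheet. Note that the fo prefix must be declared on the xsl:stylesheet element; without that declaration the stylesheet is not namespace-well-formed and will not load.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                version="1.0">

  <!-- Assumed location of the stock FO stylesheet;
       adjust the href to match your installation. -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

  <!-- Wrap flagged symbol elements in an explicit Symbol-font inline. -->
  <xsl:template match="symbol[@role = 'symbolfont']">
    <fo:inline font-family="Symbol">
      <xsl:call-template name="inline.charseq"/>
    </fo:inline>
  </xsl:template>

</xsl:stylesheet>
```

Only symbol elements that carry role="symbolfont" are affected, so other uses of symbol in the document keep the body font.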

