Advanced topic: Improving the PDF

How the PDF is built

First you have to understand how the PDF is built. Contrary to the HTML generation, this is a two-step process:

1. The DocBook XML source is converted to a Formatting Objects (FO) file. FO – formally called XSL-FO – is also an XML format, but unlike DocBook it's presentation-oriented. This step is performed by an XSL transformer called Saxon. The output goes into manual/inter/filename.fo.
2. Another tool, Apache FOP (Formatting Objects Processor), then picks up filename.fo and converts it to filename.pdf, which is stored in manual/dist/pdf.

If you give a build pdf command, two consecutive build targets are called internally: fo and fo2pdf, corresponding to the two steps described above. But you can also call each of them yourself from the command line. For instance,

build fo -Drootid=qsg15

...transforms the 1.5 Quick Start Guide source to manual/inter/qsg15.fo. And

build fo2pdf -Drootid=qsg15

...produces the PDF from the FO file (which must of course be present for this step to succeed).

In fact, build pdf is just a shortcut for build fo followed by build fo2pdf.

This setup allows us to edit the FO file manually before generating the final PDF. And that's exactly what we're going to do to fix some of those nasty problems that can spoil our PDFs.
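The relationship between the three targets can be sketched in a few lines of Python. This is only an illustration: the helper name build_commands and its fo_edited flag are made up for this example and are not part of the build system.

```python
def build_commands(rootid, fo_edited=False):
    """Return the build invocations for one document, in the order they run.

    fo_edited=False mirrors "build pdf": the fo target, then fo2pdf.
    fo_edited=True leaves out the fo target, so a hand-edited FO file
    is not regenerated (and the edits are not overwritten).
    """
    targets = ["fo2pdf"] if fo_edited else ["fo", "fo2pdf"]
    return ["build {} -Drootid={}".format(target, rootid) for target in targets]


# The two commands that "build pdf -Drootid=qsg15" stands for:
for command in build_commands("qsg15"):
    print(command)
```

With fo_edited=True only the fo2pdf command remains, which is the form you want once you have started editing the FO file by hand.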

General repair scheme

The general procedure for improving the PDF output by editing the FO file is:

1. Build the PDF once as usual with build pdf [arguments].
2. Start reading the PDF and find the first trouble spot.
3. Open the FO file in an XML or text editor.
4. Find the location in the FO file that corresponds to the trouble spot in the PDF (we'll show you how later).
5. Edit the FO file to fix the problem (we'll show you how later), and save it.
6. Rebuild the PDF, but this time use build fo2pdf [arguments]. If you use build pdf instead, the FO file is regenerated: you'll overwrite the changes you've just made, get the same PDF as before, and have to start all over again.
7. Check if the problem is really solved and if so, find the next trouble spot in the PDF.
8. Repeat steps 4–7 until you've worked your way through the entire PDF.

Notes

Although this FO-editing approach suggests that the problem lies in the FO file, this is not the case. The FO file is all right, but Apache FOP doesn't support all the nice features in the XSL-FO specification (yet). With our manual editing, we force the PDF in a certain direction.
It is important to fix the problems in document order. Editing the FO in one spot may lead to vertical adjustments at the corresponding spot in the PDF: more lines, fewer lines, lines moving to the following page, etc. These adjustments may affect everything that comes after that spot.

For the same reason, you should always look for the next problem after you have fixed the previous one. For instance, don't make a list of all widowed headers in the PDF and then start fixing them all in the FO file. Fixing a widowed header moves all the text below it downward, possibly creating new widowed headers and un-widowing others.
In general, you can keep the FO file open throughout the process. Just don't forget to save your changes before you rebuild the PDF. You must close the PDF before every rebuild though: once it's opened in Adobe Acrobat (or even in Adobe Reader), other processes can't write to it.
The entire process can be pretty time-consuming, so don't try to fix every tiny little imperfection, especially if you're a beginning FO hacker. In general, only widowed headers are really ugly and make the document look unprofessional.

The next section deals with the various problems and how to solve them.

Common problems and their solutions

Widowed headers

Problem: Headers or titles at the bottom of the page.

Cause: Apache FOP doesn't support the keep-with-next attribute yet, except in table rows.

Solution: Force a page break at the start of the element (usually a section, but it may also be a list, table, or other element) that the title or header belongs to.

How: If the element has an id attribute (you can see this in the DocBook source), do a search on the id in the FO file. As an example, suppose that you've just built the Firebird 2 Quick Start Guide and you find that the title Creating a database using isql is positioned at the bottom of a page. In the DocBook XML source you can see that this is the title of a section whose id is qsg2-databases-creating. If you search on qsg2-databases-creating from the top of the file, your first hit will probably look like this:

<fox:outline internal-destination="qsg2-databases-creating">

The fox:outline elements correspond to the links in the navigation frame on the left side of the PDF. So this is not yet the section itself; you'll have to look further. Next find:

<fo:block text-align-last="justify" end-indent="24pt"
          last-line-end-indent="-24pt"><fo:inline
   keep-with-next.within-line="always"><fo:basic-link
   internal-destination="qsg2-databases-creating">Creating a database...

Here, the id is an attribute value in a fo:basic-link. We're in the Table of Contents now. Still not there.

The third and fourth finds are often a couple of lines below the second, to create the link from the page number citation in the ToC. But the fifth is usually the one we're looking for (unless there are any more forward links to the section in question):

<fo:block id="qsg2-databases-creating">

That's it! Most mid- and low-level hierarchical elements in DocBook (preface, section, appendix, para etc.) wind up as a fo:block in the FO file. Now we have to tell Apache FOP that it must start this section on a new page. Edit the line like this:

<fo:block id="qsg2-databases-creating" break-before="page">

Save the change and rebuild the PDF (remember: use build fo2pdf, not build pdf). The section title will now appear at the top of the following page, and you can move on to the next problem.
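If you'd rather not hunt through all those hits by hand, the same edit can be sketched in Python. This is a minimal sketch, not a tool that ships with the manual: the function name add_page_break is made up, it treats the FO file as plain text, and it assumes the id sits on a fo:block as described above. Because it only matches an id attribute on a fo:block start tag, the fox:outline and fo:basic-link hits (which use internal-destination) are skipped automatically.

```python
import re


def add_page_break(fo_text, block_id):
    """Add break-before="page" to the fo:block start tag carrying block_id.

    Returns the modified text. The text is returned unchanged if that tag
    already has a break-before attribute.
    """
    # Match '<fo:block ... id="block_id"' inside a single start tag.
    pattern = re.compile(
        r'<fo:block(?=\s)[^>]*?\sid="%s"[^>]*>' % re.escape(block_id)
    )
    match = pattern.search(fo_text)
    if match is None:
        raise ValueError("no fo:block with id %r found" % block_id)
    tag = match.group(0)
    if "break-before=" in tag:
        return fo_text
    # Insert the new attribute directly after the id attribute.
    id_end = match.start() + tag.index('id="' + block_id + '"') + len(block_id) + 5
    return fo_text[:id_end] + ' break-before="page"' + fo_text[id_end:]


sample = (
    '<fox:outline internal-destination="qsg2-databases-creating">\n'
    '<fo:basic-link internal-destination="qsg2-databases-creating">'
    'Creating a database</fo:basic-link>\n'
    '<fo:block id="qsg2-databases-creating">\n'
)
print(add_page_break(sample, "qsg2-databases-creating"))
```

To use it on a real file you would read the FO file as text, pass it through the function, and write the result back before running build fo2pdf.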

What if the element has no DocBook id? You'll have to search on (part of) the title/header then. This is a bit trickier, because the title may contain a line break in the FO file, in which case it won't be found. Or the title element has one or more children of its own (e.g. quote or emphasis). This too will keep you from finding it if you search on the full title. On the other hand: the more you shrink the search term, the higher the probability that you will get a number of unrelated hits. You'll have to use your own judgement here; if there is some characteristic text shortly before or after the title you can also search on that, and try to locate the title in the lines above and below it.

No matter how, once you've found the title, go upward in the FO file until you find the beginning of the section – often identifiable by the auto-generated FO id:

</fo:block>
<fo:block id="d0e2340">
  <fo:block>
    <fo:block>
      <fo:block keep-together="always" margin-left="0pc"
                font-family="sans-serif,Symbol,ZapfDingbats">
        <fo:block keep-with-next.within-column="always">
          <fo:block font-family="sans-serif" font-weight="bold"
                    keep-with-next.within-column="always"
                    space-before.minimum="0.8em" space-before.optimum...
                    space-before.maximum="1.2em" color="#404090" hyph...
                    text-align="start">
            <fo:block font-size="11pt" font-style="italic"
                      space-before.minimum="0.88em" space-before.opti...
                      space-before.maximum="1.32em">The DISTINCT keyword
              comes to the rescue!</fo:block>

As you see, there may be quite a number of lines between the section start and the title text. Notice, by the way, how the title is split over two lines here.

Once you've found the fo:block that corresponds to the section start, give it a break-before="page" attribute just like we did before.
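The search for a title without a DocBook id can also be sketched in Python. The function below is a hypothetical helper, not part of the build system. It looks for the title words while allowing any whitespace (including line breaks) between them, and for every hit it reports the nearest fo:block above it that carries an id. This is a heuristic: the title usually also occurs in the outline and the ToC, and the nearest id-bearing block is not guaranteed to be the section start, so check the candidates by eye.

```python
import re

# A fo:block start tag that carries an id attribute.
ID_BLOCK = re.compile(r'<fo:block(?=\s)[^>]*?\sid="([^"]+)"')


def find_section_candidates(fo_text, title_fragment):
    """Return (line_number, block_id) for the id-bearing fo:block that
    most closely precedes each occurrence of title_fragment."""
    # Allow the title to be wrapped over several lines in the FO file.
    words = [re.escape(word) for word in title_fragment.split()]
    title_re = re.compile(r"\s+".join(words))
    candidates = []
    for title_match in title_re.finditer(fo_text):
        # Only look at the text before this occurrence of the title.
        blocks = list(ID_BLOCK.finditer(fo_text, 0, title_match.start()))
        if blocks:
            block = blocks[-1]
            line_number = fo_text.count("\n", 0, block.start()) + 1
            candidates.append((line_number, block.group(1)))
    return candidates


sample = (
    "</fo:block>\n"
    '<fo:block id="d0e2340">\n'
    "  <fo:block>\n"
    '    <fo:block font-size="11pt" font-style="italic">The DISTINCT keyword\n'
    "      comes to the rescue!</fo:block>\n"
)
print(find_section_candidates(sample, "DISTINCT keyword comes to the rescue"))
```

The line number tells you where to go in your editor to add the break-before="page" attribute.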

Why look for the section start and not apply the break-before attribute to the fo:block immediately enclosing the title? Well, this will print the title on the next page all right, but links from the Outline and the ToC will point to the previous page, because the (invisible) section start – the block tag bearing the ID – lies before the page break.

Spaces in filenames, URLs etc.

Problem: Spaces (of varying width) appear at certain positions in filenames etc.

Cause: In the FO file, these elements contain zero-width space characters (ZWSP, Unicode 200B hex) to indicate the points where the string may be broken for line-wrapping. Unfortunately, Apache FOP also uses these spaces for line justification purposes.

Solution: If the spaces are wide enough to be annoying, delete them (except in places where the line does break on such a ZWSP).

How: Searching on the entire URL or filename is guaranteed not to work, precisely because the string contains those zero-width spaces! If the string contains an element that is rare or unique within the file (e.g. hababarulala in the filename C:\Strips\Pintoplaneet\hababarulala.txt), you can use that. Otherwise search on neighbouring text. (You may have already found out that simply browsing the FO file takes way too long, because of the huge amount of XSL-FO overhead which all but drowns the document text.)

Once you have found the string, check if you can actually see the ZWSPs. Some editors don't show them at all (which is probably defensible, given that they have zero width). Even then, you may be able to detect them by walking the string with the cursor keys. If you hit left or right arrow and the cursor doesn't move (but moves again at the next keystroke), that's where the ZWSP is. Some other editors show them with a symbol, yet others (ConText) as three “funny” characters: â€‹. These characters represent the three-byte UTF-8 code for a zero-width space.

If you have identified the zero-width spaces you can delete them, but keep your eye on the PDF file too: don't remove a ZWSP if it is actually used to break the line in the PDF.
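If your editor hides the ZWSPs, a few lines of Python can make them visible or remove them. This is a minimal sketch with made-up helper names (mark_zwsp, remove_zwsp). It assumes the FO file is read as UTF-8 text, and the positions of the ZWSPs in the sample string are invented for the illustration.

```python
ZWSP = "\u200b"  # zero-width space, Unicode 200B hex


def mark_zwsp(text, marker="[ZWSP]"):
    """Return text with every zero-width space replaced by a visible marker."""
    return text.replace(ZWSP, marker)


def remove_zwsp(text, keep=()):
    """Remove zero-width spaces from text.

    keep lists the ZWSPs (counted from 0, left to right) that must stay,
    for instance the one where the line actually breaks in the PDF.
    """
    result = []
    seen = 0
    for char in text:
        if char == ZWSP:
            if seen in keep:
                result.append(char)
            seen += 1
        else:
            result.append(char)
    return "".join(result)


# Invented sample: a ZWSP after each backslash.
sample = "C:\\\u200bStrips\\\u200bPintoplaneet\\\u200bhababarulala.txt"
print(mark_zwsp(sample))
print(remove_zwsp(sample))
```

Passing keep={1} to remove_zwsp would preserve only the second ZWSP in the string, which is what you want if the PDF breaks the line at that point.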

Split table rows

Problem: Table rows split across page boundaries.

Cause: Nothing in particular – there's no rule that forbids page breaks within table rows.

Solution: If you want to keep the row together, insert a hard page break at the start of the row.

How: Find the row by searching on text at the beginning of the row or at the end of the previous row. The element you're looking for is a fo:table-row, but don't use that for a search term, because many DocBook elements (not only <table>s) are implemented using fo:tables and thus contain fo:table-rows.

Once the start of the split row is found, add a break-before attribute like you did with widowed headers:

<fo:table-row break-before="page">

Alternatively, you can give the previous row a break-after attribute.
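As with widowed headers, the row edit can be sketched in Python. The helper name break_before_row is hypothetical. It finds the first occurrence of some text from the row and adds the attribute to the nearest fo:table-row start tag above it. That is a heuristic: with nested tables, or if the search text also occurs earlier in the file, it may pick the wrong row, so verify the result in the FO file.

```python
import re

# A fo:table-row start tag, with or without attributes.
ROW_TAG = re.compile(r"<fo:table-row(?=[\s>])[^>]*>")


def break_before_row(fo_text, row_text):
    """Add break-before="page" to the fo:table-row start tag that most
    closely precedes the first occurrence of row_text."""
    text_pos = fo_text.find(row_text)
    if text_pos < 0:
        raise ValueError("text %r not found" % row_text)
    rows = list(ROW_TAG.finditer(fo_text, 0, text_pos))
    if not rows:
        raise ValueError("no fo:table-row found before %r" % row_text)
    row = rows[-1]
    if "break-before=" in row.group(0):
        return fo_text  # already has a break
    insert_at = row.start() + len("<fo:table-row")
    return fo_text[:insert_at] + ' break-before="page"' + fo_text[insert_at:]


sample = (
    "<fo:table-body>\n"
    "<fo:table-row>\n"
    "<fo:table-cell><fo:block>First</fo:block></fo:table-cell>\n"
    "</fo:table-row>\n"
    "<fo:table-row>\n"
    "<fo:table-cell><fo:block>Second</fo:block></fo:table-cell>\n"
    "</fo:table-row>\n"
    "</fo:table-body>\n"
)
print(break_before_row(sample, "Second"))
```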

Overly wide horizontal spaces

Problem: Very large horizontal justification spaces on lines above a long spaceless string. These long strings are often printed in a monospaced (fixed-width) font.

Cause: Apache FOP often doesn't hyphenate these strings. Therefore, if the string doesn't fit on the line it must be moved to the next line as a whole. This leaves the previous line with “too little” text, making large justification spaces necessary. Note that the large spaces on a line are caused by the long string on the line below it, not by anything on the line itself.

Solution: You may have good reasons to leave the string unbroken. In that case, accept the wide spaces as a consequence. Otherwise, insert a space (or hyphen-plus-space) at the point where the string should be broken.

How: First find the string in the FO file by searching on (part of) its contents. If it's monospaced in the PDF, you'll almost always find it within a fo:inline element. Then look at the PDF and estimate how much of the as yet unbroken string would fit in the large whitespace on the line above. Back in the FO file, insert a space – possibly preceded by a hyphen – in the string at a location where it's acceptable to break it. Rebuild the PDF (build fo2pdf !) and check the result. If you've broken the string too far to the right, it will still be entirely on the next line. Too far to the left and the whitespace may still be too wide to your liking. Adjust and rebuild until you're satisfied.

One surprise you may get during this job is that, once you've broken the string in one place, Apache FOP suddenly decides that it's OK to hyphenate the rest of the string. This will leave you with a part of the string on the first line that contains your own (now erroneous) space but also extends beyond it. You'll now have to delete your space and break the string again at the spot chosen by Apache.

Inserting zero-width spaces

An alternative approach to the wide-spaces problem is to insert zero-width space characters at each and every point where the culprit string may be broken, leaving it to Apache FOP to work out which one is best suited. This is guaranteed to work at the first try, but:

it's only feasible when you have an editor that'll let you insert ZWSPs easily;
it only works for spots where it's OK to break the string without a hyphen;
it may lead to unwanted whitespace appearing at the ZWSP locations (see earlier topic), forcing you to delete them all again except the one that “did the trick”.
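If your editor won't let you type ZWSPs easily, you could generate the replacement string with a short script and paste it in. The sketch below is only an illustration: the helper name add_break_points is made up, and breaking after backslashes and forward slashes is an assumption about where a break without a hyphen is acceptable.

```python
ZWSP = "\u200b"  # zero-width space, Unicode 200B hex


def add_break_points(string, after="\\/"):
    """Insert a zero-width space after every character listed in `after`.

    No ZWSP is added at the very end of the string or where one is
    already present.
    """
    result = []
    for index, char in enumerate(string):
        result.append(char)
        is_last = index == len(string) - 1
        if char in after and not is_last and string[index + 1] != ZWSP:
            result.append(ZWSP)
    return "".join(result)


path = "C:\\Strips\\Pintoplaneet\\hababarulala.txt"
broken = add_break_points(path)
# Show the result with visible markers instead of invisible characters.
print(broken.replace(ZWSP, "[ZWSP]"))
```

Since the FO file is XML, the character reference &#x200B; can be typed in place of the literal character if that is easier in your editor.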

XSL-FO references

The official XSL-FO (Formatting Objects) page is here: http://www.w3.org/TR/xsl/

The Apache FOP homepage is here: http://xmlgraphics.apache.org/fop/

The Apache FOP compliance page is here: http://xmlgraphics.apache.org/fop/compliance.html. It contains a large object support table where you can look up which XSL-FO objects and attributes (properties) are supported. When consulting the table, please bear in mind that we currently use Apache FOP 0.20.5.