What you really need to know about SGML

I have tried to keep this section short, but I cannot explain anything without a small foundation of SGML concepts. So let's cover them before we switch to actual source code.

What is structured documentation?

Structured documentation is built upon structured elements (chapters, sections, paragraphs, and so on), where every element is clearly labeled for what it is: a reference, program output, etc. No explicit information is given about how the document should be rendered; only its structure (and content) is described. When there are explicit rules for presentation, they are kept outside the SGML document.

This allows for automatic processing of the documents, without waiting for AI systems. It encourages authors to concentrate on structure, which conveys meaning.

Thus, the question "How do I put a word in bold with SGML?" has little relevance. The question to ask is rather how to put emphasis on a certain stretch of text.

What is SGML?

Standard Generalized Markup Language is a standardized language intended to facilitate the authoring of structured documentation [1]. More specifically, it is a meta-language. You never actually type SGML itself; instead, SGML is used to describe a structured language specific to one type of document (this description is called a DTD, a Document Type Definition), which defines how specific documents may be structured (written).

Therefore, saying that a document is "in the SGML format" is technically correct, but misleading. It is more informative to say that a document is in the DocBook format, the LinuxDoc format or the TEI format.

What does SGML look like?

SGML is a markup language. All SGML documents include text, mixed with tags, which delimit elements[2]. SGML allows several syntaxes to be used, but we'll stick with the reference syntax, the most common one, where tags are enclosed between angle brackets, < and >. Here is an example:


<article>

<title>The Foo software</title>

<para>
Foo is very fast. And its documentation can be read easily.
</para>

	

If it looks like HTML to you, that is because HTML is (theoretically) defined by an SGML DTD.

Elements have content. For instance, the content of the above para element is "Foo is very fast. And its documentation can be read easily.".

Elements can have attributes to indicate more information. For instance:


<example tested="true">
	    *c++;
</example>
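
To make the notions of element, content and attribute concrete, here is a minimal sketch that reads the example element with Python's standard xml.etree.ElementTree module. Python is not one of the tools this HOWTO describes; it is used here only as an illustration, and the example is treated as XML.

```python
import xml.etree.ElementTree as ET

# The example element from the text, treated as XML.
source = '<example tested="true">*c++;</example>'

element = ET.fromstring(source)

name = element.tag               # the element name
tested = element.get("tested")   # the value of the "tested" attribute
content = element.text.strip()   # the content between start-tag and end-tag

print(name, tested, content)
```

The parser hands back the name, the attribute and the content as three separate pieces of information, which is what makes structured documents easy to process automatically.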

	

You can also have entities, which let you parametrize some text. For instance, if you often refer to "the Best Operating System, Debian" and you want to avoid typing it each time or, worse, having to change every occurrence if you eventually decide on a more modest wording, you can declare an entity, let's call it "debian", and use it by writing an ampersand, its name and a semicolon: "&debian;"[3].
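
As a hedged illustration of entity declaration and use, the sketch below declares the "debian" entity in the internal subset of a DOCTYPE declaration and lets Python's standard xml.etree.ElementTree parser expand it. The surrounding para document is invented for the example, and Python is used only for demonstration.

```python
import xml.etree.ElementTree as ET

# The entity "debian" is declared once, inside the DOCTYPE declaration,
# and then referenced twice in the text as &debian;.
source = """<!DOCTYPE para [
  <!ENTITY debian "the Best Operating System, Debian">
]>
<para>I use &debian;. Everybody should try &debian;.</para>"""

para = ET.fromstring(source)

# The parser replaces each reference with the declared text.
expanded = para.text
print(expanded)
```

Changing the wording now means editing the single ENTITY declaration; every reference follows automatically.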

One element is special: the root element is the global element, which contains everything. The DOCTYPE declaration indicates which element is the root. Here is an example for an XML document. (The SGML environment of Debian 2.2 seems to require a full path name for the DTD below; if so, this is a bug and I will investigate it.)


<!DOCTYPE article PUBLIC "-//Norman Walsh//DTD DocBk XML V3.1//EN"
     "dtd/docbook-xml/docbookx.dtd"[

      

And the XML files?

You'll learn later about XML. Let's just say that XML files usually begin with the XML declaration, which looks like a processing instruction (it starts with <?). It indicates that the file is an XML file and gives some meta-information, such as the version and the character encoding. Example:


	<?xml version="1.0" encoding="utf-8"?>

	

XML files must be well-formed, which means that tags must be balanced (no crossing of tags, which is common in the HTML output of many Web editors). They can also be valid, which means that they conform to their DTD.

In XML, every start-tag must have a matching end-tag, but for an empty element the start-tag and the end-tag can be merged into a single tag written with a / at the end, like:


	    <foobar/>
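
The well-formedness rules can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are invented for illustration. Note that it only tests well-formedness: this module does not validate a document against its DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Tags properly nested.
balanced = "<para>Foo is <emphasis>very</emphasis> fast.</para>"
# Tags crossing: emphasis is closed after para.
crossing = "<para>Foo is <emphasis>very</para></emphasis>"
# An empty element written with the merged tag syntax.
empty = "<para>Line one<foobar/>line two</para>"

print(is_well_formed(balanced))
print(is_well_formed(crossing))
print(is_well_formed(empty))
```

The first and third documents are accepted, while the one with crossing tags is rejected by the parser.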

	  

What is a DTD?

A Document Type Definition is the description (in SGML) of a specific language. You can write your own DTD (it is not very difficult, especially in XML) or you can use an existing DTD, which is convenient if you want to exchange documents with other people. Several such DTDs exist, typically for the purposes of a given group of people (astronomers, chemists, scholars of ancient literature...).

The DTD lists the allowed elements and their relationships (for instance, it says a chapter must have at least one section).

Typical DTDs that you may find useful include the ones already mentioned: DocBook, LinuxDoc and TEI.

At the beginning of a document, you will find a reference to the DTD to use (there are several ways to indicate such references; the following example is for LinuxDoc):


<!doctype linuxdoc system>

<article>

<title>The Linux Kernel HOWTO

	

Which DTD to choose?

Very often, you'll have no choice: the project you're a part of will already have chosen. Since standardization is of course very important in a big project, there is little chance you'll be able to change that. For instance, the Linux Documentation Project uses LinuxDoc, while FreeBSD, GNOME and KDE use DocBook.

If you have the choice, I suggest staying close to what similar projects are doing. If you write technical documentation for computer hardware or software, this probably means using DocBook.

How do I write SGML?

Since SGML is a markup language, you can use any editor, like vi or even cat.

But it is often easier with an editor that helps you insert tags and knows, for example, which ones are valid at a given point. I recommend Emacs with its SGML mode.

What is XML?

XML (Extensible Markup Language) is a subset of SGML, a sort of SGML--. It was designed first for the World-Wide Web, but it is now used in unrelated areas.

XML is much simpler than SGML, with fewer options, so a parser can be lighter and faster.

What is a stylesheet?

In the markup world, you try to separate content from presentation. Content is expressed in the SGML document, following a given DTD. Presentation is expressed outside of the document, typically in a DTD-specific stylesheet, which is a description, in an appropriate language (DSSSL - Document Style Semantics and Specification Language - is the most common[4]), of the layout rules for documents written for a certain DTD.

For instance, it is the author of the stylesheet who will decide that titles should be rendered in bold, that URLs will be printed in red, etc.

If you know the CSS (Cascading Style Sheets) language, do note that typical languages for SGML stylesheets are more complicated: they let you specify not only the rendering of an element, but also the reordering of elements, the computation of data from some elements, etc. DSSSL, for instance, is a full-blown programming language (based on Scheme), enriched with stylesheet constructs.
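
The separation between content and presentation can be sketched in a few lines. The example below is not DSSSL and does not reflect how any real stylesheet processor works; it is a toy written in Python, with invented rendering rules, that only shows the principle: the document says what each element is, and a separate table of rules decides how it looks.

```python
import xml.etree.ElementTree as ET

# Content: structure only, no presentation information.
document = ET.fromstring(
    "<article>"
    "<title>The Foo software</title>"
    "<para>Foo is very fast.</para>"
    "</article>"
)

# Presentation: a toy "stylesheet" mapping element names to rendering rules.
# These rules are invented for illustration.
stylesheet = {
    "title": lambda text: text.upper() + "\n" + "=" * len(text),
    "para": lambda text: "    " + text,
}

def render(root, rules):
    """Apply the matching rule to each child element, in document order."""
    lines = []
    for child in root:
        # Elements without a rule are rendered as plain text.
        rule = rules.get(child.tag, lambda text: text)
        lines.append(rule(child.text or ""))
    return "\n".join(lines)

output = render(document, stylesheet)
print(output)
```

Replacing the stylesheet table changes the appearance of every document of that type, without touching the documents themselves.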

Notes

[1]

It has other uses, such as data interchange.

[2]

Depending on the DTD, the end-tag can be mandatory or not. In XML, end-tags are always mandatory.

[3]

These are reference entities. SGML uses other types of entities, which are not covered in this HOWTO.

[4]

The XML world created a new language, XSL, which has few implementations at this moment. Despite what you may read in executive summaries, it is perfectly acceptable to use DSSSL to render XML files.