Chapter 3. Validating DocBook Documents
$Revision$
A key feature of XML markup is that you validate it. The DocBook schema is a precise description of valid nesting: the order of elements, and their content. All DocBook documents must conform to this description or they are not DocBook documents (by definition). The validation technology that is built into XML is the Document Type Definition or DTD. A validating parser is a program that can read the DTD and a particular document and determine whether the exact nesting and order of elements in the document is valid according to the DTD.
DocBook is now defined by a RELAX NG grammar and Schematron rules, so it is no longer necessary to validate with the DTD. In fact, it isn’t even very valuable since the DTD version doesn’t enforce many DocBook constraints. Instead, an external RELAX NG and Schematron validator must be used.
Validation is performed on a document after it has been parsed. It is possible for parsing errors to occur as well as validation errors (if, for example, your document isn’t well-formed XML). We’re going to assume that your documents are well-formed and not discuss XML parsing errors.
If you are not using a structured editor that can provide guidance on the markup as you type, validation with an external tool is a particularly important step in the document creation process. You cannot expect to get rational results from subsequent processing (such as document publishing) if your documents are not valid.
There are several free validators that will handle RELAX NG and Schematron, including Jing and MSV. For more detail about available tools, see Section C.9, “XML Tools”.
3.1. ID/IDREF Constraints and Validation
In XML, attributes of type ID and IDREF
provide a straightforward cross-referencing mechanism. In DocBook, the
xml:id
attribute contains values of
type ID, and the linkend
attribute
contains values of type IDREF.
Within any document, no two attributes of type ID may have the same value. In addition, for any attribute of type IDREF, there must be one, and only one, instance of an attribute of type ID with the same value in the same document. In other words, you can’t have two elements with the same ID, and every IDREF must match an ID that exists somewhere in the same document.
Note
Checking these constraints is not a core part of RELAX NG. If you
want RELAX NG to check them, you need to enable a set of “DTD
compatibility” extensions. Unfortunately, the DTD compatibility
extensions do not work well with the DocBook grammar. However, because
DocBook uses xml:id
for its ID
attribute, it’s not necessary to enforce the constraints with RELAX NG.
You can either tell your processor not to perform the DTD compatibility
extension checks, or ignore the warning messages that they produce. You
could also use Schematron to check ID/IDREF constraints.
3.2. Validating Your Documents
RELAX NG validation is supported natively in some editing
environments, such as <oXygen/> (see Figure 3.1, “<oXygen/> XML Editor validation”),
XMLmind, and Emacs (with
nxml-mode
). <oXygen/> and XMLmind also support Schematron validation. There are also several
standalone validators—for example, Jing and Sun’s Multi-Schema Validator
(MSV).
If your documents make use of several namespaces (e.g., SVG and MathML embedded in DocBook), tools that support NVDL, Namespace-based Validation Dispatching Language [NVDL], can greatly simplify complex validation scenarios.
3.3. Understanding Validation Errors
Every validator produces slightly different error messages, but most indicate exactly (at least technically[2]) what is wrong and where the error occurred. With a little experience, this information is all you’ll need to quickly identify what’s wrong.
In the rest of this section, we’ll look at a number of common errors and the messages they produce in MSV. We’ve chosen MSV because it generally produces informative error messages.
3.3.1. Character Data Not Allowed Here
Out-of-context character data is frequently caused by a missing start tag, but sometimes it’s just the result of typing in the wrong place!
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
You can't put character data here.
<para>
<emphasis role="bold">This</emphasis> paragraph contains
<emphasis>some <emphasis>emphasized</emphasis> text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>
java -jar msv.jar docbook.rng badpcdata.xml
start parsing a grammar.
warnings are found. use -warning switch to see all warnings.
validating badpcdata.xml
Error at line:10, column:7 of badpcdata.xml
unexpected character literal
You can’t put character data directly in a
chapter
. Here, a wrapper element, such as
para
, is missing around the sentence between the
first two paragraphs.
3.3.2. Misspelled Start Tag
If you spell a start tag incorrectly, or use the wrong namespace, the parser will get confused:
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
<paar>
<emphasis role="bold">This</emphasis> paragraph contains
<emphasis>some <emphasis>emphasized</emphasis> text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>
java -jar msv.jar docbook.rng misspell.xml
start parsing a grammar.
warnings are found. use -warning switch to see all warnings.
validating misspell.xml
Error at line:9, column:7 of misspell.xml
tag name "paar" is not allowed. Possible tag names are: <address>,<anchor>,
<annotation>,<bibliography>,<bibliolist>,<blockquote>,<bridgehead>,
<calloutlist>,<caution>,<classsynopsis>,<cmdsynopsis>,
<constraintdef>,<constructorsynopsis>,<destructorsynopsis>,<epigraph>,
<equation>,<example>,<fieldsynopsis>,<figure>,<formalpara>,
<funcsynopsis>,<glossary>,<glosslist>,<important>,<index>,
<indexterm>,<informalequation>,<informalexample>,<informalfigure>,
<informaltable>,<itemizedlist>,<literallayout>,<mediaobject>,
<methodsynopsis>,<msgset>,<note>,<orderedlist>,<para>,
<procedure>,<productionset>,<programlisting>,<programlistingco>,
<qandaset>,<refentry>,<remark>,<revhistory>,<screen>,<screenco>,
<screenshot>,<section>,<section>,<segmentedlist>,<sidebar>,
<simpara>,<simplelist>,<simplesect>,<synopsis>,<table>,<task>,
<tip>,<toc>,<variablelist>,<warning>
Luckily, these are usually easy to spot, unless you accidentally spell the name of another element. In that case, your error might appear to be out of context.
3.3.3. Out-of-Context Start Tag
Sometimes the problem isn’t spelling, but rather placing a tag in the wrong context. When this happens, the validator tries to figure out what it can add to your document to make it valid. Then it tries to recover by continuing as if it had seen what it just added. Of course, this may cause future errors:
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
<para><title>Paragraph With Inlines</title>
<emphasis role="bold">This</emphasis> paragraph contains
<emphasis>some <emphasis>emphasized</emphasis> text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is unremarkable in
every regard. This is a paragraph in the test chapter. It is
unremarkable in every regard. This is a paragraph in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>
$ java -jar msv.jar docbook.rng context.xml
start parsing a grammar.
warnings are found. use -warning switch to see all warnings.
validating context.xml
Error at line:9, column:14 of context.xml
tag name "title" is not allowed. Possible tag names are: <abbrev>,<accel>,
<acronym>,<address>,…,<varname>,<warning>,<wordasword>,<xref>
In this example, we probably wanted a
formalpara
so that we could have a title on the
paragraph. But note that the validator didn’t suggest this alternative.
The validator only tries to add additional elements, rather than rename
elements that it’s already seen.
[2]
It is often the case that you can correct an error in the document in several ways. The validator suggests one possible fix, but this is not always the right fix. For example, the validator may suggest that you can correct out-of-context data by adding another element, when in fact it’s “obvious” to human eyes that the problem is a missing end tag.