XML, the Extensible Markup Language, is a simple dialect of SGML. In the words of the XML specification, “the goal [of XML] is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.”
XML raises two issues with respect to DocBook:
Are DocBook SGML instances valid XML instances?
Can the DocBook DTD be made into a valid XML DTD?
If you have an existing SGML system, and your primary goal is to serve DocBook documents over the Web as XML, only the first of these issues is relevant. As the popularity of XML grows, we will see more and more XML-aware tools that don't implement full ISO 8879 SGML. If your goal is to author DocBook documents with one of this new generation of tools, you will only be able to achieve validity with an XML DocBook DTD.
Although not yet officially adopted by the OASIS DocBook Technical Committee, an XML version of DocBook is available now and provided on the CD-ROM.
Most DocBook documents can be made into well-formed XML documents very easily. With few exceptions, valid DocBook SGML instances are also well-formed XML instances. The following areas may need to be addressed.
It is common for SGML instances to use only a public identifier in document type and parameter entity declarations:
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>
XML requires a system identifier:
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>
If you're used to using catalog files to resolve system identifiers, you may be dismayed to learn that system identifiers are required. Because most tools favor system identifiers over public identifiers, all of the portability that was gained by the use of catalog files seems to have been lost. In the long run, it'll be regained by the fact that XML system identifiers can be URNs, which will have a resolution scheme like catalogs, but what about the short run?
Luckily, there are a couple of options. First, you can tell your tools to use the public identifiers even though system identifiers are present. Simply add:
OVERRIDE YES
to your catalog files. Alternatively, you can remap system identifiers with the SYSTEM catalog directive. If you are faced with documents that don't use public identifiers at all, this is probably your only option.
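The two catalog options above can be pictured with a small Python sketch. This is a simplified model of the lookup order, not a catalog implementation: the `resolve` function, the dictionary layout, and the local path `/usr/local/sgml/docbookx.dtd` are all hypothetical names chosen for illustration, and real catalog processors handle many more entry types.

```python
def resolve(public_id, system_id, catalog):
    """Simplified model of catalog lookup (illustrative only).

    `catalog` is a hypothetical dict with three keys:
      "override": bool, standing in for OVERRIDE YES / OVERRIDE NO
      "public":   dict mapping public identifiers to local paths
      "system":   dict mapping system identifiers to local paths
    """
    # A SYSTEM entry remaps an explicit system identifier directly.
    if system_id is not None and system_id in catalog["system"]:
        return catalog["system"][system_id]
    # PUBLIC entries are consulted when no system identifier was given,
    # or when the override setting says to prefer them anyway.
    if public_id is not None and (system_id is None or catalog["override"]):
        if public_id in catalog["public"]:
            return catalog["public"][public_id]
    # Otherwise fall back to the system identifier from the document.
    return system_id


PUBLIC_ID = "-//OASIS//DTD DocBook XML V4.1.2//EN"
SYSTEM_ID = "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"
LOCAL_DTD = "/usr/local/sgml/docbookx.dtd"  # hypothetical local copy
```

With `"override": True` and a matching entry under `"public"`, the sketch returns the local path even though the document supplies a system identifier. With an entry under `"system"` instead, the system identifier is remapped regardless of the override setting.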
If you have used SGML minimization features in your instances:
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter id=chap1><title>Chapter Title</title>
<para>
This <emphasis>paragraph</> is important.
</para>
</chapter>
they will not be well-formed XML instances. In particular, XML requires that attribute values be quoted and does not allow empty end tags (</>). XML also forbids tag omission, and there are probably a half dozen or so more exotic examples of minimization that you may have used. They're all illegal. The easiest way to remove these minimizations is probably with a tool like sgmlnorm (included in the SP and Jade distributions, on the CD-ROM). The result will be something like this:
<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter id="chap1"><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>
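One quick way to confirm that a converted instance is well-formed is to run it through any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` is an assumption made for this example. Note that this checks well-formedness only: the parser does not fetch the DTD or validate against it.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The minimized instance: unquoted attribute value and an empty end tag.
MINIMIZED = (
    '<chapter id=chap1><title>Chapter Title</title>\n'
    '<para>This <emphasis>paragraph</> is important.</para>\n'
    '</chapter>'
)

# The normalized instance: attribute quoted, end tag spelled out.
NORMALIZED = (
    "<?xml version='1.0'?>\n"
    '<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"\n'
    '  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">\n'
    '<chapter id="chap1"><title>Chapter Title</title>\n'
    '<para>This <emphasis>paragraph</emphasis> is important.</para>\n'
    '</chapter>'
)
```

Calling `is_well_formed(MINIMIZED)` reports a failure, while `is_well_formed(NORMALIZED)` succeeds.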
Correct processing of this document may require access to the default attributes:
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<para>
Write to us at:
<address>
90 Sherman Street
Cambridge, MA 02140
</address>
</para>
</chapter>
Some XML processing environments are going to ignore the doctype declaration in your document, even if it's present. This is relevant when your instance uses elements that have attributes with default values. The default values are expressed in the DTD, but may not be expressed in your instance. In the case of DocBook, there are relatively few of these, and your stylesheet can probably be constructed to do the right thing in either case. (It essentially treats the attributes as if they had implied values.)
The result will be something like this:
<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
Write to us at:
<address format="linespecific">
90 Sherman Street
Cambridge, MA 02140
</address>
</para>
</chapter>
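If the processing environment ignores the DTD, the defaults can also be supplied in a pre-processing step. The sketch below assumes a hand-maintained table of defaults; `DEFAULTS` and `apply_defaults` are hypothetical names, and the table lists only the one default shown in the example above rather than every default in DocBook.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: (element name, attribute name) -> default value.
# Only the default from the example above is listed here.
DEFAULTS = {("address", "format"): "linespecific"}


def apply_defaults(root):
    """Add each default attribute to elements that do not already carry it."""
    for elem in root.iter():
        for (name, attr), value in DEFAULTS.items():
            if elem.tag == name and attr not in elem.attrib:
                elem.set(attr, value)
    return root


root = ET.fromstring(
    "<chapter><title>Chapter Title</title>"
    "<para>Write to us at: <address>90 Sherman Street\n"
    "Cambridge, MA 02140</address></para></chapter>"
)
apply_defaults(root)
```

Attributes that are already present are left alone, so an explicit value in the instance always wins over the table.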
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<para>
This book was published by O'Reilly&trade;.
</para>
</chapter>
The result will be something like this:
<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
<!ENTITY trade "™">
]>
<chapter><title>Chapter Title</title>
<para>
This book was published by O'Reilly&trade;.
</para>
</chapter>
<!DocType Book PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<book><title>Book Title</title>
<chapter><title>Chapter Title</Title>
<para>
Paragraph test.
</para>
<PARA>
A second paragraph.
</PARA>
</chapter>
</book>
With the standard DocBook SGML declaration, DocBook instances are not case-sensitive with respect to element and attribute names. XML is always case-sensitive. As long as you have used the same case consistently, your XML instances will be well-formed, but it may still be advantageous to do some case-folding because it will simplify the construction of stylesheets.
Keywords in XML are case-sensitive, and must be in uppercase.
The name declared in the document type declaration, like all other names, is case-sensitive.
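In XML, start and end tags must match in case, and names that differ only in case (such as para and PARA) are different names.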
The result will be something like this:
<?xml version='1.0'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<book><title>Book Title</title>
<chapter><title>Chapter Title</title>
<para>
Paragraph test.
</para>
<para>
A second paragraph.
</para>
</chapter>
</book>
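Case-folding of element names can be approximated with a text-level pass before the instance is handed to an XML parser. The following is a deliberately naive sketch: `fold_element_names` is a hypothetical helper, it lowercases only the names in start and end tags, it leaves attribute names and the document type declaration untouched, and it would also rewrite anything that merely looks like a tag inside a comment or CDATA section.

```python
import re

# Matches "<" or "</" followed by an element name.
# "<!" and "<?" are not matched, so declarations and PIs are left alone.
_TAG_NAME = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9._-]*)")


def fold_element_names(text):
    """Lowercase the element name in every start and end tag."""
    return _TAG_NAME.sub(
        lambda m: "<" + m.group(1) + m.group(2).lower(), text
    )


sample = "<chapter><title>Chapter Title</Title><PARA>A second paragraph.</PARA></chapter>"
folded = fold_element_names(sample)
```

For instances with mixed-case attribute names or tag-like text in comments, a tool that actually parses the SGML (such as sgmlnorm) is the safer choice.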
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<indexterm id="idx-bor"><primary>Something</primary></indexterm>
<para>
Paragraph test.
</para>
<indexterm startref="idx-bor">
</chapter>
The StartRef attribute on indexterm and the OtherTerm attribute on GlossSee and GlossSeeAlso are #CONREF attributes.
In SGML terms, this means that when these attributes are used, the content of the tag is taken to be the same as the content of the tag pointed to by the attribute.
If you have used these attributes, your instance will contain both empty and non-empty versions of these tags. Your best bet is to transform the #CONREF version into an empty tag and let your stylesheet deal with it appropriately.
The result will be something like this:
<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<indexterm id="idx-bor"><primary>Something</primary></indexterm>
<para>
Paragraph test.
</para>
<indexterm startref="idx-bor"/>
</chapter>
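The transformation to an empty tag can be done as a text substitution before parsing. The sketch below is one possible approach, not a complete converter: `empty_conref_indexterms` is a hypothetical helper, it handles only indexterm with a lowercase, double-quoted startref attribute, and GlossSee and GlossSeeAlso would need the same treatment for their OtherTerm attribute.

```python
import re

# An indexterm start tag that carries a startref attribute and is not
# already written as an empty-element tag (no "/" right before ">").
_CONREF_START = re.compile(
    r'<indexterm(\s[^<>]*\bstartref\s*=\s*"[^"]*"[^<>]*?)(?<!/)>'
)


def empty_conref_indexterms(text):
    """Rewrite <indexterm startref="..."> as <indexterm startref="..."/>."""
    return _CONREF_START.sub(r"<indexterm\1/>", text)


before = (
    '<indexterm id="idx-bor"><primary>Something</primary></indexterm>'
    '<indexterm startref="idx-bor">'
)
after = empty_conref_indexterms(before)
```

Tags without startref, and tags that are already empty, pass through unchanged.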
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN" [
<!ENTITY % draft "IGNORE">
<!ENTITY % sourcecode "CDATA">
]>
<chapter><title>Chapter Title</title>
<![ %draft; [
<para>
Draft paragraph.
</para>
]]>
<para>
The following code is totally out of context:
<programlisting>
<![ %sourcecode; [
if (x < 3) {
  y = 3;
}
]]>
</programlisting>
</para>
</chapter>
The result will be something like this:
<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
The following code is totally out of context:
<programlisting>
<![CDATA[
if (x < 3) {
  y = 3;
}
]]>
</programlisting>
</para>
</chapter>
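The conversion shown above, where the ignored section disappears and the parameterized CDATA section becomes a literal one, can be sketched as a text substitution. This is a minimal illustration under stated assumptions: `resolve_marked_sections` and its `keywords` mapping are hypothetical, the keywords are supplied by hand rather than read from the internal subset, and nested marked sections are not handled.

```python
import re

# A marked section whose keyword is given by a parameter entity reference,
# for example "<![ %draft; [ ... ]]>".
_MARKED = re.compile(r"<!\[\s*%([A-Za-z][\w.-]*);\s*\[(.*?)\]\]>", re.DOTALL)


def resolve_marked_sections(text, keywords):
    """Resolve marked sections given a mapping of entity name -> keyword."""
    def replace(match):
        keyword = keywords[match.group(1)].upper()
        body = match.group(2)
        if keyword == "IGNORE":
            return ""                      # drop the section entirely
        if keyword == "INCLUDE":
            return body                    # keep the content, drop the wrapper
        if keyword == "CDATA":
            return "<![CDATA[" + body + "]]>"
        raise ValueError("unsupported marked section keyword: " + keyword)
    return _MARKED.sub(replace, text)


source = (
    "<![ %draft; [ <para>Draft paragraph.</para> ]]>"
    "<programlisting><![ %sourcecode; [ if (x < 3) { y = 3; } ]]></programlisting>"
)
result = resolve_marked_sections(source, {"draft": "IGNORE", "sourcecode": "CDATA"})
```

Passing `"INCLUDE"` for `draft` instead would keep the draft paragraph and remove only the marked section wrapper.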
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN" [
<!ENTITY sourcecode SYSTEM "program.c" CDATA>
]>
<chapter><title>Chapter Title</title>
<para>
The following code is totally out of context:
<programlisting>
&sourcecode;
</programlisting>
</para>
</chapter>
XML instances cannot use CDATA or SUBDOC external entities. One option for integrating external CDATA content into a document is to employ a pre-processing pass that inserts the content inline, wrapped in a CDATA marked section.
SUBDOC entities may be more problematic. If you do not require validation, it may be sufficient to simply put them inline. XML namespaces may offer another possible solution.
The result will be something like this:
<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
The following code is totally out of context:
<programlisting>
<![CDATA[
int main () {
  ..
}
]]>
</programlisting>
</para>
</chapter>
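A pre-processing pass of the kind described above might look like the following sketch. The function names `cdata_section` and `inline_entity` are assumptions made for this example, and the content is passed in as a string; in practice it would be read from the external file (program.c in the example). The one subtlety is that the sequence `]]>` cannot appear inside a CDATA section, so it is split across two adjacent sections.

```python
def cdata_section(content):
    """Wrap `content` in a CDATA section, splitting any embedded "]]>"."""
    # "]]>" would end the section early, so close the section after "]]"
    # and reopen a new one before ">".
    return "<![CDATA[" + content.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def inline_entity(document, name, content):
    """Replace each reference &name; with `content` as a CDATA section."""
    return document.replace("&" + name + ";", cdata_section(content))


doc = "<programlisting>&sourcecode;</programlisting>"
inlined = inline_entity(doc, "sourcecode", "int main () { .. }")
```

Because the replacement is plain text substitution, the entity declaration in the internal subset still has to be removed separately.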
Converting the DocBook DTD to XML is much more challenging than converting the instances. It is probably not possible to construct an XML DTD with validation power identical to that of DocBook. The list below identifies most of the issues that must be addressed, and describes how the DocBook XML DTD deals with them:
XML does not allow comments inside markup declarations. Most of them have been moved to comment declarations preceding the markup declaration that used to contain them. A few small, inline comments that seemed like they would be out of context if moved before the declaration were simply deleted.
The small number of places in which DocBook uses name groups have been expanded.
There's one downside: DocBook uses %admon.class; in a name group to define the content model and attribute lists for elements in the admonitions class. In DocBook XML, this convenience cannot be expressed. If additional admonitions are added, the element and attribute list declarations will have to be copied for them.
XML does not allow CDATA or RCDATA declared content. Graphic and InlineGraphic have been made EMPTY. The content model for SynopFragmentRef, the only RCDATA element in DocBook, has been changed to (arg | group)+.
In DocBook, exclusions are used to exclude the following:
Ubiquitous elements (indexterm and BeginPage) from a number of contexts in which they should not occur (such as metadata, for example).
Formal objects from Highlights, Examples, Figures and LegalNotices.
Formal objects and InformalTables from tables.
Removing these exclusions from DocBook XML means that it is now valid, in the XML sense, to do some things that don't make a lot of sense (like put a Footnote in a Footnote). Be careful.
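Since the DTD no longer catches these cases, a small check outside the validator can. The following sketch looks for the one case named above, a footnote inside a footnote; `nested_footnotes` is a hypothetical helper and assumes the lowercase element name footnote used in XML instances. Other former exclusions would need similar checks.

```python
import xml.etree.ElementTree as ET


def nested_footnotes(elem, inside=False):
    """Return every footnote element that has a footnote ancestor."""
    found = []
    is_footnote = elem.tag == "footnote"
    if is_footnote and inside:
        found.append(elem)
    for child in elem:
        # Children are "inside" if this element or an ancestor is a footnote.
        found.extend(nested_footnotes(child, inside or is_footnote))
    return found


suspect = ET.fromstring(
    "<para>Text<footnote><para>Note"
    "<footnote><para>Inner note</para></footnote>"
    "</para></footnote></para>"
)
problems = nested_footnotes(suspect)
```

An empty result means no footnote was found inside another footnote.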
Inclusions in DocBook are used to add the ubiquitous elements (indexterm and BeginPage) unconditionally to a large number of contexts. In order to make these elements available in DocBook XML, they have been added to most of the parameter entities that include #PCDATA. If new locations are discovered where these terms are desired, DocBook XML will be updated.
Elements with mixed content must list #PCDATA first. The content models of many elements have been updated to make them a repeatable OR group beginning with #PCDATA.
Many SGML declared attribute types (NAME, NUMBER, NUTOKEN, and so on) are not allowed in XML.
#CONREF attributes are not allowed in XML. The #CONREF attributes on indexterm, GlossSee, and GlossSeeAlso were changed to #IMPLIED. The content model of indexterm was modified so that it can be empty.