Date:      Sun, 15 Mar 1998 23:32:40 +0000
From:      Nik Clayton <nik@nothing-going-on.demon.co.uk>
To:        doc@FreeBSD.ORG
Subject:   Handbook LinuxDoc -> DocBook migration
Message-ID:  <19980315233240.48817@nothing-going-on.org>

Hi folks,

For those that are interested in the project to migrate the Handbook from
the LinuxDoc DTD to DocBook, please read the following. It's my current
plan for how to do this.

Comments welcome.

N


                           Handbook DTD Migration

                               By Nik Clayton

                           Email: nik@freebsd.org

Contents

     1   Background
     2   Current Handbook layout
     3   Translation
     4   Branching the repository
     5   Software requirements
     6   Doing the conversion
          6.1   Protect entities
          6.2   Mechanically convert the Handbook to DocBook
     7   Clean up the converted file
          7.1   Reindent
          7.2   Refill the paragraphs
          7.3   Fix entries broken by the previous two changes
          7.4   Replace comments
          7.5   Remove empty markup
          7.6   Fixup the markup choice
          7.7   Fixup other errors
     8   Split the Handbook into smaller files
          8.1   Directory structure and file names
          8.2   Add a DOCTYPE to the split files
          8.3   Add a new entity file, _misc.sgml
          8.4   Bring in the lost entities
     9   Automate converting the Handbook to target formats
     10   Merge changes from HEAD and commit, replacing the HEAD
     11   If you want to help...

 NOTE:

 This document assumes that the reader has a basic understanding of SGML.

Over the past six months or so a consensus has emerged that the FreeBSD
Handbook (the Handbook) should be migrated from its existing DTD (LinuxDoc)
to the DocBook DTD.

This document outlines how that will be achieved. This is a work in
progress, and comments are welcome. It is also a lengthy document, because
I'm trying to be as thorough as possible.

1   Background

The FreeBSD Handbook is currently marked up using the LinuxDoc DTD. A
variety of tools are then used to convert the Handbook to other formats,
including HTML, ASCII text and Postscript.

It is generally agreed that the LinuxDoc DTD is not up to the task of
encoding the meaning of elements of the Handbook in sufficient detail. It
has been decided that the Handbook should migrate to the DocBook DTD.
DocBook is expressly designed for writing technical documentation such as
the Handbook, and features a rich element set. It is also relatively easy to
extend.

As is often the case in volunteer projects such as FreeBSD (and Linux) no
one has had the time to work through the issues involved in the migration,
and then commit to being able to do the work.

That's recently changed.

I've just been able to commit a large chunk of spare time to this project
and (in conjunction with John Fieber) have been working through the issues
involved.

The rest of this document aims to bring the interested reader up to speed on
what's about to happen to the Handbook, and should provide sufficient detail
for the interested SGML hacker to let me know what I've missed.

2   Current Handbook layout

The Handbook is currently organised as a collection of files (with a .sgml
extension) in one directory.

Some of these files contain SGML entity definitions, and exist only to be
included in the other SGML files. The other files form the chapters and
sections of the Handbook. Some chapters are entirely in one file; others are
split and stored in several files.

This is (IMHO) mildly annoying. The migration process provides an
opportunity to address this.

3   Translation

The FreeBSD Handbook has been translated into Japanese, and the Japanese
translators track changes to the Handbook and convert the changes by hand.

In order to make their task easier during this migration, almost all the
changes made will be automated, allowing the Japanese team to easily
replicate them.

In addition, no changes to the content of the Handbook will be made until
after the Handbook has been converted to DocBook.

4   Branching the repository

The migration process consists of a number of discrete steps. The state of
the Handbook mid-way through these steps is not suitable for ``public
consumption''.

In addition, it is possible that unforeseen problems will occur during the
migration.

It would be possible to do all of the migration process ``offline'', and
only commit the converted Handbook when the process was complete. However,
this would deny the Japanese translators access to the diffs of the
Handbook's state as it is converted.

For these reasons, the conversion process must not happen on the HEAD of the
CVS repository. Instead, the CVS doc repository will be branched with the
tag ``LINUXDOC_2_DOCBOOK''. All the migration commits will happen along this
branch.

5   Software requirements

The conversion process requires the following applications and other pieces
of software. Most of these are available in the FreeBSD ports collection.

   * LinuxDoc DTD (textproc/linuxdoc)

   * DocBook DTD (textproc/docbook)

   * jade and nsgmls (textproc/jade)

   * instant and the LinuxDoc to DocBook translation specification
     (textproc/sgmlformat)

   * perl for entity protection (lang/perl5)

   * xemacs and the psgml package to make editing the Handbook considerably
     simpler (editors/xemacs20). psgml also features commands to assist in
     the reformatting of the Handbook.

   * Norm Walsh's Modular DocBook stylesheet, available at
     http://www.berkshire.net/~norm/dsssl/docbook/.

     Norm has written DocBook stylesheets that can be used with Jade to
     transform text marked up in DocBook to HTML and RTF. These stylesheets
     will be used to convert the Handbook to HTML and RTF. From there (in
     the short term) the RTF can be converted to LaTeX and thence to
     Postscript. The HTML can be used to generate the plain text version of
     the Handbook.

     In the long term, the JadeTeX backend will be used to convert the
     Handbook to TeX, and from there to Postscript.

6   Doing the conversion

Trying to convert the Handbook from LinuxDoc to DocBook on a file by file
basis won't work. At least, not without an immense amount of grief.

Instead, John Fieber's linuxdoc-docbook translation specification and
instant will be used to do the conversion.

However, the conversion process has a number of interesting wrinkles.

  1. Entity definitions will be lost (see below for a workaround).

  2. Comments in the individual source files will be lost. This is
     (potentially) a problem, since some of those comments contain copyright
     notices, reminders and so forth.

     They will need to be manually added back into the converted file.

  3. Because the migration process is moving from a less expressive DTD to a
     more expressive DTD, the converted document will not take full
     advantage of the elements in the target DTD (DocBook). The translated
     document will need to be examined, and some markup substituted.

  4. The result of the conversion process is one large file. This will need
     splitting up into smaller files. It is here that the opportunity to
     reorganise and rename the files that comprise the Handbook arises.

6.1   Protect entities

The Handbook uses general entities to represent replaceable text. For
example, &a.jkh; expands to Jordan's name and e-mail address.

Unfortunately, the conversion process will cause all these entities to
expand to their full representation, and the entity definition will be lost.

The solution is to protect each entity, by making each entity refer to its
own name, while storing its expanded form elsewhere.

I have a relatively simple Perl script that does this. In essence, it looks
for

      <!ENTITY foo "This is some text">

and converts it to

      <!ENTITY foo CDATA "-entity-foo;" -- "This is some text" -->

which ensures that the process can be reversed.

A similar operation is performed on entities that use the SYSTEM keyword to
refer to external files, rather than containing literal text.
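
To make the transformation concrete, here is a minimal sketch of the same
idea in Python. It is not the Perl script mentioned above; the function
names and the regular expressions are assumptions made for illustration. It
only handles simple declarations with a double-quoted literal, and it would
need more care for replacement text that itself contains "--".

```python
import re

# A simple general entity declaration with a double-quoted literal, e.g.
#   <!ENTITY foo "This is some text">
# Parameter entities (<!ENTITY % foo ...>) deliberately do not match.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+"([^"]*)">')

# The protected form produced by protect_entities() below.
PROTECTED_DECL = re.compile(
    r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+CDATA\s+"-entity-\1;"\s+--\s+"([^"]*)"\s+-->'
)


def protect_entities(sgml):
    """Make each entity expand to a marker naming itself.

    The original replacement text is kept in a comment inside the
    declaration so that it can be recovered later.
    """
    def repl(match):
        name, text = match.group(1), match.group(2)
        return '<!ENTITY %s CDATA "-entity-%s;" -- "%s" -->' % (name, name, text)

    return ENTITY_DECL.sub(repl, sgml)


def unprotect_entities(sgml):
    """Reverse protect_entities(), restoring the original declarations."""
    def repl(match):
        return '<!ENTITY %s "%s">' % (match.group(1), match.group(2))

    return PROTECTED_DECL.sub(repl, sgml)
```

Calling unprotect_entities(protect_entities(text)) should give back the
original declarations for any entity the first pattern matched.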

6.2   Mechanically convert the Handbook to DocBook

The Handbook can now be converted to the DocBook DTD. The command line to
accomplish this is

     # nsgmls -c /usr/local/share/sgml/linuxdoc/catalog handbook.sgml | \
         instant -t /usr/local/share/sgml/transpec/linuxdoc-docbook.t | \
         sed -e 's/-entity-/\&/g' > handbook.docb

The result of this conversion is a file containing syntactically valid but
very ugly DocBook markup. This converted file is not yet ready to be
converted to HTML.

The one redeeming feature of this process is that it is completely automatic.
This should allow anyone working on the translation of the Handbook to do
the same thing to the translated version and get the same results.

7   Clean up the converted file

After the conversion, the Handbook SGML file will be cleaned up. Each one of
these steps is a separate commit.

7.1   Reindent

The file will be reindented. The easiest way to do this is to load the file
into xemacs, activate sgml-mode and run the following function (which is not
a part of psgml).

     (defun sgml-indent-buffer ()
       "Indent the current buffer, one line at a time."
       (interactive "*")
       (save-excursion
         (goto-char (point-min))
         ;; forward-line returns 0 for as long as it can move down a line
         (while (= (forward-line 1) 0)
           (sgml-indent-or-tab))))


7.2   Refill the paragraphs

Again, the easiest way to refill (rewrap) the paragraphs in the file is to
use sgml-mode. In this case, the cursor is placed on the first element in
the new Handbook (``book'') and M-x sgml-fill-element is run.

7.3   Fix entries broken by the previous two changes

The two previous changes have an undesirable side effect. Any content that
is supposed to be rendered based on its layout in the SGML source will have
had that layout corrupted.

Examples of this include program listings, examples for the user to type in,
PGP key blocks, and so on.

These entries will need to be fixed up (reindented, and so on) by hand,
based on their layout in the original handbook.

 Automating this:

 This cannot (currently) be automated. I'm talking with the author of
 PSGML, and the next version will have a mechanism to force
 sgml-fill-element to leave certain elements untouched. This version of
 PSGML is still in alpha test.

 However, if you're a LISP programmer with some spare time, I'd be grateful
 if you could take a look at PSGML and see if you can put together a quick
 hack.

 I think a check needs adding to the sgml-fill-element function that causes
 it to bail out early if the context for the current element is in a list
 of elements that should not be ``filled''.
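
 To illustrate the kind of check I mean, here is a rough sketch in Python
 rather than LISP. It is not PSGML code, and it works on plain text with
 regular expressions instead of a parsed document. The element list in
 NO_FILL and all the function names are assumptions for the example, and it
 assumes lower case tags that are not nested inside one another.

```python
import re
import textwrap

# Elements whose layout in the source is significant (an assumed list).
NO_FILL = ("programlisting", "screen", "literallayout")

# One complete verbatim element, from its start tag to its end tag.
VERBATIM = re.compile(r"<(%s)\b[^>]*>.*?</\1>" % "|".join(NO_FILL), re.DOTALL)
BLANK_LINE = re.compile(r"(\n[ \t]*\n)")


def fill_block(block, width):
    """Rewrap one run of text, keeping its leading and trailing whitespace."""
    body = block.strip()
    if not body:
        return block
    lead = block[: len(block) - len(block.lstrip())]
    trail = block[len(block.rstrip()):]
    filled = textwrap.fill(
        body, width, break_long_words=False, break_on_hyphens=False
    )
    return lead + filled + trail


def fill_chunk(chunk, width):
    """Rewrap each blank-line separated block in a chunk of text."""
    return "".join(fill_block(b, width) for b in BLANK_LINE.split(chunk))


def refill(text, width=76):
    """Rewrap everything except the elements named in NO_FILL."""
    out, pos = [], 0
    for match in VERBATIM.finditer(text):
        out.append(fill_chunk(text[pos:match.start()], width))
        out.append(match.group(0))  # copied through untouched
        pos = match.end()
    out.append(fill_chunk(text[pos:], width))
    return "".join(out)
```

 The important part is the loop in refill(): anything matched by VERBATIM is
 copied through as is, and only the text between those matches is rewrapped.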

7.4   Replace comments

The conversion process strips out the comments from the source files. The
existing source files will need to be examined, and the comments inserted
into the new file.

7.5   Remove empty markup

Due to formatting in the original Handbook, the conversion process writes a
number of empty elements, mostly

     <para></para>

These will be removed.
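
A minimal sketch of how this removal could be scripted, assuming that each
empty element sits on a line of its own (possibly indented, possibly with
only whitespace between the tags). The function name is made up for the
example.

```python
import re

# An empty <para> element on a line of its own, together with its
# indentation and the newline that ends the line.
EMPTY_PARA = re.compile(r"^[ \t]*<para>\s*</para>[ \t]*\n?", re.MULTILINE)


def remove_empty_paras(docbook):
    """Delete <para> elements that contain nothing but whitespace."""
    return EMPTY_PARA.sub("", docbook)
```

Paragraphs with any real content are left alone, so the result still needs
validating afterwards.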

7.6   Fixup the markup choice

The conversion process selects sub-optimal markup in many cases. This is
because the translation is from a less expressive DTD to a more expressive
DTD.

For example, the converted document has a slew of

      <emphasis role="tt">/a/file/name</emphasis>

that should be converted to

      <filename>/a/file/name</filename>

A cursory inspection of the converted file shows opportunities in the
Handbook for filename, prompt, acronym, command, application, userinput and
so on.

There are also many sections designated as notes, which should be marked up
using the appropriate element from note, warning, tip and so on.

Each markup change will be a separate commit. So, all the changes to use the
filename element will be made and committed, then all the changes to use
prompt will be committed, and so on.
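
As an example of one such change, the filename substitution could be
sketched like this in Python. The heuristic is an assumption: content that
starts with a slash and contains no whitespace or nested markup is treated
as a file name. Anything else in the "tt" role is left for inspection by
hand, since it might equally be a command or user input.

```python
import re

# <emphasis role="tt"> whose content looks like an absolute path:
# a leading slash, no whitespace, no nested markup.
TT_PATH = re.compile(r'<emphasis role="tt">(/[^<\s]*)</emphasis>')


def tt_to_filename(docbook):
    """Rewrite path-like <emphasis role="tt"> elements as <filename>."""
    return TT_PATH.sub(r"<filename>\1</filename>", docbook)
```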

7.7   Fixup other errors

The mechanical conversion introduces other errors that cause the converted
handbook to fail to validate. These will be fixed on a case by case basis
until the Handbook validates.

8   Split the Handbook into smaller files

8.1   Directory structure and file names

I intend to split the Handbook into files organised along chapter lines.
Each chapter in the Handbook will get its own directory, and the primary
file for that chapter will be called chapter.sgml in its directory.

The content for an individual chapter may be contained entirely within the
chapter.sgml file. Or it might be separated out further along section lines.

The biggest benefit of splitting out the chapters into individual
directories is that it prepares the Handbook for the time when graphical
content will be included in some chapters. Splitting the content into
directories allows all the content for a particular chapter to be kept
together.

The top level handbook directory will contain handbook.sgml as it currently
does. In addition, there will be other .sgml files. These files will contain
entity definitions in the same way that authors.sgml, lists.sgml and
sections.sgml do at the moment.

To emphasise that these files contain content that will be used by other
files, and are not directly usable themselves (they have no DOCTYPE) the
filenames will start with a leading underscore.

 Current filename     New filename
 ---------------------------------
 authors.sgml         _authors.sgml
 lists.sgml           _lists.sgml
 sections.sgml        _chapters.sgml

These entity files will also need converting to DocBook.

8.2   Add a DOCTYPE to the split files

 NOTE:

 I don't know if this is actually possible. I include it here so that
 someone with more SGML knowledge than myself can comment on the idea.

With the Handbook as it stands at the moment, it's hard to work on a part of
it and see the results of your change without rebuilding the entire
handbook. This can take a while.

At the moment, the files that comprise the Handbook cannot be processed on
their own; each has to be processed as a part of the entire Handbook.

I'd like to change this so that you could do

     % cd /top/of/handbook
     % make
     ... this converts the entire Handbook ...
     % cd Introduction
     % make
     ... this just makes the Introduction chapter ...


Obviously, the results of just making the introduction would probably
contain unresolved references to internal link targets, entities and so on.
But it's only intended to allow an author to check their work in progress
without needing to rebuild the whole Handbook.

At first, I thought this could be accomplished by adding a DOCTYPE to the
top of each chapter file. However, this fails when building the entire
Handbook, since it then has multiple DOCTYPE entries.

Instead, I thought something like the following could be used;

     <![ %doctype [
     <!DOCTYPE CHAPTER PUBLIC "-//Davenport//DTD DocBook V3.0//EN">
     ]]>

to determine whether or not to include the DOCTYPE. But some simple tests
show that this doesn't work either. As I say, comments from the SGML
cognoscenti are welcome.

8.3   Add a new entity file, _misc.sgml

I think there's a place for a description of miscellaneous entities to be
used in the Handbook, mostly to help ensure consistency.

Right now I only have two in mind:

     <!ENTITY prompt.root "<prompt>#</prompt>">
     <!ENTITY prompt.user "<prompt>%</prompt>">


which would be used in all examples that needed to indicate whether the user
was to perform a particular action as root or as a regular user.

8.4   Bring in the lost entities

The current handbook.sgml includes some entity definitions at the top of the
file that will be lost in the conversion process. They need to be added back
in.

9   Automate converting the Handbook to target formats

At a minimum the Handbook must be convertible to plain text, HTML and
Postscript. Jade provides backends to convert to a number of other formats,
including RTF, TeX and an SGML translation.

Using Norm Walsh's stylesheets, HTML can be produced with

     % jade -t sgml -d /path/to/docbook.dsl handbook.sgml

and RTF with

     % jade -t rtf -d /path/to/docbook.dsl handbook.sgml
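
A small sketch of how a build script might assemble these two command lines,
written in Python purely for illustration. Only the jade options shown above
are used; the function names and the BACKENDS table are assumptions, and the
stylesheet path is a placeholder.

```python
# Output format -> jade backend name, taken from the two commands above.
BACKENDS = {"html": "sgml", "rtf": "rtf"}


def jade_command(backend, stylesheet, source="handbook.sgml"):
    """Return the argument list for one jade run."""
    return ["jade", "-t", backend, "-d", stylesheet, source]


def build_commands(stylesheet, source="handbook.sgml"):
    """Return one jade command line per output format."""
    return {
        fmt: jade_command(backend, stylesheet, source)
        for fmt, backend in BACKENDS.items()
    }
```

Each list can be handed to subprocess.run(cmd, check=True) once jade and the
stylesheets are installed.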

The use of the TeX backend is still being worked on. I plan on tracking down
more information about this over the coming week.

In order to make the Handbook buildable with easily available FreeBSD tools,
a port will need to be made of both Norm Walsh's DocBook stylesheets and the
JadeTeX macros.

As normal, the Handbook will have a Makefile to help automate the
conversion. Initially, I expect that this Makefile will be self contained
rather than relying on bsd.sgml.mk. This is so that the generation of the
FAQ (which will be a separate conversion project) can continue unchanged.
Eventually, of course, both the FAQ and the Handbook will be marked up in
DocBook, and then the common Makefile code can move to bsd.sgml.mk.

10   Merge changes from HEAD and commit, replacing the HEAD

When the migration is complete, a diff of the Handbook on the HEAD will be
taken, and any content changes that were made while the Handbook was being
migrated will be applied to the DocBook version (by hand).

At this point the DocBook conversion of the Handbook can replace the current
LinuxDoc version.

11   If you want to help...

If you've read all of the above then firstly, my thanks.

Secondly, if you've got any comments or suggestions, please feel free to
make them.

Thirdly, about the only thing in all this that I'm not completely sure about
is the final conversion to TeX. In particular, I haven't experimented with
the JadeTeX macros yet. If you have, or you're pretty handy with TeX and fancy
volunteering to answer some of my (quite possibly) silly questions, please
step forward.
-- 
Work: nik@iii.co.uk                       | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk    | Remind me again why we need
Play: nik@freebsd.org                     | Microsoft?



