From: Nik Clayton
To: doc@FreeBSD.ORG
Date: Sun, 15 Mar 1998 23:32:40 +0000
Subject: Handbook LinuxDoc -> DocBook migration
Reply-To: nik@FreeBSD.ORG
Message-ID: <19980315233240.48817@nothing-going-on.org>

Hi folks,

For those that are interested in the project to migrate the Handbook
from the LinuxDoc DTD to DocBook, please read the following. It's my
current plan for how to do this. Comments welcome.

Handbook DTD Migration

By Nik Clayton
Email: nik@freebsd.org

Contents

  1 Background
  2 Current Handbook layout
  3 Translation
  4 Branching the repository
  5 Software requirements
  6 Doing the conversion
    6.1 Protect entities
    6.2 Mechanically convert the Handbook to DocBook
  7 Clean up the converted file
    7.1 Reindent
    7.2 Refill the paragraphs
    7.3 Fix entries broken by the previous two changes
    7.4 Replace comments
    7.5 Remove empty markup
    7.6 Fixup the markup choice
    7.7 Fixup other errors
  8 Split the Handbook into smaller files
    8.1 Directory structure and file names
    8.2 Add a DOCTYPE to the split files
    8.3 Add a new entity file, _misc.sgml
    8.4 Bring in the lost entities
  9 Automate converting the Handbook to target formats
  10 Merge changes from HEAD and commit, replacing the HEAD
  11 If you want to help...

NOTE: This document assumes that the reader has a basic understanding
of SGML.

Over the past six months or so a consensus has emerged that the FreeBSD
Handbook (the Handbook) should be migrated from its existing DTD
(LinuxDoc) to the DocBook DTD. This document outlines how that will be
achieved.

This is a work in progress, and comment is welcomed. It is also a
lengthy document, but I'm trying to be as thorough as possible.

1 Background

The FreeBSD Handbook is currently marked up using the LinuxDoc DTD. A
variety of tools are then used to convert the Handbook to other
formats, including HTML, ASCII text and PostScript.

It is generally agreed that the LinuxDoc DTD is not up to the task of
encoding the meaning of elements of the Handbook in sufficient detail.
It has been decided that the Handbook should migrate to the DocBook
DTD. DocBook is expressly designed for writing technical documentation
such as the Handbook, and features a rich element set. It is also
relatively easy to extend.

As is often the case in volunteer projects such as FreeBSD (and Linux),
no one has had the time to work through the issues involved in the
migration, and then commit to being able to do the work.

That's recently changed. I've just been able to commit a large chunk of
spare time to this project and (in conjunction with John Fieber) have
been working through the issues involved.

The rest of this document aims to bring the interested reader up to
speed on what's about to happen to the Handbook, and should provide
sufficient detail for the interested SGML hacker to let me know what
I've missed.

2 Current Handbook layout

The Handbook is currently organised as a collection of files (with a
.sgml extension) in one directory.

Some of these files contain SGML entity definitions, and exist only to
be included in the other SGML files.

The other files form the chapters and sections of the Handbook. Some
chapters are entirely in one file, others are split and are stored in
several files.

This is (IMHO) mildly annoying. The migration process provides an
opportunity to address this.

3 Translation

The FreeBSD Handbook has been translated to Japanese, and the Japanese
translators track changes to the Handbook and convert the changes by
hand.

In order to make their task easier during this migration, almost all
the changes made will be automated, allowing the Japanese team to
easily replicate them.

In addition, no changes to the content of the Handbook will be made
until after the Handbook has been converted to DocBook.

4 Branching the repository

The migration process consists of a number of discrete steps. The state
of the Handbook mid-way through these steps is not suitable for
``public consumption''. In addition, it is possible that unforeseen
problems will occur during the migration.

It would be possible to do all of the migration process ``offline'',
and only commit the converted Handbook when the process was complete.
However, this would deny the Japanese translators access to the diffs
of the Handbook's state as it is converted.

For these reasons, the conversion process must not happen on the HEAD
of the CVS repository. Instead, the CVS doc repository will be branched
with the tag ``LINUXDOC_2_DOCBOOK''. All the migration commits will
happen along this branch.

5 Software requirements

The conversion process requires the following applications and other
pieces of software. Most of these are available in the FreeBSD ports
collection.

  * LinuxDoc DTD (textproc/linuxdoc)

  * DocBook DTD (textproc/docbook)

  * jade and nsgmls (textproc/jade)

  * instant and the LinuxDoc to DocBook translation specification
    (textproc/sgmlformat)

  * perl for entity protection (lang/perl5)

  * xemacs and the psgml package to make editing the Handbook
    considerably simpler (editors/xemacs20). psgml also features
    commands to assist in the reformatting of the Handbook.

  * Norm Walsh's Modular DocBook stylesheet, available at
    http://www.berkshire.net/~norm/dsssl/docbook/.

    Norm has written DocBook stylesheets that can be used with Jade to
    transform text marked up in DocBook to HTML and RTF.

    These stylesheets will be used to convert the Handbook to HTML and
    RTF. From there (in the short term) the RTF can be converted to
    LaTeX and thence to PostScript. The HTML can be used to generate
    the plain text version of the Handbook.

    In the long term, the JadeTeX backend will be used to convert the
    Handbook to TeX, and from there to PostScript.

6 Doing the conversion

Trying to convert the Handbook from LinuxDoc to DocBook on a file by
file basis won't work. At least, not without an immense amount of
grief.

Instead, John Fieber's linuxdoc-docbook translation specification and
instant will be used to do the conversion.

However, the conversion process has a number of interesting wrinkles.

  1. Entity definitions will be lost (see below for a workaround).

  2. Comments in the individual source files will be lost.
     This is (potentially) a problem, since some of those comments
     contain copyright notices, reminders and so forth. They will need
     to be manually added back into the converted file.

  3. Because the migration process is moving from a less expressive DTD
     to a more expressive DTD, the converted document will not take
     full advantage of the elements in the target DTD (DocBook). The
     translated document will need to be examined, and some markup
     substituted.

  4. The result of the conversion process is one large file. This will
     need splitting up into smaller files. It is here that the
     opportunity to reorganise and rename the files that comprise the
     Handbook arises.

6.1 Protect entities

The Handbook uses general entities to represent replaceable text. For
example, &a.jkh; expands to Jordan's name and e-mail address.

Unfortunately, the conversion process will cause all these entities to
expand to their full representation, and the entity definition will be
lost.

The solution is to protect each entity, by making each entity refer to
its own name, while storing its expanded form elsewhere.

I have a relatively simple Perl script that does this. In essence, it
looks for each entity declaration and converts it to a form that
expands to a placeholder for its own name (the sed command in section
6.2 turns the -entity- prefix back into &), which ensures that the
process can be reversed.

A similar operation is performed on entities that refer to files with
the SYSTEM keyword, rather than containing CDATA.

6.2 Mechanically convert the Handbook to DocBook

The Handbook can now be converted to the DocBook DTD. The command line
to accomplish this is

    # nsgmls -c /usr/local/share/sgml/linuxdoc/catalog handbook.sgml | \
        instant -t /usr/local/share/sgml/transpec/linuxdoc-docbook.t | \
        sed -e 's/-entity-/\&/g' > handbook.docb

The result of this conversion is a file containing syntactically valid
but very ugly DocBook markup. This converted file is not yet ready to
be converted to HTML.

The one redeeming feature of this process is that it is completely
automatic.
This should allow anyone working on the translation of the Handbook to
do the same thing to the translated version and get the same results.

7 Clean up the converted file

After the conversion, the Handbook SGML file will be cleaned up. Each
one of these steps is a separate commit.

7.1 Reindent

The file will be reindented. The easiest way to do this is to load the
file into xemacs, activate sgml-mode and run the following function
(which is not a part of psgml).

    (defun sgml-indent-buffer ()
      "Indents the current buffer, one line at a time"
      (interactive "*")
      (save-excursion
        (goto-char (point-min))
        ;; forward-line returns 0 for as long as it can move down a line
        (while (= (forward-line 1) 0)
          (sgml-indent-or-tab))))

7.2 Refill the paragraphs

Again, the easiest way to refill (rewrap) the paragraphs in the file is
to use sgml-mode. In this case, the cursor will be placed on the first
element in the new Handbook (``book'') and M-x sgml-fill-element will
be run.

7.3 Fix entries broken by the previous two changes

The two previous changes have an undesirable side effect. Any content
that is supposed to be rendered based on its layout in the SGML source
will have had that layout corrupted.

Examples of this include program listings, examples for the user to
type in, PGP key blocks, and so on.

These entries will need to be fixed up (reindented, and so on) by hand,
based on their layout in the original Handbook.

Automating this: This cannot (currently) be automated. I'm talking with
the author of PSGML, and the next version will have a mechanism to
force sgml-fill-element to leave certain elements untouched. This
version of PSGML is still in alpha test.

However, if you're a LISP programmer with some spare time, I'd be
grateful if you could take a look at PSGML and see if you can put
together a quick hack. I think a check needs adding to the
sgml-fill-element function that causes it to bail out early if the
context for the current element is in a list of elements that should
not be ``filled''.
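
The ``bail out early'' check described in 7.3 can be illustrated
outside of PSGML. The following is only a rough Python sketch of the
idea, not PSGML code and not a proposed implementation: the function
names, the VERBATIM list and the regular-expression approach are all
assumptions made for illustration, and it does not handle nested or
unclosed elements.

```python
import re
import textwrap

# Elements whose source layout is significant. This list is an
# assumption for illustration; the real list would need agreeing on.
VERBATIM = ("programlisting", "screen", "literallayout")

_VERBATIM_RE = re.compile(
    r"<(?P<tag>%s)\b[^>]*>.*?</(?P=tag)>" % "|".join(VERBATIM),
    re.DOTALL,
)


def _fill_chunk(chunk: str, width: int) -> str:
    """Rewrap each blank-line-separated paragraph in chunk."""
    filled = []
    for part in re.split(r"(\n[ \t]*\n)", chunk):
        if not part.strip():
            filled.append(part)  # separators and empty pieces pass through
            continue
        lead = "\n" if part.startswith("\n") else ""
        trail = "\n" if part.endswith("\n") else ""
        body = textwrap.fill(
            " ".join(part.split()),
            width,
            break_long_words=False,   # never split a long tag or path
            break_on_hyphens=False,
        )
        filled.append(lead + body + trail)
    return "".join(filled)


def refill(sgml: str, width: int = 70) -> str:
    """Refill text, copying every VERBATIM element through untouched."""
    out = []
    pos = 0
    for match in _VERBATIM_RE.finditer(sgml):
        out.append(_fill_chunk(sgml[pos:match.start()], width))
        out.append(match.group(0))  # layout-sensitive: leave as found
        pos = match.end()
    out.append(_fill_chunk(sgml[pos:], width))
    return "".join(out)
```

Calling refill() on the converted file's text would rewrap the running
prose but leave the contents of any programlisting, screen or
literallayout element exactly as they were, which is the behaviour
wanted from sgml-fill-element.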

7.4 Replace comments

The conversion process strips out the comments from the source files.
The existing source files will need to be examined, and the comments
inserted into the new file.

7.5 Remove empty markup

Due to formatting in the original Handbook, the conversion process
writes a number of empty elements. These will be removed.

7.6 Fixup the markup choice

The conversion process selects sub-optimal markup in many cases. This
is because the translation is from a less expressive DTD to a more
expressive DTD.

For example, the converted document has a slew of file names such as
/a/file/name in generic markup that should instead be marked up as

    <filename>/a/file/name</filename>

A cursory inspection of the converted file shows opportunities in the
Handbook for filename, prompt, acronym, command, application, userinput
and so on.

There are also many sections designated as notes, which should be
marked up using the appropriate element from note, warning, tip and so
on.

Each markup change will be a separate commit. So, all the changes to
use the filename element will be made and committed, then all the
changes to use prompt will be committed, and so on.

7.7 Fixup other errors

The mechanical conversion introduces other errors that cause the
converted Handbook to fail to validate. These will be fixed on a case
by case basis until the Handbook validates.

8 Split the Handbook into smaller files

8.1 Directory structure and file names

I intend to split the Handbook into files organised along chapter
lines.

Each chapter in the Handbook will get its own directory, and the
primary file for that chapter will be called chapter.sgml in its
directory.

The content for an individual chapter may be contained entirely within
the chapter.sgml file, or it might be separated out further along
section lines.

The biggest benefit of splitting out the chapters into individual
directories is that it prepares the Handbook for the time when
graphical content will be included in some chapters.
Splitting the content into directories allows all the content for a
particular chapter to be kept together.

The top level handbook directory will contain handbook.sgml as it
currently does. In addition, there will be other .sgml files. These
files will contain entity definitions in the same way that
authors.sgml, lists.sgml and sections.sgml do at the moment.

To emphasise that these files contain content that will be used by
other files, and are not directly usable themselves (they have no
DOCTYPE), the filenames will start with a leading underscore.

    Current filename    New filename
    ----------------    --------------
    authors.sgml        _authors.sgml
    lists.sgml          _lists.sgml
    sections.sgml       _chapters.sgml

These entity files will also need converting to DocBook.

8.2 Add a DOCTYPE to the split files

NOTE: I don't know if this is actually possible. I include it here so
that someone with more SGML knowledge than myself can comment on the
idea.

With the Handbook as it stands at the moment, it's hard to work on a
part of it and see the results of your change without rebuilding the
entire Handbook. This can take a while.

At the moment, each file that comprises the Handbook cannot be
processed on its own; it has to be processed as a part of the entire
Handbook.

I'd like to change this so that you could do

    % cd /top/of/handbook
    % make
    ... this converts the entire Handbook ...
    % cd Introduction
    % make
    ... this just makes the Introduction chapter ...

Obviously, the results of just making the introduction would probably
contain unresolved references to internal link targets, entities and so
on. But it's only intended to allow an author to check their work in
progress without needing to rebuild the whole Handbook.

At first, I thought this could be accomplished by adding a DOCTYPE to
the top of each chapter file. However, this fails when building the
entire Handbook, since it then has multiple DOCTYPE entries.
Instead, I thought a conditional marked section (the construct that
ends in ]]>) could be used to determine whether or not to include the
DOCTYPE. But some simple tests show that this doesn't work either.

As I say, comments from the SGML cognoscenti welcome.

8.3 Add a new entity file, _misc.sgml

I think there's a place for a description of miscellaneous entities to
be used in the Handbook, mostly to help ensure consistency. Right now I
only have two in mind,

    <!ENTITY ... "#">
    <!ENTITY ... "%">

which would be used in all examples that needed to indicate whether the
user was to perform a particular action as root or as a regular user.

8.4 Bring in the lost entities

The current handbook.sgml includes some entity definitions at the top
of the file that will be lost in the conversion process. They need to
be added back in.

9 Automate converting the Handbook to target formats

At a minimum the Handbook must be convertible to plain text, HTML and
PostScript.

Jade provides backends to convert to a number of other formats,
including RTF, TeX and an SGML translation.

Using Norm Walsh's stylesheets, HTML can be produced with

    % jade -t sgml -d /path/to/docbook.dsl handbook.sgml

and RTF with

    % jade -t rtf -d /path/to/docbook.dsl handbook.sgml

The use of the TeX backend is still being worked on. I plan on tracking
down more information about this over the coming week.

In order to make the Handbook buildable with easily available FreeBSD
tools, a port will need to be made of both Norm Walsh's DocBook
stylesheets and the JadeTeX macros.

As normal, the Handbook will have a Makefile to help automate the
conversion. Initially, I expect that this Makefile will be self
contained rather than relying on bsd.sgml.mk. This is so that the
generation of the FAQ (which will be a separate conversion project) can
continue unchanged.

Eventually, of course, both the FAQ and the Handbook will be marked up
in DocBook, and then the common Makefile code can move to bsd.sgml.mk.
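
Purely as an illustration of how the two jade invocations in section 9
could be driven from one place, here is a small Python sketch. It is
not the planned Makefile: the jade_command() helper and the BACKENDS
table are hypothetical names, and only the two backends quoted above
(sgml for HTML, rtf for RTF) are included because the TeX route is
still undecided.

```python
from typing import List

# Target format -> jade backend, taken from the two commands in
# section 9. The dictionary itself is a hypothetical convenience.
BACKENDS = {"html": "sgml", "rtf": "rtf"}


def jade_command(target: str, stylesheet: str,
                 source: str = "handbook.sgml") -> List[str]:
    """Return the jade argument list for one target format."""
    try:
        backend = BACKENDS[target]
    except KeyError:
        raise ValueError("no jade backend known for %r" % target) from None
    return ["jade", "-t", backend, "-d", stylesheet, source]


if __name__ == "__main__":
    # Print the commands rather than running them, since the stylesheet
    # path is a placeholder.
    for fmt in sorted(BACKENDS):
        print(" ".join(jade_command(fmt, "/path/to/docbook.dsl")))
```

The returned list can be handed to subprocess.run() once a real
stylesheet path is known; keeping the command as a list avoids any
shell quoting issues with the path.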

10 Merge changes from HEAD and commit, replacing the HEAD

When the migration is complete, a diff of the Handbook on the HEAD will
be taken, and any content changes that were made while the Handbook was
migrated will be applied to the DocBook version (by hand).

At this point the DocBook conversion of the Handbook can replace the
current LinuxDoc version.

11 If you want to help...

If you've read all of the above then firstly, my thanks.

Secondly, if you've got any comments or suggestions, please feel free
to make them.

Thirdly, about the only thing in all this that I'm not completely sure
about is the final conversion to TeX. In particular, I haven't
experimented with the JadeTeX macros yet. If you have, or you're pretty
handy with TeX and fancy volunteering to answer some of my (quite
possibly) silly questions, please step forward.

-- 
Work: nik@iii.co.uk                    | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need
Play: nik@freebsd.org                  | Microsoft?