Date:      Tue, 6 Jul 1999 11:55:26 +0100
From:      Nik Clayton <nclayton@lehman.com>
To:        chris@calldei.com, Bill Fumerola <billf@chc-chimes.com>, doc@freebsd.org
Cc:        hackers@freebsd.org
Subject:   Searching the Handbook (was Re: 'rtfm script')
Message-ID:  <19990706115526.Z15628@lehman.com>
In-Reply-To: <19990705141635.D97224@holly.dyndns.org>; from Chris Costello on Mon, Jul 05, 1999 at 02:16:36PM -0500
References:  <Pine.HPP.3.96.990705100523.26110A-100000@hp9000.chc-chimes.com> <19990705141635.D97224@holly.dyndns.org>

I've added doc@freebsd.org to the distribution list, for obvious reasons.

On Mon, Jul 05, 1999 at 02:16:36PM -0500, Chris Costello wrote:
> On Mon, Jul 5, 1999, Bill Fumerola wrote:
> > I'm in favor of the rtfm script. It's amazing the positive
> > things that come out of an offhand IRC comment.
> > 
> > [ from http://www.emsphone.com/FreeBSD/log.cgi/19990704.txt ]
> > 
> > [15:33] <cmc> First it'll search the man pages.  Then the handbook.  Then
> > the FAQ.  Then, maybe see if I can find out if they start bitching, and if
> > so, email Jesus Monroy.
> 
>    Note that I can't figure out a decent way to search the
> Handbook at this point, but I'm open to ideas.

There are a couple of ways you could do it, some better than others.

   Executive summary:  sgrep is probably your best choice now, and it can
   be found at <URL:http://www.cs.helsinki.fi/~jjaakkol/sgrep.html>.
   But read on for more.

The simplest way is to assume that the user has the plain text handbook
installed, and do a simple grep through that for what you're looking for.

This is nice and easy to do, but doesn't take advantage of the additional
smarts built in to the Handbook's native format.  To do that requires some
additional work.
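
To make the plain text approach concrete, here is a minimal sketch in
Python.  The function name and the sample lines are made up for
illustration; a real script would read the installed plain text Handbook
from wherever it lives on the user's disk.

```python
import re

def grep_lines(lines, word):
    """Yield (line_number, line) for each line containing `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")

# Illustrative stand-in for the lines of the plain text Handbook.
sample = [
    "Run make world\n",
    "Nothing here\n",
    "remake it\n",
    "MAKE install\n",
]

for number, line in grep_lines(sample, "make"):
    print(f"{number}:{line}")
```

Note that every hit is treated the same, whether it came from an example,
a title, or ordinary running text.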

A brief recap for those not au fait with how the Handbook is organised in
source form. 

  The Handbook is 'marked up' in a language called DocBook.  DocBook was
  designed specifically for formatting technical documentation, and looks
  a lot like HTML.  However, instead of tags like <em>, <b>, <ul>, and so
  on, DocBook has tags like <example>, <screen>, <userinput>, <devicename>,
  <filename>, and so forth.

  A document that is marked up in DocBook therefore contains a lot of 
  additional semantic information about the content (and very little 
  formatting information).

  When the Handbook is converted to HTML, some of this semantic information
  is retained.  For example, the DocBook source for an example that the
  user might want to copy verbatim would look like,

      <screen><prompt>#</prompt> <userinput>rm -rf /</userinput></screen>

  and might be converted to HTML that looks like

      <blockquote class="screen">
        <tt><span class="prompt">#</span> <span 
          class="userinput">rm -rf /</span></tt>
      </blockquote>

  Lots more information can be found at 
  http://www.freebsd.org/tutorials/docproj-primer/.

A smart searching mechanism will be able to use this additional semantic
information to reject (or lower the rankings of) results that don't match
what the user wanted.

For example, suppose you're searching the Handbook for examples of the 
make(1) command in action.  The simple string "make" occurs lots of times
in the Handbook.  However, you're only interested in those sections where
it occurs *inside* a <userinput> element; all the other occurrences can be
ignored.
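
A rough sketch of that kind of element-scoped search, in Python.  It
borrows the standard library's forgiving HTML parser to track which
element the text is in.  That is an assumption of convenience, not a real
SGML parser: it will not cope with SGML tag minimisation or entity
declarations, and the class and function names are invented for the
example.

```python
from html.parser import HTMLParser

class ElementSearch(HTMLParser):
    """Collect the text of every `element` whose content contains `word`."""

    def __init__(self, element, word):
        super().__init__(convert_charrefs=True)
        self.element = element.lower()  # HTMLParser reports tag names in lower case
        self.word = word
        self.depth = 0                  # greater than zero while inside the element
        self.buffer = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.element:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.element and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.buffer).strip()
                self.buffer.clear()
                if self.word in text:   # plain substring test
                    self.matches.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)

def search_element(markup, element, word):
    parser = ElementSearch(element, word)
    parser.feed(markup)
    parser.close()
    return parser.matches

# Illustrative DocBook-like fragment, not taken from the Handbook.
sample = """
<para>You can make a backup first.</para>
<screen><prompt>#</prompt> <userinput>make buildworld</userinput></screen>
<screen><prompt>#</prompt> <userinput>ls -l</userinput></screen>
"""

print(search_element(sample, "userinput", "make"))
```

The <para> that mentions "make" is skipped because the search only looks
at text collected while inside <userinput>.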

For a simple rtfm(1) style search most of this can probably be ignored, and
you can just search the plain text handbook.  But even then you might want
to provide switches that allow the user to specify:

  -  Only match this word if found in an example

  -  Only match this word if found in a title

  -  Only match this word if found in a command name

and so on.

How do you do that?  Good question.  This has been on my list of things
to investigate (at the back of my mind) for a while, but more important
things have taken my time.  If anyone's interested in doing this, here's
what I've discovered.

You could go the full SGML route.  This would involve building an 
application that can parse the DocBook source of the Handbook (and other
articles, and soon to be the FAQ) and allow the user to do their queries
using this application.  This is probably the most 'correct' route from
a purist point of view, but is an awful lot of work.

You could go the XML route.  XML is the buzzword of the moment, and can be
thought of as being SGML-Lite.  Writing an XML parser is much easier than
writing an SGML parser, and you could write an XML-aware application that
parses the Handbook and other docs, returning only results that appear
inside certain elements.  This is still a chunk of work, and the end user
will need to keep an XML copy of the documentation somewhere on their disk.
Converting from SGML to XML is not a hard problem for our documents though,
so at least that hurdle is skipped.
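
As a sketch of what the XML route could look like, assuming the docs have
already been converted to well-formed XML with entities expanded, Python's
bundled ElementTree parser is enough to restrict hits to named elements.
The function name and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

def find_in_elements(xml_text, word, elements):
    """Return (tag, text) pairs for each listed element whose text contains `word`."""
    root = ET.fromstring(xml_text)
    wanted = set(elements)
    hits = []
    for node in root.iter():            # visits elements in document order
        if node.tag in wanted:
            # itertext() includes text from nested elements; collapse whitespace.
            text = " ".join("".join(node.itertext()).split())
            if word in text:
                hits.append((node.tag, text))
    return hits

# Illustrative XML fragment, not taken from the Handbook.
sample = """<chapter>
  <title>Using make to rebuild the world</title>
  <para>Always make a backup before you start.</para>
  <screen><prompt>#</prompt> <userinput>make buildworld</userinput></screen>
</chapter>"""

for tag, text in find_in_elements(sample, "make", ["title", "userinput"]):
    print(f"{tag}: {text}")
```

Returning the element name with each hit leaves room for ranking results
later, for example by putting title matches first.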

For an example of this, check out SCOOBS, at <URL:http://www.scoobs.com/>.
This is still probably too heavyweight a solution though.

*Much* simpler is to build a grep-alike that understands structured 
documents, but that doesn't care how those documents are structured.  This
is such a great idea that someone's already done it.  sgrep, which can
be found at <URL:http://www.cs.helsinki.fi/~jjaakkol/sgrep.html>, can
search structured text (such as DocBook, HTML, or even mail files).

Some examples of sgrep queries:

    sgrep 'start or "\n" .. (end or "\n") containing "Hello World"'

You can define macros in sgrep, so the above could be simplified to

    sgrep 'LINE containing "Hello World"'

If you wanted to find all the From: fields in a Unix mbox file:

    sgrep '"\nFrom: " .. "\n" extracting ("\n" in "\nFrom: ")'

or with macros

    sgrep 'MAIL_FROM'

Print out the title from a collection of HTML documents in which the word
"SGML" is mentioned more than 12 times, or which have the word "SGML" 
inside H1 or H2 elements:

    sgrep 'HTML_TITLE in (start .. end containing (\
        join(12,"SGML") or (HTML_H1 or HTML_H2 containing "SGML") ) )' *.html

rtfm(1) could provide a simpler front-end to a series of canned sgrep 
searches, depending on switches passed to rtfm(1).
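
A minimal sketch of such a front-end, in Python.  Everything here is an
assumption for illustration: the switch names, the mapping to DocBook
elements, and the file name handbook.sgml are invented, and the scoped
expression is only modelled on the queries above, so check it against the
sgrep manual before relying on it.  The sketch builds the command line
and prints it rather than running sgrep.

```python
import shlex

# Hypothetical rtfm(1) switch names mapped to DocBook elements.
SCOPES = {
    "example": "userinput",
    "title": "title",
    "command": "command",
}

def sgrep_query(word, scope=None):
    """Build an sgrep expression for `word`, optionally limited to one scope."""
    if '"' in word or "\\" in word:
        raise ValueError("quotes and backslashes are not handled in this sketch")
    phrase = f'"{word}"'
    if scope is None:
        # Unscoped: the line-oriented query shown earlier in this message.
        return 'start or "\\n" .. (end or "\\n") containing ' + phrase
    element = SCOPES[scope]
    # Assumes start tags carry no attributes, e.g. <title> and not <title id="x">.
    return f'"<{element}>" .. "</{element}>" containing {phrase}'

def sgrep_command(word, scope, files):
    """Return the argument list for a canned sgrep search."""
    return ["sgrep", sgrep_query(word, scope), *files]

# Print the command a user could paste into a shell.
print(shlex.join(sgrep_command("make", "example", ["handbook.sgml"])))
```

Keeping the canned queries in one table means a new switch is one more
entry, and the table could later be pointed at a different query language.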

As you can probably tell, I'm in favour of the sgrep(1) approach, simply
because you'll get something working much faster.  

One caveat though -- the sgrep query language is not standard, and is only
implemented by sgrep.  There is a proposal going through for something 
called XQL, the XML Query Language.  In the long run, something that
supported searching using XQL is likely to be most useful.  But in the 
short-term, sgrep will probably get you up and running quickly.

More information about XQL can be found at
<URL:http://www.w3.org/TandS/QL/QL98/pp/xql.html>.  If you do a search
for "xql" at Google (<URL:http://www.google.com/>) then you'll turn up 
all sorts of goodies, including various Perl and Python interfaces to 
XQL, which might make writing an XQL search system easier.

HTH,

N
-- 
--+==[ Systems Administrator, Year 2000 Test Lab, Lehman Brothers, Inc. ]==+--
--+==[      1 Broadgate, London, EC2M 7HA     0171-601-0011 x5514       ]==+--
--+==[              Year 2000 Testing: It's about time. . .             ]==+--





