From owner-freebsd-database  Tue Mar 31 04:49:50 1998
Return-Path: <owner-freebsd-database@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id EAA08645
          for freebsd-database-outgoing; Tue, 31 Mar 1998 04:49:50 -0800 (PST)
          (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id EAA08639
          for <freebsd-database@FreeBSD.ORG>; Tue, 31 Mar 1998 04:49:47 -0800 (PST)
          (envelope-from nik@iii.co.uk)
From: nik@iii.co.uk
Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25])
	by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id NAA10043;
	Tue, 31 Mar 1998 13:49:45 +0100 (BST)
Received: (from nik@localhost)
	by carrig.strand.iii.co.uk (8.8.8/8.8.7) id NAA08533;
	Tue, 31 Mar 1998 13:49:29 +0100 (BST)
Message-ID: <19980331134928.03855@iii.co.uk>
Date: Tue, 31 Mar 1998 13:49:28 +0100
To: shimon@simon-shapiro.org
Cc: freebsd-database@FreeBSD.ORG
Subject: Re: Mailing list search interface
References: <19980330164024.47510@iii.co.uk> <XFMail.980330122500.shimon@simon-shapiro.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.85e
In-Reply-To: <XFMail.980330122500.shimon@simon-shapiro.org>; from Simon Shapiro on Mon, Mar 30, 1998 at 12:25:00PM -0800
Organization: interactive investor
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On Mon, Mar 30, 1998 at 12:25:00PM -0800, Simon Shapiro wrote:
> > The time above was for creation of the HTML archives and for indexing,
> > not just indexing alone.
> 
> This is something we need to keep in mind.  Generating 100% output coverage
> for (probably) less than 10% need is wasteful.

True enough. 

I'm probably going to shut up in a second, because this is starting to 
get out of my depth. 

My background in this is as follows; at work I run a number of mailing 
lists (~1000 subscribers across all of them, and ~150 messages a day
tops, so this is not close to the level of the FreeBSD lists).

Getting a publically available archive has been on my to do list for a
while. The quick solution[1] has involved MHonArc, Glimpse and Wilma (as the
glue between the two of them). As a part of that implementation, I've had
to make a bunch of changes to Wilma to make it more generic and remove
some of the assumptions in the code.

At the back of my mind was the thought that this would be quite useful 
for the FreeBSD site, which is about where I started paying attention to
the messages in the pgaccess thread (!) about the mailing list search.

It certainly looks like Wolfram and co. have been making a number of 
changes to the list archive software to make it more useful (and as John
says, things like MHonArc and Glimpse will not scale well, although 
they're fine for my current problem)

> > I'd try this with the whole of the archives, but I don't have the spare
> > disk space (yet).
> 
> I have.  Is there an efficient way to get the whole archive here? 
> Downloading on a modem is NOT considered efficient.

Don't know. I FTP it down at work, drop it on a Zip disk and take it home.

> > A hybrid system is on my list of things to build here (but it'll be 
> > Oracle based). I haven't investigated Postgres enough to know if it's
> > up to the task.
> 
> Oracle based is good.  Now, plase tell us how to run Oracle on FreeBSD,
> legally, and with source available.

You can't (yet). But should I get the chance to implement something Oracle 
based at work, I should be able to at least apply that learning to 
reimplementing it in Postgres.

> PostgreSQL is up to the task.  This is not a dramatically complex database
> problem.  Pretty much a linear table, with the text searching TBD.

My initial thoughts on this (and I have done *very* little reading on the
subject at the moment) is to use the database only for the threading 
information. Store all the message information somewhere else, with an
efficient way to 

   a) retrieve a single message from the message archive (based on 
      a unique key, possibly the message-id (although I'm not 100%
      certain that that's guaranteed to be unique)).

   b) Search the entire message archive, returning the message keys that
      match the search.

In the (Oracle|Postgres) database, have one table per archive that has the
columns:

    ID          - Unique ID for this message
    PARENT      - ID of the parent message in this thread, NULL if this is
                  the first message in the thread.
    STRICT      - Is this in the thread because of strict threading
                  (a references or in-reply-to line was present in the
                  original message) or because the subject line matched?
    AUTHOR
    SUBJECT
   
Given a particular message ID, you could then construct the thread tree
from that message down (in Oracle) with SQL similar to:

       select ID, AUTHOR, SUBJECT
         from ARCHIVE_HACKERS
   connect by PARENT = prior ID
   start with ID = 'message_id';

I haven't tested that.

As I say, I'm still in the very early planning stages of this, and have a
lot of reading (and thinking) to do first.

N

[1] I don't like implementing 'quick solutions' any more than the rest of 
    you, but I was outvoted on this one.
-- 
Work: nik@iii.co.uk                       | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk    | Remind me again why we need
Play: nik@freebsd.org                     | Microsoft?

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message