Date:      Mon, 30 Mar 1998 16:40:24 +0100
From:      nik@iii.co.uk
To:        John Fieber <jfieber@indiana.edu>
Cc:        shimon@simon-shapiro.org, Wolfram Schneider <wosch@cs.tu-berlin.de>, freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami <asami@FreeBSD.ORG>, Amancio Hasty <hasty@rah.star-gate.com>
Subject:   Re: Mailing list search interface
Message-ID:  <19980330164024.47510@iii.co.uk>
In-Reply-To: <Pine.BSF.3.96.980330091604.485T-100000@fallout.campusview.indiana.edu>; from John Fieber on Mon, Mar 30, 1998 at 09:48:45AM -0500
References:  <19980330110200.17368@iii.co.uk> <Pine.BSF.3.96.980330091604.485T-100000@fallout.campusview.indiana.edu>

On Mon, Mar 30, 1998 at 09:48:45AM -0500, John Fieber wrote:
> > At the end of the conversion (which consisted of running MHonArc 2.2.0
> > over the files, and then using Glimpse 4.1 to index them) I had a total
> > of 32,910K HTML and index files.
> > 
> > The output of 'time -l' on the conversion process was:
> >
> >       626.11 real       438.83 user        93.13 sys
> 
> On what sort of hardware?

200 MHz PPro w/64MB of RAM and 256MB of swap. At the time I was running
XFree86 3.3.2, Netscape, XEmacs and a dozen or so xterms (tcsh, mutt,
slrn). Load hovered around the 0.9-1.1 mark. Interactive response was fine.

My disk is a single 2GB Atlas II, with tagged queuing turned *off* (because
of buggy firmware which I haven't updated yet).

> By quick back-of-an-envelope calculations, this is slower than
> the current indexing scheme on hub by at least a factor of 10.

The time above covers both creating the HTML archives and indexing them,
not indexing alone.
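
As a rough sanity check on the 'time -l' figures quoted above, the sketch
below adds user and system time and compares the sum with the wall clock
time. It uses only the three numbers from the message; the function name
cpu_share is made up for illustration.

```python
def cpu_share(real, user, system):
    """Fraction of wall clock time the process spent on the CPU."""
    return (user + system) / real

# Figures from the 'time -l' output quoted in the message.
real, user, system = 626.11, 438.83, 93.13

share = cpu_share(real, user, system)
print(f"CPU time: {user + system:.2f}s of {real:.2f}s ({share:.0%})")
print(f"Not on CPU (waiting or descheduled): {real - user - system:.2f}s")
```

User plus system time comes to about 532 seconds, roughly 85% of the
elapsed time. That hints that this particular run was closer to CPU bound
than I/O bound, but with a desktop session running on the same machine it
is only an indication, and it says nothing about a data set that no longer
fits in RAM.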

> Indexing anything large is typically an I/O bound operation and
> when you start indexing much more than can fit in RAM, your
> performance will degrade dramatically, so it is probably slower
> by much more than a factor of 10.

Don't know. I'll grab last year's archive of -hackers (or another one,
if there's another you think would be more representative) and try that.
I can bring back figures for the time to create the entire archive (and
index), the time just to index, and the time to add a new message and
then reindex.
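
One way the three measurements above could be collected is sketched below.
It is only an illustration: time_command is a made-up helper, and the
commands passed to it would be whatever actually drives the conversion and
indexing, which the message does not spell out.

```python
import resource
import subprocess
import time

def time_command(argv):
    """Run argv to completion; return (real, user, sys) seconds, like time(1)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(argv, check=True)
    real = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # RUSAGE_CHILDREN accumulates over all waited-for children, so take the delta.
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return real, user, system
```

Calling it once per step (full archive build, index only, add one message
and reindex) with hypothetical wrapper scripts such as
time_command(["./build-archive.sh"]) would give three comparable sets of
figures.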

I'd try this with the whole of the archives, but I don't have the spare
disk space (yet).

> Three months of -hackers != 5 years of all the mailing lists.
> I am confident that you will find that this scheme becomes a big
> hairy hassle when you throw the whole thing at it.  

True enough. As I say, I'll try it and see. 

<snipped>

> The ranking algorithm that Glimpse uses (or used last I checked)
> is primitive. (In a survey of what people liked, hated and most
> wanted in the mailing list archives, people wanted thread
> searching and date sorting, but only second and third *after* the
> currently implemented ranking algorithm, which most people found
> to work very well most of the time.)

Are those survey results available online somewhere?

> It isn't that things like MHonArc are not valiant efforts, but
> they are merely refinements of what is fundamentally a
> quick-and-dirty, non-scalable solution.  As I hinted in another
> message, a proper solution would be based on a hybrid full
> text/RDBMS.  Whether a true hybrid system is built, or just the
> illusion is built using some crafty CGI scripts is a detail to be
> worked out. 

A hybrid system is on my list of things to build here (but it'll be 
Oracle based). I haven't investigated Postgres enough to know if it's
up to the task.
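
To make the "hybrid full text/RDBMS" idea a little more concrete, here is a
minimal sketch of the cheaper variant, where a relational table of message
metadata sits next to a simple inverted index so that a term search can be
sorted by date. Everything in it is an assumption for illustration: it uses
SQLite rather than Oracle or Postgres, the table and function names are
invented, and whitespace splitting stands in for a real full-text engine.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with message metadata and a term index."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE message (
            id INTEGER PRIMARY KEY, thread TEXT, sent TEXT, subject TEXT);
        CREATE TABLE posting (
            term TEXT, message_id INTEGER REFERENCES message(id));
        CREATE INDEX posting_term ON posting(term);
    """)
    return db

def add_message(db, thread, sent, subject, body):
    """Store the metadata row, then one posting per distinct lowercased word."""
    cur = db.execute(
        "INSERT INTO message (thread, sent, subject) VALUES (?, ?, ?)",
        (thread, sent, subject))
    terms = set((subject + " " + body).lower().split())
    db.executemany("INSERT INTO posting VALUES (?, ?)",
                   [(term, cur.lastrowid) for term in terms])

def search(db, term):
    """Return subjects of messages containing term, newest first."""
    rows = db.execute(
        "SELECT m.subject FROM message m "
        "JOIN posting p ON p.message_id = m.id "
        "WHERE p.term = ? ORDER BY m.sent DESC", (term.lower(),))
    return [row[0] for row in rows]
```

Because the dates are stored as ISO strings (for example "1998-03-30"),
ORDER BY gives date sorting for free, and adding "AND m.thread = ?" to the
query would restrict a search to one thread.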

N
-- 
Work: nik@iii.co.uk                       | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk    | Remind me again why we need
Play: nik@freebsd.org                     | Microsoft?
