Date: Mon, 30 Mar 1998 16:40:24 +0100
From: nik@iii.co.uk
To: John Fieber <jfieber@indiana.edu>
Cc: shimon@simon-shapiro.org, Wolfram Schneider <wosch@cs.tu-berlin.de>, freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami <asami@FreeBSD.ORG>, Amancio Hasty <hasty@rah.star-gate.com>
Subject: Re: Mailing list search interface
Message-ID: <19980330164024.47510@iii.co.uk>
In-Reply-To: <Pine.BSF.3.96.980330091604.485T-100000@fallout.campusview.indiana.edu>; from John Fieber on Mon, Mar 30, 1998 at 09:48:45AM -0500
References: <19980330110200.17368@iii.co.uk> <Pine.BSF.3.96.980330091604.485T-100000@fallout.campusview.indiana.edu>
On Mon, Mar 30, 1998 at 09:48:45AM -0500, John Fieber wrote:
> > At the end of the conversion (which consisted of running MHonArc 2.2.0
> > over the files, and then using Glimpse 4.1 to index them) I had a total
> > of 32,910K HTML and index files.
> >
> > The output of 'time -l' on the conversion process was:
> >
> >       626.11 real       438.83 user        93.13 sys
>
> On what sort of hardware?

200 MHz PPro w/64MB of RAM and 256MB of swap. At the time I was running
XFree86 3.3.2, Netscape, Xemacs and a dozen or so xterms (tcsh, mutt,
slrn). Load hovered around the .9-1.1 mark. Interactive response was
fine.

My disk is a single 2GB Atlas II, with tagged queuing turned *off*
(because of buggy firmware which I haven't updated yet).

> By quick back-of-an-envelope calculations, this is slower than
> the current indexing scheme on hub by at least a factor of 10.

The time above was for creation of the HTML archives and for indexing,
not just indexing alone.

> Indexing anything large is typically an I/O bound operation and
> when you start indexing much more than can fit in RAM, your
> performance will degrade dramatically, so it is probably slower
> by much more than a factor of 10.

Don't know. I'll grab last year's archive of -hackers (or another one,
if there's another you think would be more representative) and try that.

I can bring back figures for the time to create the entire archive (and
index), the time just to index, and the time to add a new message and
then reindex.

I'd try this with the whole of the archives, but I don't have the spare
disk space (yet).

> Three months of -hackers != 5 years of all the mailing lists.
> I am confident that you will find that this scheme becomes a big
> hairy hassle when you throw the whole thing at it.

True enough. As I say, I'll try it and see.

<snipped>

> The ranking algorithm that Glimpse uses (or used last I checked)
> is primitive. (In a survey of what people liked, hated and most
> wanted in the mailing list archives, people wanted thread
> searching and date sorting, but only second and third *after* the
> currently implemented ranking algorithm, which most people found
> to work very well most of the time.)

Are those survey results available online somewhere?

> It isn't that things like MHonArc are not valiant efforts, but
> they are merely refinements of what is fundamentally a
> quick-and-dirty, non-scalable solution. As I hinted in another
> message, a proper solution would be based on a hybrid full
> text/RDBMS. Whether a true hybrid system is built, or just the
> illusion is built using some crafty CGI scripts is a detail to be
> worked out.

A hybrid system is on my list of things to build here (but it'll be
Oracle based). I haven't investigated Postgres enough to know if it's
up to the task.

N
-- 
Work: nik@iii.co.uk                    | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need
Play: nik@freebsd.org                  | Microsoft?

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message
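
As a rough check on the timing figures quoted in the message above, the following Python sketch works out what share of the wall-clock time was CPU time and how fast output was produced. It uses only the numbers from the message (626.11 real, 438.83 user, 93.13 sys, 32,910K of output); the function and variable names are illustrative, and 32,910K is the size of the generated HTML and index files, not of the input mail.

```python
def summarize_run(real_s, user_s, sys_s, output_kb):
    """Return (cpu_fraction, output_kb_per_s) for one timed run.

    cpu_fraction is (user + sys) / real; a value near 1.0 means the
    process spent little of its elapsed time waiting (e.g. on disk).
    """
    cpu_fraction = (user_s + sys_s) / real_s
    output_kb_per_s = output_kb / real_s
    return cpu_fraction, output_kb_per_s


# Figures taken from the 'time -l' output and file total quoted above.
cpu_fraction, output_rate = summarize_run(626.11, 438.83, 93.13, 32910)

print(f"CPU share of wall-clock time: {cpu_fraction:.0%}")
print(f"Output produced per second:   {output_rate:.1f} KB/s")
```

If the arithmetic is right, most of the elapsed time in this run was CPU time rather than waiting, which hints that this small test was not dominated by disk I/O. It says nothing about how the same scheme behaves once the data no longer fits in RAM, which is the case John raises.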
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?19980330164024.47510>