Date: Mon, 30 Mar 1998 10:26:29 -0800 (PST)
From: Simon Shapiro <shimon@simon-shapiro.org>
To: nik@iii.co.uk
Cc: Amancio Hasty <hasty@rah.star-gate.com>, Satoshi Asami <asami@FreeBSD.ORG>, scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider <wosch@cs.tu-berlin.de>
Subject: RE: Mailing list search interface
Message-ID: <XFMail.980330102629.shimon@simon-shapiro.org>
In-Reply-To: <19980330110200.17368@iii.co.uk>

I have no strong opinions in this matter.  My experience with indexing
methods residing in (essentially) flat Unix files is that they do not
scale well.  This is what database engines are for.

Truth must be told, currently PostgreSQL uses Unix files to store its
indices and tables, so performance is not all that it could be.  I am
working on building a raw device storage manager for PostgreSQL, which
will allow shared access (cluster-like) and much faster speed.

The only issue I have not settled on is how to search the message bodies.
Maybe I will get some free time soon and try a few things.

What is an acceptable search rate?  For header type data?  For body regex?

BTW, if your project is almost ready, go ahead with it.  It does not
conflict at all with what I am thinking of.

On 30-Mar-98 nik@iii.co.uk wrote:
> Gents,
>
> On Sun, Mar 29, 1998 at 01:57:30PM -0800, Simon Shapiro wrote:
>> On 26-Mar-98 Wolfram Schneider wrote:
>> > The FreeBSD mailing list search interface supports threads. The
>> > thread database will be updated hourly. Of course there are
>> > many things to do to make the threads more user friendly.
>>
>> We have been playing with the idea of normalizing the archive into an
>> RDBMS.  Some of the benefits are:
>
> <snip>
>
> Could we coordinate on some of this? I've been working on a system (at
> work) for making some of our mailing list archives visible and searchable
> on our internal site. I'm using MHonArc, Glimpse (both of which are in
> the ports tree) and a customised version of Wilma
>
>   <URL:http://www.hpc.uh.edu/majordomo/#wilma>
>
> and it's almost at the point where this would be useful for the project.
>
> I mentioned MHonArc to Jordan, and his first response was
>
>> Eeek! The evil MHonArc resurfaces! ;-)
>>
>> It doesn't scale at all well - just try MHonArc'ing a really big mailing
>> list archive.  You soon get a set of monster html files that are
>> essentially unusable - I know, I did the short-lived "FreeBSD Docs"
>> CD for awhile using MHonArc.
>
> I think he's been using an older version of MHonArc. I did some tests
> late last week, archiving and indexing the archives for -hackers from
> the beginning of 1998. That's 11,265K or thereabouts.
>
> At the end of the conversion (which consisted of running MHonArc 2.2.0
> over the files, and then using Glimpse 4.1 to index them) I had a total
> of 32,910K HTML and index files.
>
> The output of 'time -l' on the conversion process was:
>
>       626.11 real       438.83 user        93.13 sys
>         8572  maximum resident set size
>          390  average shared memory size
>         4311  average unshared data size
>          128  average unshared stack size
>      1054806  page reclaims
>           68  page faults
>            0  swaps
>         9725  block input operations
>         6115  block output operations
>            0  messages sent
>            0  messages received
>            0  signals received
>        18065  voluntary context switches
>        26547  involuntary context switches
>
> That's a reasonably exceptional time, because it had to build the archive
> for the year to date, and you only take this hit once. Once the archive
> is up and running, you're only building HTML files for new messages since
> the last update, which is (or should be) considerably faster.
>
> Regrettably at the moment, there's a bug in Glimpse 4.1, which means that
> you need to reindex the entire archive, rather than just those bits that
> change. Fortunately, there are command line switches to tell the
> glimpseindex program how much memory to use.
>
> That 8572 max. resident size figure is from MHonArc rather than glimpse,
> since it reads in (as far as I can tell) the whole of the mail archive
> file before processing it.
>
> While the conversion was happening the load on my machine hovered around
> the .9-1.1 mark. With X, Netscape, XEmacs and a bunch of xterms open.
>
> At the end of the conversion process I had a threaded copy of the
> -hackers mail archives going back almost three months.
>
> Each month has two indices -- a date index where you see all the messages
> in the order they came in, and a threaded index.
>
> Each index shows (at most) 200 messages (that's a configurable number).
> This is so the size of the index files doesn't grow without end. Each
> index shows a "This is page x of y of the threaded index" comment, with
> navigation text to go backwards and forwards in the index.
>
> This whole thing is searchable, allowing searches by combination of
> keywords. You can specify the number of misspellings to allow, the
> number of hits to return, case sensitivity, and which months to restrict
> your search to.
>
> The only thing you can't do (at the moment) is search across more than
> one mailing list. It shouldn't be too hard to add. Right now, I don't
> have a URL I can give to show you the results, since I ran out of time
> last night (I must be getting old, I used to be able to do 72 hour coding
> runs and not really feel it <sigh>). I should be able to get something
> demonstrable up on my freefall account by the middle of next week.
>
> In light of all that, do you think this is worth pursuing further?
>
> Thoughts?
>
> N
> --
> Work: nik@iii.co.uk                    | FreeBSD + Perl + Apache
> Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need
> Play: nik@freebsd.org                  | Microsoft?

----------

Sincerely Yours,

Simon Shapiro                              Shimon@Simon-Shapiro.ORG

Voice:   503.799.2313

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?XFMail.980330102629.shimon>