From owner-freebsd-database Mon Mar 30 02:02:47 1998
Return-Path:
Received: (from majordom@localhost)
        by hub.freebsd.org (8.8.8/8.8.8) id CAA06408
        for freebsd-database-outgoing; Mon, 30 Mar 1998 02:02:47 -0800 (PST)
        (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230])
        by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id CAA06402;
        Mon, 30 Mar 1998 02:02:44 -0800 (PST)
        (envelope-from nik@iii.co.uk)
From: nik@iii.co.uk
Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25])
        by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id LAA10864;
        Mon, 30 Mar 1998 11:02:14 +0100 (BST)
Received: (from nik@localhost)
        by carrig.strand.iii.co.uk (8.8.8/8.8.7) id LAA06601;
        Mon, 30 Mar 1998 11:02:01 +0100 (BST)
Message-ID: <19980330110200.17368@iii.co.uk>
Date: Mon, 30 Mar 1998 11:02:00 +0100
To: shimon@simon-shapiro.org
Cc: Wolfram Schneider , freebsd-database@FreeBSD.ORG,
        andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami ,
        Amancio Hasty
Subject: Mailing list search interface
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.85e
In-Reply-To: ; from Simon Shapiro on Sun, Mar 29, 1998 at 01:57:30PM -0800
Organization: interactive investor
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

Gents,

On Sun, Mar 29, 1998 at 01:57:30PM -0800, Simon Shapiro wrote:
> On 26-Mar-98 Wolfram Schneider wrote:
> > The FreeBSD mailing list search interface support threads. The
> > thread database will be updated hourly. Of course there are
> > many things to do to make the threads more user friendly.
>
> We have been playing with the idea of normalizing the archive into an
> RDBMS. Some of the benefits are:

Could we coordinate on some of this? I've been working on a system (at
work) for making some of our mailing list archives visible and
searchable on our internal site.
I'm using MHonArc, Glimpse (both of which are in the ports tree) and a
customised version of Wilma and it's almost at the point where this
would be useful for the project.

I mentioned MHonArc to Jordan, and his first response was:

> Eeek! The evil MHonArc resurfaces! ;-)
>
> It doesn't scale at all well - just try MHonArc'ing a really big mailing
> list archive. You soon get a set of monster html files that are
> essentially unusable - I know, I did the short-lived "FreeBSD Docs"
> CD for awhile using MHonArc.

I think he's been using an older version of MHonArc.

I did some tests late last week, archiving and indexing the archives
for -hackers from the beginning of 1998. That's 11,265K or thereabouts.

At the end of the conversion (which consisted of running MHonArc 2.2.0
over the files, and then using Glimpse 4.1 to index them) I had a total
of 32,910K HTML and index files.

The output of 'time -l' on the conversion process was:

      626.11 real       438.83 user        93.13 sys
        8572  maximum resident set size
         390  average shared memory size
        4311  average unshared data size
         128  average unshared stack size
     1054806  page reclaims
          68  page faults
           0  swaps
        9725  block input operations
        6115  block output operations
           0  messages sent
           0  messages received
           0  signals received
       18065  voluntary context switches
       26547  involuntary context switches

That's a reasonably exceptional time, because it had to build the
archive for the year to date, and you only take this hit once. Once the
archive is up and running, you're only building HTML files for new
messages since the last update, which is (or should be) considerably
faster.

Regrettably at the moment, there's a bug in Glimpse 4.1, which means
that you need to reindex the entire archive, rather than just those
bits that change. Fortunately, there are command line switches to tell
the glimpseindex program how much memory to use. That 8572 max.
resident size figure is from MHonArc rather than Glimpse, since it
reads in (as far as I can tell) the whole of the mail archive file
before processing it.

While the conversion was happening the load on my machine hovered
around the .9-1.1 mark, with X, Netscape, XEmacs and a bunch of xterms
open.

At the end of the conversion process I had a threaded copy of the
-hackers mail archives going back almost three months. Each month has
two indices -- a date index where you see all the messages in the order
they came in, and a threaded index. Each index shows (at most) 200
messages (that's a configurable number). This is so the size of the
index files doesn't grow without end. Each index shows a "This is page
x of y of the threaded index" comment, with navigation text to go
backwards and forwards in the index.

This whole thing is searchable, allowing searches by combinations of
keywords. You can specify the number of misspellings to allow, the
number of hits to return, case sensitivity, and which months to
restrict your search to.

The only thing you can't do (at the moment) is search across more than
one mailing list. It shouldn't be too hard to add.

Right now, I don't have a URL I can give to show you the results, since
I ran out of time last night (I must be getting old, I used to be able
to do 72 hour coding runs and not really feel it). I should be able to
get something demonstrable up on my freefall account by the middle of
next week.

In light of all that, do you think this is worth pursuing further?

Thoughts?

N
-- 
Work: nik@iii.co.uk                    | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need
Play: nik@freebsd.org                  | Microsoft?

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message