From owner-freebsd-database Tue Mar 31 04:49:50 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id EAA08645 for freebsd-database-outgoing; Tue, 31 Mar 1998 04:49:50 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id EAA08639 for ; Tue, 31 Mar 1998 04:49:47 -0800 (PST) (envelope-from nik@iii.co.uk) From: nik@iii.co.uk Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25]) by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id NAA10043; Tue, 31 Mar 1998 13:49:45 +0100 (BST) Received: (from nik@localhost) by carrig.strand.iii.co.uk (8.8.8/8.8.7) id NAA08533; Tue, 31 Mar 1998 13:49:29 +0100 (BST) Message-ID: <19980331134928.03855@iii.co.uk> Date: Tue, 31 Mar 1998 13:49:28 +0100 To: shimon@simon-shapiro.org Cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980330164024.47510@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.85e In-Reply-To: ; from Simon Shapiro on Mon, Mar 30, 1998 at 12:25:00PM -0800 Organization: interactive investor Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 12:25:00PM -0800, Simon Shapiro wrote: > > The time above was for creation of the HTML archives and for indexing, > > not just indexing alone. > > This is something we need to keep in mind. Generating 100% output coverage > for (probably) less than 10% need is wasteful. True enough. I'm probably going to shut up in a second, because this is starting to get out of my depth. My background in this is as follows; at work I run a number of mailing lists (~1000 subscribers across all of them, and ~150 messages a day tops, so this is not close to the level of the FreeBSD lists). Getting a publically available archive has been on my to do list for a while. The quick solution[1] has involved MHonArc, Glimpse and Wilma (as the glue between the two of them). As a part of that implementation, I've had to make a bunch of changes to Wilma to make it more generic and remove some of the assumptions in the code. At the back of my mind was the thought that this would be quite useful for the FreeBSD site, which is about where I started paying attention to the messages in the pgaccess thread (!) about the mailing list search. It certainly looks like Wolfram and co. have been making a number of changes to the list archive software to make it more useful (and as John says, things like MHonArc and Glimpse will not scale well, although they're fine for my current problem) > > I'd try this with the whole of the archives, but I don't have the spare > > disk space (yet). > > I have. Is there an efficient way to get the whole archive here? > Downloading on a modem is NOT considered efficient. Don't know. I FTP it down at work, drop it on a Zip disk and take it home. > > A hybrid system is on my list of things to build here (but it'll be > > Oracle based). I haven't investigated Postgres enough to know if it's > > up to the task. > > Oracle based is good. Now, plase tell us how to run Oracle on FreeBSD, > legally, and with source available. You can't (yet). But should I get the chance to implement something Oracle based at work, I should be able to at least apply that learning to reimplementing it in Postgres. > PostgreSQL is up to the task. This is not a dramatically complex database > problem. Pretty much a linear table, with the text searching TBD. My initial thoughts on this (and I have done *very* little reading on the subject at the moment) is to use the database only for the threading information. Store all the message information somewhere else, with an efficient way to a) retrieve a single message from the message archive (based on a unique key, possibly the message-id (although I'm not 100% certain that that's guaranteed to be unique)). b) Search the entire message archive, returning the message keys that match the search. In the (Oracle|Postgres) database, have one table per archive that has the columns: ID - Unique ID for this message PARENT - ID of the parent message in this thread, NULL if this is the first message in the thread. STRICT - Is this in the thread because of strict threading (a references or in-reply-to line was present in the original message) or because the subject line matched? AUTHOR SUBJECT Given a particular message ID, you could then construct the thread tree from that message down (in Oracle) with SQL similar to: select ID, AUTHOR, SUBJECT from ARCHIVE_HACKERS connect by PARENT = prior ID start with ID = 'message_id'; I haven't tested that. As I say, I'm still in the very early planning stages of this, and have a lot of reading (and thinking) to do first. N [1] I don't like implementing 'quick solutions' any more than the rest of you, but I was outvoted on this one. -- Work: nik@iii.co.uk | FreeBSD + Perl + Apache Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need Play: nik@freebsd.org | Microsoft? To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message