From owner-freebsd-database Sun Mar 29 20:10:57 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA18878 for freebsd-database-outgoing; Sun, 29 Mar 1998 20:10:57 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id UAA18715 for ; Sun, 29 Mar 1998 20:10:16 -0800 (PST) (envelope-from shimon@simon-shapiro.org)
Received: (qmail 8431 invoked from network); 30 Mar 1998 01:32:38 -0000
Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 01:32:38 -0000
Message-ID:
X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To:
Date: Sun, 29 Mar 1998 17:32:38 -0800 (PST)
Reply-To: shimon@simon-shapiro.org
Organization: The Simon Shapiro Foundation
From: Simon Shapiro
To: The Hermit Hacker
Subject: Re: PgAccess small bug fix
Cc: PostgreSQL Interfaces
Cc: PostgreSQL Interfaces , freebsd-database@FreeBSD.ORG
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On 30-Mar-98 The Hermit Hacker wrote:
..
> I don't touch the FreeBSD ports collection, mainly because I
> haven't had time to figure it out :( The only CVS repository that I
> update as far as PostgreSQL is concerned is the PostgreSQL one...

The ports collection is actually very simple and very nice. took me all
but one day to learn it. The .mk files driving it are a different story
altogether :-)

I will remember that. Thanx. Will pull it out soon.

What your note implies, that it is in the cvsup server, sort of
contradicts the FreeBSD porter who states that pgaccess, separate from
postgresql, is fresher, newer, and better. If he is wrong (could be), then
we might want to merge the two packages.
----------

Sincerely Yours,

Simon Shapiro
Shimon@Simon-Shapiro.ORG Voice: 503.799.2313

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message

From owner-freebsd-database Sun Mar 29 20:18:12 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id TAA09088 for freebsd-database-outgoing; Sun, 29 Mar 1998 19:46:53 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id TAA08734 for ; Sun, 29 Mar 1998 19:46:02 -0800 (PST) (envelope-from shimon@simon-shapiro.org)
Received: (qmail 29533 invoked from network); 30 Mar 1998 02:54:52 -0000
Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 02:54:52 -0000
Message-ID:
X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To:
Date: Sun, 29 Mar 1998 18:54:52 -0800 (PST)
Reply-To: shimon@simon-shapiro.org
Organization: The Simon Shapiro Foundation
From: Simon Shapiro
To: The Hermit Hacker
Subject: Re: PgAccess small bug fix
Cc: freebsd-database@FreeBSD.ORG
Cc: freebsd-database@FreeBSD.ORG, PostgreSQL Interfaces
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On 30-Mar-98 The Hermit Hacker wrote:
..
> Nope...pgaccess is found at www.flex.ro/pgaccess(?) ... we have
> it included, also, as part of our src/bin directory, but, for the
> release,
> that only contains the newest version *at the time of* release...

OK, the release is infrequent.

> We update the CVSup server at www.postgresql.org periodically to
> keep up with Constantin's work, but it doesn't get updated in the
> 'release' distribution

but the ``-current'' is timely. Good. Thanx.
---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Sun Mar 29 20:39:41 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA01487 for freebsd-database-outgoing; Sun, 29 Mar 1998 20:39:41 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id UAA01434 for ; Sun, 29 Mar 1998 20:39:31 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 21226 invoked from network); 29 Mar 1998 21:41:55 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 29 Mar 1998 21:41:55 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <351A8E46.2B097E89@flex.ro> Date: Sun, 29 Mar 1998 13:41:54 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: Constantin Teodorescu Subject: Re: PgAccess small bug fix Cc: scrappy@hub.org, pgsql-ports@postgresql.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, hasty@rah.star-gate.com, PostgreSQL Interfaces Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 26-Mar-98 Constantin Teodorescu wrote: >> > When "Query design" window is on the screen, "exec" queries cannot be >> > executed. >> > Save the query and then "Open" it. It will exec the query. > > Not any more :-) > > The new version, 0.85 that I have fixed this evening takes the > appropriate action for "Execute query" button in the design mode. > > If it is a "SELECT ..." 
it will open the query/table viewer, if it's an
> action query (Update, Delete, Grant, Insert ...) it will "pg_exec" the
> query.

I am a bit confused. Where is this new version available from? Will it be
in the PostgreSQL cvsup tree? Do you maintain it elsewhere? For me a
cvsup to pgsql is the best.

----------

Sincerely Yours,

Simon Shapiro
Shimon@Simon-Shapiro.ORG Voice: 503.799.2313

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message

From owner-freebsd-database Sun Mar 29 20:41:27 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA02113 for freebsd-database-outgoing; Sun, 29 Mar 1998 20:41:27 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from thelab.hub.org (tc-43.acadiau.ca [131.162.2.143]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA02066 for ; Sun, 29 Mar 1998 20:41:17 -0800 (PST) (envelope-from scrappy@hub.org)
Received: from localhost (scrappy@localhost) by thelab.hub.org (8.8.8/8.8.2) with SMTP id WAA00508; Sun, 29 Mar 1998 22:01:26 -0400 (AST)
X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs
Date: Sun, 29 Mar 1998 22:01:26 -0400 (AST)
From: The Hermit Hacker
To: Simon Shapiro
cc: PostgreSQL Interfaces , PostgreSQL Interfaces , freebsd-database@FreeBSD.ORG
Subject: Re: PgAccess small bug fix
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On Sun, 29 Mar 1998, Simon Shapiro wrote:

>
> On 30-Mar-98 The Hermit Hacker wrote:
>
> ..
>
> > I don't touch the FreeBSD ports collection, mainly because I
> > haven't had time to figure it out :( The only CVS repository that I
> > update as far as PostgreSQL is concerned is the PostgreSQL one...
>
> The ports collection is actually very simple and very nice. took me all
> but one day to learn it.
The .mk files driving it are a different story
> altogether :-)
>
> I will remember that. Thanx. Will pull it out soon.
>
> What your note implies, that it is in the cvsup server, sort of
> contradicts the FreeBSD porter who states that pgaccess, separate from
> postgresql, is fresher, newer, and better. If he is wrong (could be), then
> we might want to merge the two packages.

Nope...pgaccess is found at www.flex.ro/pgaccess(?) ... we have
it included, also, as part of our src/bin directory, but, for the release,
that only contains the newest version *at the time of* release...

We update the CVSup server at www.postgresql.org periodically to
keep up with Constantin's work, but it doesn't get updated in the
'release' distribution

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message

From owner-freebsd-database Sun Mar 29 20:41:45 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA02190 for freebsd-database-outgoing; Sun, 29 Mar 1998 20:41:45 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from thelab.hub.org (tc-43.acadiau.ca [131.162.2.143]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA02091 for ; Sun, 29 Mar 1998 20:41:23 -0800 (PST) (envelope-from scrappy@hub.org)
Received: from localhost (scrappy@localhost) by thelab.hub.org (8.8.8/8.8.2) with SMTP id VAA00398; Sun, 29 Mar 1998 21:04:56 -0400 (AST)
X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs
Date: Sun, 29 Mar 1998 21:04:55 -0400 (AST)
From: The Hermit Hacker
To: Simon Shapiro
cc: freebsd-database@FreeBSD.ORG, PostgreSQL Interfaces
Subject: Re: PgAccess small bug fix
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On Sun, 29 Mar 1998, Simon
Shapiro wrote: > > On 29-Mar-98 The Hermit Hacker wrote: > ... > > >> I am a bit confused. Where is this new version available form? Will it > >> be in the PostgreSQL cvsup tree? Do you maintain it elsewhere? For me a > >> cvsup to pgsql is the best. > > > > Updating the CVS repository right now...check it in the next hour > > or so... > > Hate to be a nag. Postgres CVS, or FreeBSD CVS. for Postgres stuff, I > prefer Postgres CVS :-) I don't touch the FreeBSD ports collection, mainly because I haven't had time to figure it out :( The only CVS repository that I update as far as PostgreSQL is concerned is the PostgreSQL one... Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Sun Mar 29 20:41:49 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA02206 for freebsd-database-outgoing; Sun, 29 Mar 1998 20:41:49 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from thelab.hub.org (tc-43.acadiau.ca [131.162.2.143]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA02160 for ; Sun, 29 Mar 1998 20:41:40 -0800 (PST) (envelope-from scrappy@hub.org) Received: from localhost (scrappy@localhost) by thelab.hub.org (8.8.8/8.8.2) with SMTP id RAA07274; Sun, 29 Mar 1998 17:44:53 -0400 (AST) X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs Date: Sun, 29 Mar 1998 17:44:50 -0400 (AST) From: The Hermit Hacker To: Simon Shapiro cc: freebsd-database@FreeBSD.ORG, PostgreSQL Interfaces Subject: Re: PgAccess small bug fix In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Sun, 29 Mar 1998, Simon Shapiro wrote: > > On 26-Mar-98 Constantin Teodorescu wrote: > >> > When "Query design" window is on the 
screen, "exec" queries cannot be > >> > executed. > >> > Save the query and then "Open" it. It will exec the query. > > > > Not any more :-) > > > > The new version, 0.85 that I have fixed this evening takes the > > appropriate action for "Execute query" button in the design mode. > > > > If it is a "SELECT ..." it will open the query/table viewer, if it's an > > action query (Update, Delete, Grant, Insert ...) it will "pg_exec" the > > query. > > I am a bit confused. Where is this new version available form? Will it be > in the PostgreSQL cvsup tree? Do you maintain it elsewhere? For me a > cvsup to pgsql is the best. Updating the CVS repository right now...check it in the next hour or so... Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Sun Mar 29 20:55:08 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA06285 for freebsd-database-outgoing; Sun, 29 Mar 1998 20:55:08 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id UAA06269 for ; Sun, 29 Mar 1998 20:55:03 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 21426 invoked from network); 29 Mar 1998 21:57:30 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 29 Mar 1998 21:57:30 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Sun, 29 Mar 1998 13:57:30 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro 
To: Wolfram Schneider
Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update
Cc: freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, (Satoshi Asami) , Amancio Hasty
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On 26-Mar-98 Wolfram Schneider wrote:
..
> The FreeBSD mailing list search interface supports threads. The
> thread database will be updated hourly. Of course there are
> many things to do to make the threads more user friendly.

We have been playing with the idea of normalizing the archive into an
RDBMS. Some of the benefits are:

* No need to update the threads database. It will always be up to date.
* Users can easily create their own thread logic with no impact on
  system performance.
* Searching on normalized fields is many times faster, and much less
  costly in system resources.

What I am missing is a definition of what functionality is
necessary/existing/desired. The only information I have is from
observing the UI and guessing what common sense would dictate.
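
As a rough illustration of the normalization idea above, the following
sketch stores one row per message and derives threads from the Message-ID
and In-Reply-To headers at query time, so there is no separate thread
database to rebuild. Everything in it is an assumption made for
illustration: the table and column names, the helper functions, and the use
of SQLite through Python's sqlite3 module as a stand-in for whatever RDBMS
would actually be chosen. It is not a schema anyone in this thread proposed.

```python
import sqlite3

# Hypothetical schema: one row per archived message, keyed by Message-ID.
SCHEMA = """
CREATE TABLE message (
    message_id  TEXT PRIMARY KEY,  -- Message-ID header
    in_reply_to TEXT,              -- In-Reply-To header, NULL for a thread root
    list_name   TEXT NOT NULL,
    sender      TEXT NOT NULL,
    subject     TEXT NOT NULL,
    sent_at     TEXT NOT NULL,     -- ISO 8601 timestamp, sorts as text
    body        TEXT NOT NULL
);
CREATE INDEX message_parent ON message (in_reply_to);
CREATE INDEX message_sender ON message (sender);
"""

# Walk the reply chain downwards from one root message.
THREAD_QUERY = """
WITH RECURSIVE thread(message_id, depth) AS (
    SELECT message_id, 0 FROM message WHERE message_id = ?
    UNION ALL
    SELECT m.message_id, t.depth + 1
    FROM message AS m JOIN thread AS t ON m.in_reply_to = t.message_id
)
SELECT t.depth, m.sender, m.subject
FROM thread AS t JOIN message AS m ON m.message_id = t.message_id
ORDER BY m.sent_at
"""


def open_archive(path=":memory:"):
    """Create an empty archive database (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_message(conn, message_id, in_reply_to, list_name,
                sender, subject, sent_at, body):
    """Insert one message; the thread view needs no separate update step."""
    conn.execute(
        "INSERT INTO message VALUES (?, ?, ?, ?, ?, ?, ?)",
        (message_id, in_reply_to, list_name, sender, subject, sent_at, body),
    )


def thread(conn, root_id):
    """Return (depth, sender, subject) rows for the thread rooted at root_id."""
    return conn.execute(THREAD_QUERY, (root_id,)).fetchall()
```

Because threading here is just a query, a different threading rule (for
example, grouping by normalized subject) would be another SELECT rather
than another index rebuild. Free-text search over the body column is a
separate problem that this sketch does not address.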
---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Sun Mar 29 21:27:00 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id VAA11842 for freebsd-database-outgoing; Sun, 29 Mar 1998 21:27:00 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id VAA11835 for ; Sun, 29 Mar 1998 21:26:57 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 21799 invoked from network); 29 Mar 1998 22:29:24 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 29 Mar 1998 22:29:24 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Sun, 29 Mar 1998 14:29:24 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: The Hermit Hacker Subject: Re: PgAccess small bug fix Cc: freebsd-database@FreeBSD.ORG Cc: freebsd-database@FreeBSD.ORG, PostgreSQL Interfaces Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 29-Mar-98 The Hermit Hacker wrote: ... >> I am a bit confused. Where is this new version available form? Will it >> be in the PostgreSQL cvsup tree? Do you maintain it elsewhere? For me a >> cvsup to pgsql is the best. > > Updating the CVS repository right now...check it in the next hour > or so... Hate to be a nag. Postgres CVS, or FreeBSD CVS. 
for Postgres stuff, I prefer Postgres CVS :-) ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 01:02:10 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id BAA29345 for freebsd-database-outgoing; Mon, 30 Mar 1998 01:02:10 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from isbalham.ist.co.uk (isbalham.ist.co.uk [192.31.26.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id BAA29337 for ; Mon, 30 Mar 1998 01:02:08 -0800 (PST) (envelope-from rb@gid.co.uk) Received: from gid.co.uk (uucp@localhost) by isbalham.ist.co.uk (8.8.7/8.8.4) with UUCP id KAA12423; Mon, 30 Mar 1998 10:00:49 +0100 (BST) Received: from [194.32.164.2] by seagoon.gid.co.uk; Mon, 30 Mar 1998 09:54:00 +0100 (BST) X-Sender: rb@194.32.164.1 Message-Id: In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 30 Mar 1998 09:50:48 +0100 To: shimon@simon-shapiro.org, Wolfram Schneider From: Bob Bishop Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, (Satoshi Asami) , Amancio Hasty Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk At 10:57 pm +0100 29/3/98, Simon Shapiro wrote: >.. >We have been playing with the idea of normalizing the archive into an >RDBMS. [etc] Great, but to be useful you need free-text search too. 
-- Bob Bishop (0118) 977 4017 international code +44 118 rb@gid.co.uk fax (0118) 989 4254 between 0800 and 1800 UK To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 02:02:47 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id CAA06408 for freebsd-database-outgoing; Mon, 30 Mar 1998 02:02:47 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id CAA06402; Mon, 30 Mar 1998 02:02:44 -0800 (PST) (envelope-from nik@iii.co.uk) From: nik@iii.co.uk Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25]) by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id LAA10864; Mon, 30 Mar 1998 11:02:14 +0100 (BST) Received: (from nik@localhost) by carrig.strand.iii.co.uk (8.8.8/8.8.7) id LAA06601; Mon, 30 Mar 1998 11:02:01 +0100 (BST) Message-ID: <19980330110200.17368@iii.co.uk> Date: Mon, 30 Mar 1998 11:02:00 +0100 To: shimon@simon-shapiro.org Cc: Wolfram Schneider , freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami , Amancio Hasty Subject: Mailing list search interface References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.85e In-Reply-To: ; from Simon Shapiro on Sun, Mar 29, 1998 at 01:57:30PM -0800 Organization: interactive investor Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk Gents, On Sun, Mar 29, 1998 at 01:57:30PM -0800, Simon Shapiro wrote: > On 26-Mar-98 Wolfram Schneider wrote: > > The FreeBSD mailing list search interface support threads. The > > thread database will be updated hourly. Of course there are > > many things to do to make the threads more user friendly. > > We have been playing with the idea of normalizing the archive into an > RDBMS. Some of the benefits are: Could we coordinate on some of this? 
I've been working on a system (at work) for making some of our mailing list
archives visible and searchable on our internal site. I'm using MHonArc,
Glimpse (both of which are in the ports tree) and a customised version of
Wilma and it's almost at the point where this would be useful for the
project.

I mentioned MHonArc to Jordan, and his first response was

> Eeek! The evil MHonArc resurfaces! ;-)
>
> It doesn't scale at all well - just try MHonArc'ing a really big mailing
> list archive. You soon get a set of monster html files that are
> essentially unusable - I know, I did the short-lived "FreeBSD Docs"
> CD for awhile using MHonArc.

I think he's been using an older version of MHonArc.

I did some tests late last week, archiving and indexing the archives for
-hackers from the beginning of 1998. That's 11,265K or thereabouts.

At the end of the conversion (which consisted of running MHonArc 2.2.0 over
the files, and then using Glimpse 4.1 to index them) I had a total of
32,910K HTML and index files.

The output of 'time -l' on the conversion process was:

      626.11 real       438.83 user        93.13 sys
      8572  maximum resident set size
       390  average shared memory size
      4311  average unshared data size
       128  average unshared stack size
   1054806  page reclaims
        68  page faults
         0  swaps
      9725  block input operations
      6115  block output operations
         0  messages sent
         0  messages received
         0  signals received
     18065  voluntary context switches
     26547  involuntary context switches

That's a reasonably exceptional time, because it had to build the archive
for the year to date, and you only take this hit once. Once the archive is
up and running, you're only building HTML files for new messages since the
last update, which is (or should be) considerably faster.

Regrettably at the moment, there's a bug in Glimpse 4.1, which means that
you need to reindex the entire archive, rather than just those bits that
change. Fortunately, there are command line switches to tell the
glimpseindex program how much memory to use.
That 8572 max. resident size figure is from MHonArc rather than glimpse,
since it reads in (as far as I can tell) the whole of the mail archive file
before processing it.

While the conversion was happening the load on my machine hovered around
the .9-1.1 mark, with X, Netscape, XEmacs and a bunch of xterms open.

At the end of the conversion process I had a threaded copy of the -hackers
mail archives going back almost three months. Each month has two indices --
a date index where you see all the messages in the order they came in, and
a threaded index.

Each index shows (at most) 200 messages (that's a configurable number).
This is so the size of the index files doesn't grow without end. Each index
shows a "This is page x of y of the threaded index" comment, with
navigation text to go backwards and forwards in the index.

This whole thing is searchable, allowing searches by combination of
keywords. You can specify the number of misspellings to allow, the number
of hits to return, case sensitivity, and which months to restrict your
search to.

The only thing you can't do (at the moment) is search across more than one
mailing list. It shouldn't be too hard to add.

Right now, I don't have a URL I can give to show you the results, since I
ran out of time last night (I must be getting old, I used to be able to do
72 hour coding runs and not really feel it). I should be able to get
something demonstrable up on my freefall account by the middle of next
week.

In light of all that, do you think this is worth pursuing further?

Thoughts?

N
--
Work: nik@iii.co.uk                    | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need
Play: nik@freebsd.org                  | Microsoft?
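
The paging scheme described in the message above (at most 200 messages per
index page, each page labelled with its position) can be sketched as
follows. This is only an illustration of the idea, not MHonArc's
implementation; the function name, the dictionary keys and the prev/next
fields are hypothetical.

```python
from math import ceil


def paginate_index(entries, page_size=200, index_name="threaded"):
    """Split index entries into pages of at most page_size entries.

    Each page carries a "page x of y" label plus the 1-based numbers of
    the previous and next pages (None at either end) for navigation links.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    # An empty index still gets one (empty) page so there is something to show.
    total = max(1, ceil(len(entries) / page_size))
    pages = []
    for n in range(total):
        pages.append({
            "label": f"This is page {n + 1} of {total} of the {index_name} index",
            "prev": n if n > 0 else None,
            "next": n + 2 if n + 1 < total else None,
            "entries": entries[n * page_size:(n + 1) * page_size],
        })
    return pages
```

Capping the page size keeps each index file bounded no matter how large the
archive grows; only the number of pages increases.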
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 02:12:19 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id CAA07723 for freebsd-database-outgoing; Mon, 30 Mar 1998 02:12:19 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id CAA07685; Mon, 30 Mar 1998 02:12:15 -0800 (PST) (envelope-from nik@iii.co.uk) From: nik@iii.co.uk Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25]) by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id LAA11206; Mon, 30 Mar 1998 11:12:08 +0100 (BST) Received: (from nik@localhost) by carrig.strand.iii.co.uk (8.8.8/8.8.7) id LAA06619; Mon, 30 Mar 1998 11:11:56 +0100 (BST) Message-ID: <19980330111155.24790@iii.co.uk> Date: Mon, 30 Mar 1998 11:11:55 +0100 To: shimon@simon-shapiro.org Cc: Wolfram Schneider , freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami , Amancio Hasty Subject: Re: Mailing list search interface References: <19980330110200.17368@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.85e In-Reply-To: <19980330110200.17368@iii.co.uk>; from nik@iii.co.uk on Mon, Mar 30, 1998 at 11:02:00AM +0100 Organization: interactive investor Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 11:02:00AM +0100, nik@iii.co.uk wrote: > Gents, Mea culpa. That wasn't intended to go to the mailing list. Sorry folks. N -- Work: nik@iii.co.uk | FreeBSD + Perl + Apache Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need Play: nik@freebsd.org | Microsoft? 
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message

From owner-freebsd-database Mon Mar 30 02:18:20 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id CAA08344 for freebsd-database-outgoing; Mon, 30 Mar 1998 02:18:20 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from rah.star-gate.com (rah.star-gate.com [209.133.7.234]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id CAA08336; Mon, 30 Mar 1998 02:18:17 -0800 (PST) (envelope-from hasty@rah.star-gate.com)
Received: from rah.star-gate.com (localhost.star-gate.com [127.0.0.1]) by rah.star-gate.com (8.8.8/8.8.8) with ESMTP id CAA08193; Mon, 30 Mar 1998 02:17:31 -0800 (PST) (envelope-from hasty@rah.star-gate.com)
Message-Id: <199803301017.CAA08193@rah.star-gate.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: nik@iii.co.uk
cc: shimon@simon-shapiro.org, Wolfram Schneider , freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami
Subject: Re: Mailing list search interface
In-reply-to: Your message of "Mon, 30 Mar 1998 11:11:55 +0100." <19980330111155.24790@iii.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Mon, 30 Mar 1998 02:17:31 -0800
From: Amancio Hasty
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

Don't be shy, for I am sure that there are many who wish for a better
mailing list search engine and, who knows, perhaps a few more hackers
actually working on this problem!
Regards, Amancio To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 02:38:29 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id CAA11046 for freebsd-database-outgoing; Mon, 30 Mar 1998 02:38:29 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from mail.cs.tu-berlin.de (root@mail.cs.tu-berlin.de [130.149.17.13]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id CAA10997; Mon, 30 Mar 1998 02:38:00 -0800 (PST) (envelope-from wosch@cs.tu-berlin.de) Received: from caramba.cs.tu-berlin.de (wosch@caramba.cs.tu-berlin.de [130.149.17.12]) by mail.cs.tu-berlin.de (8.8.8/8.8.8) with ESMTP id MAA04474; Mon, 30 Mar 1998 12:31:38 +0200 (MET DST) Received: (from wosch@localhost) by caramba.cs.tu-berlin.de (8.8.8/8.8.8) id MAA03346; Mon, 30 Mar 1998 12:31:31 +0200 (MET DST) Message-ID: <19980330123130.39177@caramba.cs.tu-berlin.de> Date: Mon, 30 Mar 1998 12:31:30 +0200 From: Wolfram Schneider To: shimon@simon-shapiro.org, Wolfram Schneider Cc: freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami , Amancio Hasty Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: ; from Simon Shapiro on Sun, Mar 29, 1998 at 01:57:30PM -0800 Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 1998-03-29 13:57:30 -0800, Simon Shapiro wrote: > We have been playing with the idea of normalizing the archive into an > RDBMS. Some of the benefits are: > > * no need to update the threads database. It will always be updated. > * Users can create, easily, their own thread logic with no impact on > system performance. > * Searching on normalized fields are many times faster, and much less > costly in system resources. Some figures ... The FreeBSD mailing list archive is 620MB large. 
There are currently 270,000 messages. The archive grows by 100,000
messages/year.

If you plan to use a real SQL database, you should plan for at least
500,000 data sets, better 1 million. You need 2GB for the raw E-Mails
and 2-4GB for the index. I don't know if there are freely available
databases which can handle this much data.

That was the hardware part. You must hire a database expert, a Web
designer and a cgi script programmer. All of them should be willing to work
for at least 2-3 years on this project. This is not an easy task.

A full update of the thread database took 6 min on hub (Pentium Pro),
that's 100MB/min ;-) An update for the last week took 3-6 seconds.

--
Wolfram Schneider http://www.freebsd.org/~wosch/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message

From owner-freebsd-database Mon Mar 30 02:59:49 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id CAA13953 for freebsd-database-outgoing; Mon, 30 Mar 1998 02:59:49 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from rah.star-gate.com (rah.star-gate.com [209.133.7.234]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id CAA13947; Mon, 30 Mar 1998 02:59:46 -0800 (PST) (envelope-from hasty@rah.star-gate.com)
Received: from rah.star-gate.com (localhost.star-gate.com [127.0.0.1]) by rah.star-gate.com (8.8.8/8.8.8) with ESMTP id CAA08389; Mon, 30 Mar 1998 02:59:06 -0800 (PST) (envelope-from hasty@rah.star-gate.com)
Message-Id: <199803301059.CAA08389@rah.star-gate.com>
X-Mailer: exmh version 2.0.2 2/24/98
To: Wolfram Schneider
cc: shimon@simon-shapiro.org, freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami
Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update
In-reply-to: Your message of "Mon, 30 Mar 1998 12:31:30 +0200."
<19980330123130.39177@caramba.cs.tu-berlin.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Mon, 30 Mar 1998 02:59:05 -0800 From: Amancio Hasty Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk And if people do a decent job they may be able to sell the project, complete with OS and computer 8) Have Fun, Amancio > On 1998-03-29 13:57:30 -0800, Simon Shapiro wrote: > > We have been playing with the idea of normalizing the archive into an > > RDBMS. Some of the benefits are: > > > > * no need to update the threads database. It will always be updated. > > * Users can create, easily, their own thread logic with no impact on > > system performance. > > * Searching on normalized fields are many times faster, and much less > > costly in system resources. > > Some figures ... > > The FreeBSD mailing list archive is 620MB large. There are currently > 270,000 messages. The archive grow with 100,000 messages/year. > > If you plan to use a real SQL database, you should consider at least > 500,000 data sets, better 1 million. You need 2GB for the raw E-Mails > and 2-4GB for the index. I don't know if there are free available > databases which can handle this large data. > > That was the hardware part. You must hire a database expert, a Web > designer and a cgi script programmer. All people should be willing to work > for at least 2-3 years on this project. This is not an easy task. > > > A full update of the thread database took 6 min on hub (Pentium Pro), > thats 100MB/min ;-) An update for the last week took 3-6 seconds. 
> > -- > Wolfram Schneider http://www.freebsd.org/~wosch/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 06:07:05 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id GAA07039 for freebsd-database-outgoing; Mon, 30 Mar 1998 06:07:05 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id GAA07033; Mon, 30 Mar 1998 06:07:04 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id JAA06907; Mon, 30 Mar 1998 09:06:45 -0500 (EST) Date: Mon, 30 Mar 1998 09:06:45 -0500 (EST) From: John Fieber To: Wolfram Schneider cc: shimon@simon-shapiro.org, freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami , Amancio Hasty Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update In-Reply-To: <19980330123130.39177@caramba.cs.tu-berlin.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998, Wolfram Schneider wrote: > On 1998-03-29 13:57:30 -0800, Simon Shapiro wrote: > > We have been playing with the idea of normalizing the archive into an > > RDBMS. Some of the benefits are: > > > > * no need to update the threads database. It will always be updated. > > * Users can create, easily, their own thread logic with no impact on > > system performance. > > * Searching on normalized fields are many times faster, and much less > > costly in system resources. [snip] > If you plan to use a real SQL database, you should consider at least > 500,000 data sets, better 1 million. You need 2GB for the raw E-Mails > and 2-4GB for the index. 
I don't know if there are freely available > databases which can handle this much data. It has been well established for many years by professionals in database R&D that a traditional RDBMS is utterly and completely the wrong tool for free text searching. This turns out to be true even for some relatively structured data types like bibliographic records. There *are* some tasks in real-world applications that are RDBMS type things--a message-id based thread index is simple to implement for instance--so I'm all for hybrid systems. The big RDBMS vendors usually have some optional module optimized for free-text searching and some SQL extensions to access it. I've pondered writing such a module for postgres, but don't really know enough about extending postgres to know how well it would work. -john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 06:49:01 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id GAA12735 for freebsd-database-outgoing; Mon, 30 Mar 1998 06:49:01 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id GAA12723; Mon, 30 Mar 1998 06:48:58 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id JAA06977; Mon, 30 Mar 1998 09:48:45 -0500 (EST) Date: Mon, 30 Mar 1998 09:48:45 -0500 (EST) From: John Fieber To: nik@iii.co.uk cc: shimon@simon-shapiro.org, Wolfram Schneider , freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami , Amancio Hasty Subject: Re: Mailing list search interface In-Reply-To: <19980330110200.17368@iii.co.uk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender:
owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998 nik@iii.co.uk wrote: > I mentioned MHonArc to Jordan, and his first response was > > > Eeek! The evil MHonArc resurfaces! ;-) > > > > It doesn't scale at all well - just try MHonArc'ing a really big mailing > > list archive. You soon get a set of monster html files that are > > essentially unusable - I know, I did the short-lived "FreeBSD Docs" > > CD for awhile using MHonArc. Listen to the man! He knows what he is talking about...well, in this case at least. :) > I think he's been using an older version of MHonArc. I did some tests > late last week, archiving and indexing the archives for -hackers from > the beginning of 1998. That's 11,265K or thereabouts. > > At the end of the conversion (which consisted of running MHonArc 2.2.0 > over the files, and then using Glimpse 4.1 to index them) I had a total > of 32,910K HTML and index files. > > The output of 'time -l' on the conversion process was: > > 626.11 real 438.83 user 93.13 sys On what sort of hardware? By quick back-of-an-envelope calculations, this is slower than the current indexing scheme on hub by at least a factor of 10. Indexing anything large is typically an I/O bound operation and when you start indexing much more than can fit in RAM, your performance will degrade dramatically, so it is probably slower by much more than a factor of 10. It currently takes about 45 minutes to index all 620+ megabytes of mail from scratch on hub and most of that is waiting for disk i/o, since the disks on hub are pretty busy even without disk activity. > At the end of the conversion process I had a threaded copy of the -hackers > mail archives going back almost three months. Three months of -hackers != to 5 years of all the mailings lists. I am confident that you will find that this scheme becomes a big hairy hassle when you throw the whole thing at it. 
It is space inefficient because you have the original archive, plus the HTML versions (most of which will *never* be viewed I might add), the index, and the filesystem overhead of one file per message. Because the threading is done in batch mode, it is awkward to make enhancements to the threading algorithm. It is a hassle to retroactively change the conversion to HTML. Though I have no first-hand proof, knowing how Glimpse works, I suspect searches will generate quite a bit more disk I/O on the server than freeWAIS. The ranking algorithm that Glimpse uses (or used last I checked) is primitive. (In a survey of what people liked, hated and most wanted in the mailing list archives, people wanted thread searching and date sorting, but only second and third *after* the currently implemented ranking algorithm, which most people found to work very well most of the time.) And on and on... I think it is time to add an FAQ entry on why we don't use hypermail or MHonArc for the mailing list archives. It isn't that things like MHonArc are not valiant efforts, but they are merely refinements of what is fundamentally a quick-and-dirty, non-scalable solution. As I hinted in another message, a proper solution would be based on a hybrid full text/RDBMS. Whether a true hybrid system is built, or just the illusion is built using some crafty CGI scripts is a detail to be worked out.
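The RDBMS half of such a hybrid, the message-id based thread index, can be sketched in a few lines. What follows is only a minimal illustration in Python, with SQLite standing in for whatever database would actually be used. The table name `msg`, its columns, the helper names and the sample message-ids are all hypothetical, and free-text search is assumed to be handled by a separate engine.

```python
import sqlite3


def build_thread_index(messages):
    """Load (message_id, in_reply_to, subject) tuples into an in-memory table.

    The schema is an assumption made for this sketch, not an existing design.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE msg (
            message_id  TEXT PRIMARY KEY,
            in_reply_to TEXT,
            subject     TEXT
        )
        """
    )
    # Index the parent pointer so finding replies is a lookup, not a scan.
    db.execute("CREATE INDEX msg_parent ON msg (in_reply_to)")
    db.executemany("INSERT INTO msg VALUES (?, ?, ?)", messages)
    return db


def thread_of(db, root_id):
    """Return (message_id, depth) rows for the thread rooted at root_id."""
    rows = db.execute(
        """
        WITH RECURSIVE t(message_id, depth) AS (
            SELECT message_id, 0 FROM msg WHERE message_id = ?
            UNION ALL
            SELECT m.message_id, t.depth + 1
            FROM msg AS m JOIN t ON m.in_reply_to = t.message_id
        )
        SELECT message_id, depth FROM t ORDER BY depth, message_id
        """,
        (root_id,),
    )
    return rows.fetchall()


# Hypothetical sample data: one three-message thread and one unrelated message.
SAMPLE = [
    ("<a@example.org>", None, "Mailing list search interface"),
    ("<b@example.org>", "<a@example.org>", "Re: Mailing list search interface"),
    ("<c@example.org>", "<b@example.org>", "Re: Mailing list search interface"),
    ("<d@example.org>", None, "Unrelated"),
]

if __name__ == "__main__":
    db = build_thread_index(SAMPLE)
    for message_id, depth in thread_of(db, "<a@example.org>"):
        print("  " * depth + message_id)
```

Because the thread is computed at query time from the In-Reply-To links, adding a message is a single insert and no batch rebuild of the thread data is needed in this sketch.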
-john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 07:05:42 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id HAA14784 for freebsd-database-outgoing; Mon, 30 Mar 1998 07:05:42 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from ns1.yes.no (ns1.yes.no [195.119.24.10]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id HAA14777 for ; Mon, 30 Mar 1998 07:05:36 -0800 (PST) (envelope-from eivind@bitbox.follo.net) Received: from bitbox.follo.net (bitbox.follo.net [194.198.43.36]) by ns1.yes.no (8.8.7/8.8.7) with ESMTP id QAA26529; Mon, 30 Mar 1998 16:04:50 GMT Received: (from eivind@localhost) by bitbox.follo.net (8.8.8/8.8.6) id RAA02102; Mon, 30 Mar 1998 17:05:26 +0200 (MET DST) Message-ID: <19980330170411.35652@follo.net> Date: Mon, 30 Mar 1998 17:04:11 +0200 From: Eivind Eklund To: John Fieber Cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980330110200.17368@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.89.1i In-Reply-To: ; from John Fieber on Mon, Mar 30, 1998 at 09:48:45AM -0500 Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 09:48:45AM -0500, John Fieber wrote: > And on and on... I think it is time to add an FAQ entry on why > we don't use hypermail or MHonArc for the mailing list archives. I've commented (several times) that www.findmail.com is offering to archive mailing lists, and has a very nice interface (I've elected to drop out of some mailing lists and only read them through findmail...) If you don't want the work of maintaining this, they seem to be doing a _very_ good job of archiving and retrieving messages. We permit external archives, right? John, do you mind if I point them at the exported raw data? Eivind.
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 07:41:18 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id HAA21426 for freebsd-database-outgoing; Mon, 30 Mar 1998 07:41:18 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id HAA21418; Mon, 30 Mar 1998 07:41:13 -0800 (PST) (envelope-from nik@iii.co.uk) From: nik@iii.co.uk Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25]) by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id QAA20303; Mon, 30 Mar 1998 16:40:39 +0100 (BST) Received: (from nik@localhost) by carrig.strand.iii.co.uk (8.8.8/8.8.7) id QAA07247; Mon, 30 Mar 1998 16:40:25 +0100 (BST) Message-ID: <19980330164024.47510@iii.co.uk> Date: Mon, 30 Mar 1998 16:40:24 +0100 To: John Fieber Cc: shimon@simon-shapiro.org, Wolfram Schneider , freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami , Amancio Hasty Subject: Re: Mailing list search interface References: <19980330110200.17368@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.85e In-Reply-To: ; from John Fieber on Mon, Mar 30, 1998 at 09:48:45AM -0500 Organization: interactive investor Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 09:48:45AM -0500, John Fieber wrote: > > At the end of the conversion (which consisted of running MHonArc 2.2.0 > > over the files, and then using Glimpse 4.1 to index them) I had a total > > of 32,910K HTML and index files. > > > > The output of 'time -l' on the conversion process was: > > > > 626.11 real 438.83 user 93.13 sys > > On what sort of hardware? 200 Mhz PPro w/64MB of RAM and 256MB of swap. 
At the time I was running XFree86 3.3.2, Netscape, Xemacs and a dozen or so xterms (tcsh, mutt, slrn). Load hovered around the .9-1.1 mark. Interactive response was fine. My disk is a single 2GB Atlas II, with tagged queuing turned *off* (because of buggy firmware which I haven't updated yet). > By quick back-of-an-envelope calculations, this is slower than > the current indexing scheme on hub by at least a factor of 10. The time above was for creation of the HTML archives and for indexing, not just indexing alone. > Indexing anything large is typically an I/O bound operation and > when you start indexing much more than can fit in RAM, your > performance will degrade dramatically, so it is probably slower > by much more than a factor of 10. Don't know. I'll grab last year's archive of -hackers (or another one, if there's another you think would be more representative) and try that. I can bring back figures for the time to create the entire archive (and index), the time just to index, and the time to add a new message and then reindex. I'd try this with the whole of the archives, but I don't have the spare disk space (yet). > Three months of -hackers != to 5 years of all the mailings lists. > I am confident that you will find that this scheme becomes a big > hairy hassle when you throw the whole thing at it. True enough. As I say, I'll try it and see. > The ranking algorithm that Glimpse uses (or used last I checked) > is primitive. (In a survey of what people liked, hated and most > wanted in the mailing list archives, people wanted thread > searching and date sorting, but only second and third *after* the > currently implemented ranking algorithm, which most people found > to work very well most of the time.) Are those survey results available online somewhere? > It isn't that things like MHonArc are not valiant efforts, but > they are merely refinements of what is fundamentally a > quick-and-dirty, non-scalable solution.
As I hinted in another > message, a proper solution would be based on a hybrid full > text/RDBMS. Whether a true hybrid system is built, or just the > illusion is built using some crafty CGI scripts is a detail to be > worked out. A hybrid system is on my list of things to build here (but it'll be Oracle based). I haven't investigated Postgres enough to know if it's up to the task. N -- Work: nik@iii.co.uk | FreeBSD + Perl + Apache Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need Play: nik@freebsd.org | Microsoft? To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 07:41:38 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id HAA21490 for freebsd-database-outgoing; Mon, 30 Mar 1998 07:41:38 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from mail.cs.tu-berlin.de (root@mail.cs.tu-berlin.de [130.149.17.13]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id HAA21417; Mon, 30 Mar 1998 07:41:11 -0800 (PST) (envelope-from wosch@cs.tu-berlin.de) Received: from caramba.cs.tu-berlin.de (wosch@caramba.cs.tu-berlin.de [130.149.17.12]) by mail.cs.tu-berlin.de (8.8.8/8.8.8) with ESMTP id RAA03617; Mon, 30 Mar 1998 17:34:11 +0200 (MET DST) Received: (from wosch@localhost) by caramba.cs.tu-berlin.de (8.8.8/8.8.8) id RAA25384; Mon, 30 Mar 1998 17:33:58 +0200 (MET DST) Message-ID: <19980330173358.57866@caramba.cs.tu-berlin.de> Date: Mon, 30 Mar 1998 17:33:58 +0200 From: Wolfram Schneider To: John Fieber , nik@iii.co.uk Cc: shimon@simon-shapiro.org, Wolfram Schneider , freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami , Amancio Hasty Subject: Re: Mailing list search interface References: <19980330110200.17368@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: ; from John Fieber on Mon, Mar 30, 1998 at 09:48:45AM -0500 Sender: 
owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 1998-03-30 09:48:45 -0500, John Fieber wrote: > > I mentioned MHonArc to Jordan, and his first response was > > > > > Eeek! The evil MHonArc resurfaces! ;-) > > > > > > It doesn't scale at all well - just try MHonArc'ing a really big mailing > > > list archive. You soon get a set of monster html files that are > > > essentially unusable - I know, I did the short-lived "FreeBSD Docs" > > > CD for awhile using MHonArc. > > Listen to the man! He knows what he is talking about...well, in > this case at least. :) Agreed. > Though I have no first-hand proof, knowing how Glimpse works, I > suspect searches will generate quite a bit more disk I/O on the > server than freeWAIS. There is a technical report about glimpse, 10 pages. I strongly recommend reading this paper before using glimpse in real-world applications! ftp://ftp.cs.arizona.edu/glimpse/glimpse.ps.Z Basically, glimpse does a linear full text search like grep. Searching 400MB of E-Mails will take twice the time (for CPU *and* disk I/O) as searching 200MB. Glimpse does not scale by design. In the best case glimpse is 256 x faster than grep, in the worst case it is as slow as grep. > And on and on... I think it is time to add an FAQ entry on why > we don't use hypermail or MHonArc for the mailing list archives.
;-) -- Wolfram Schneider http://www.freebsd.org/~wosch/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 08:03:43 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id IAA24310 for freebsd-database-outgoing; Mon, 30 Mar 1998 08:03:43 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id IAA24300 for ; Mon, 30 Mar 1998 08:03:41 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id LAA07196; Mon, 30 Mar 1998 11:03:39 -0500 (EST) Date: Mon, 30 Mar 1998 11:03:39 -0500 (EST) From: John Fieber To: nik@iii.co.uk cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface In-Reply-To: <19980330164024.47510@iii.co.uk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk [excessive CC list removed] On Mon, 30 Mar 1998 nik@iii.co.uk wrote: > On Mon, Mar 30, 1998 at 09:48:45AM -0500, John Fieber wrote: > > > The output of 'time -l' on the conversion process was: > > > > > > 626.11 real 438.83 user 93.13 sys > > > > On what sort of hardware? > > 200 Mhz PPro w/64MB of RAM and 256MB of swap. So, in the same ballpark of hub. Hub has more RAM, but I limit the RAM consumption of waisindex to around 25-30MB because there is a lot of other stuff going on on the machine that I don't want to interfere with. > > By quick back-of-an-envelope calculations, this is slower than > > the current indexing scheme on hub by at least a factor of 10. > > The time above was for creation of the HTML archives and for indexing, > not just indexing alone. 
Ah, but the thread index creation is inseparable from the creation of the HTML archives, yet the HTML creation is a complete waste of time and disk space. It is far more efficient to generate the HTML on the fly because only a tiny fraction of the messages will ever be viewed. Contrast with Wolfram's thread scheme which just builds a message-id based index for threads. > Are those survey results available online somewhere? No, I'll have to dig a bit and they are probably not in a very useful form. I'll have to fire up SPSS and generate some reports... -john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 10:10:48 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id KAA15677 for freebsd-database-outgoing; Mon, 30 Mar 1998 10:10:48 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id KAA15669 for ; Mon, 30 Mar 1998 10:10:44 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 1390 invoked from network); 30 Mar 1998 18:19:49 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 18:19:49 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 30 Mar 1998 10:19:49 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: Bob Bishop Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: Amancio Hasty , (Satoshi Asami) , scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider Sender: owner-freebsd-database@FreeBSD.ORG 
Precedence: bulk On 30-Mar-98 Bob Bishop wrote: > At 10:57 pm +0100 29/3/98, Simon Shapiro wrote: >>.. >>We have been playing with the idea of normalizing the archive into an >>RDBMS. [etc] > > Great, but to be useful you need free-text search too. Yup. This is in the head-scratching stage still. I was thinking of either glimpse or maybe simply extracting the blob and applying regex to it. Comments? ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 10:16:05 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id KAA16657 for freebsd-database-outgoing; Mon, 30 Mar 1998 10:16:05 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from hub.org (hub.org [209.47.148.200]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id KAA16652 for ; Mon, 30 Mar 1998 10:16:04 -0800 (PST) (envelope-from scrappy@hub.org) Received: from localhost (scrappy@localhost) by hub.org (8.8.8/8.7.5) with SMTP id NAA28387; Mon, 30 Mar 1998 13:16:01 -0500 (EST) Date: Mon, 30 Mar 1998 13:16:00 -0500 (EST) From: The Hermit Hacker To: Simon Shapiro cc: freebsd-database@FreeBSD.ORG Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998, Simon Shapiro wrote: > > On 30-Mar-98 Bob Bishop wrote: > > At 10:57 pm +0100 29/3/98, Simon Shapiro wrote: > >>.. > >>We have been playing with the idea of normalizing the archive into an > >>RDBMS. [etc] > > > > Great, but to be useful you need free-text search too. > > Yup. This is in the head-scratching stage still. I was thinking of either > glimpse or maybe simply extracting the blob and applying regex to it. > Comments? 
Just checked into it, and *supposedly* there is a free-text search module for PostgreSQL available... To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 10:17:24 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id KAA16801 for freebsd-database-outgoing; Mon, 30 Mar 1998 10:17:24 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id KAA16792 for ; Mon, 30 Mar 1998 10:17:16 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 1515 invoked from network); 30 Mar 1998 18:26:29 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 18:26:29 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <19980330110200.17368@iii.co.uk> Date: Mon, 30 Mar 1998 10:26:29 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: nik@iii.co.uk Subject: RE: Mailing list search interface Cc: Amancio Hasty , Satoshi Asami , scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk I have no strong opinions in this matter. My experience with indexing methods residing in (essentially) flat Unix files is that they do not scale well. This is what database engines are for. Truth must be told, currently PostgreSQL uses Unix files to store its indices and tables, so performance is not all that it could be. 
I am working on building a raw device storage manager for PostgreSQL, which will allow shared access (cluster like) and much higher speed. The only issue I have not settled on is how to search the message bodies. Maybe I will get some free time soon and try a few things. What is an acceptable search rate? For header type data? For body regex? BTW, if your project is almost ready, go ahead with it. It does not conflict at all with what I am thinking of. On 30-Mar-98 nik@iii.co.uk wrote: > Gents, > > On Sun, Mar 29, 1998 at 01:57:30PM -0800, Simon Shapiro wrote: >> On 26-Mar-98 Wolfram Schneider wrote: >> > The FreeBSD mailing list search interface support threads. The >> > thread database will be updated hourly. Of course there are >> > many things to do to make the threads more user friendly. >> >> We have been playing with the idea of normalizing the archive into an >> RDBMS. Some of the benefits are: > > > > Could we coordinate on some of this? I've been working on a system (at > work) for making some of our mailing list archives visible and searchable > on our internal site. I'm using MHonArc, Glimpse (both of which are in > the ports tree) and a customised version of Wilma > > > > and it's almost at the point where this would be useful for the project. > > I mentioned MHonArc to Jordan, and his first response was > >> Eeek! The evil MHonArc resurfaces! ;-) >> >> It doesn't scale at all well - just try MHonArc'ing a really big mailing >> list archive. You soon get a set of monster html files that are >> essentially unusable - I know, I did the short-lived "FreeBSD Docs" >> CD for awhile using MHonArc. > > I think he's been using an older version of MHonArc. I did some tests > late last week, archiving and indexing the archives for -hackers from > the beginning of 1998. That's 11,265K or thereabouts.
> > At the end of the conversion (which consisted of running MHonArc 2.2.0 > over the files, and then using Glimpse 4.1 to index them) I had a total > of 32,910K HTML and index files. > > The output of 'time -l' on the conversion process was: > > 626.11 real 438.83 user 93.13 sys > 8572 maximum resident set size > 390 average shared memory size > 4311 average unshared data size > 128 average unshared stack size > 1054806 page reclaims > 68 page faults > 0 swaps > 9725 block input operations > 6115 block output operations > 0 messages sent > 0 messages received > 0 signals received > 18065 voluntary context switches > 26547 involuntary context switches > > That's a reasonably exceptional time, because it had to build the archive > for the year to date, and you only take this hit once. Once the archive > is up and running, you're only building HTML files for new messages since > the last update, which is (or should be) considerably faster. > > Regrettably at the moment, there's a bug in Glimpse 4.1, which means that > you need to reindex the entire archive, rather than just those bits that > change. Fortunately, there are command line switches to tell the > glimpseindex program how much memory to use. > > That 8572 max. resident size figure is from MHonArc rather than glimpse, > since it reads in (as far as I can tell) the whole of the mail archive > file before processing it. > > While the conversion was happening the load on my machine hovered around > the .9-1.1 mark. With X, Netscape, XEmacs and a bunch of xterms open. > > At the end of the conversion process I had a threaded copy of the > -hackers > mail archives going back almost three months. > > Each month has two indices -- a date index where you see all the messages > in the order they came in, and a threaded index. > > Each index shows (at most) 200 messages (that's a configurable number). > This is so the size of the index files doesn't grow without end. 
Each > index shows a "This is page x of y of the threaded index" comment, with > navigation text to go backwards and forwards in the index. > > This whole thing is searchable, allowing searches by combination of > keywords. You can specify the number of misspellings to allow, the > number of hits to return, case sensitivity, and which months to restrict > your search to. > > The only thing you can't do (at the moment) is search across more than > one > mailing list. It shouldn't be too hard to add. Right now, I don't have a > URL I can give to show you the results, since I ran out of time last > night > (I must be getting old, I used to be able to do 72 hour coding runs and > not > really feel it ). I should be able to get something demonstrable > up on my freefall account by the middle of next week. > > In light of all that, do you think this is worth pursuing further? > > Thoughts? > > N > -- > Work: nik@iii.co.uk | FreeBSD + Perl + Apache > Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need > Play: nik@freebsd.org | Microsoft?
---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 10:44:41 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id KAA19588 for freebsd-database-outgoing; Mon, 30 Mar 1998 10:44:41 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from isbalham.ist.co.uk (isbalham.ist.co.uk [192.31.26.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id KAA19576 for ; Mon, 30 Mar 1998 10:44:36 -0800 (PST) (envelope-from rb@gid.co.uk) Received: from gid.co.uk (uucp@localhost) by isbalham.ist.co.uk (8.8.7/8.8.4) with UUCP id TAA21728; Mon, 30 Mar 1998 19:43:13 +0100 (BST) Received: from [194.32.164.2] by seagoon.gid.co.uk; Mon, 30 Mar 1998 19:48:50 +0100 (BST) X-Sender: rb@194.32.164.1 Message-Id: In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 30 Mar 1998 19:45:33 +0100 To: shimon@simon-shapiro.org From: Bob Bishop Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: Amancio Hasty , (Satoshi Asami) , scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk At 7:19 pm +0100 30/3/98, Simon Shapiro wrote: >On 30-Mar-98 Bob Bishop wrote: >>...free-text search too. > >Yup. This is in the head-scratching stage still. I was thinking of either >glimpse or maybe simply extracting the blob and applying regex to it. >Comments? Slooow. The best you can say is that they both (IIRC about glimpse) scale linearly. You really need something dictionary-based. 
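To make the contrast with a linear scan concrete, here is a minimal sketch of a dictionary-based (inverted) index in Python. It is an illustration only, not how freeWAIS or any particular engine works; the function names, the tokenizer and the sample texts are hypothetical, and a real index would live on disk rather than in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so intersecting does not modify the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample "messages", keyed by an arbitrary document id.
DOCS = {
    1: "Glimpse does a linear full text search",
    2: "A dictionary based index avoids the scan",
    3: "grep is a linear scan over every message",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(sorted(search(index, "linear")))
    print(sorted(search(index, "linear scan")))
```

The cost of a query here depends on the number of query words and the size of their posting sets, not on the total size of the archive, which is the property a regex over every message body lacks.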
-- Bob Bishop (0118) 977 4017 international code +44 118 rb@gid.co.uk fax (0118) 989 4254 between 0800 and 1800 UK To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 10:56:40 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id KAA22776 for freebsd-database-outgoing; Mon, 30 Mar 1998 10:56:40 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id KAA22767 for ; Mon, 30 Mar 1998 10:56:35 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id NAA07824; Mon, 30 Mar 1998 13:56:24 -0500 (EST) Date: Mon, 30 Mar 1998 13:56:24 -0500 (EST) From: John Fieber To: Simon Shapiro cc: freebsd-database@FreeBSD.ORG Subject: RE: Mailing list search interface In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998, Simon Shapiro wrote: > Truth must be told, currently PostgreSQL uses Unix files to store its > indices and tables, so performance is not all that it could be. I am A properly constructed index for a full text database (read: NOT glimpse) requires very little disk i/o for most queries. E.g., prefix trie hashing requires about two reads per search term in the query. I just read a paper describing some optimization that reduces that to one read about 50% of the time.
-john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 11:43:10 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id LAA03738 for freebsd-database-outgoing; Mon, 30 Mar 1998 11:43:10 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id LAA03683 for ; Mon, 30 Mar 1998 11:43:03 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 2790 invoked from network); 30 Mar 1998 19:52:12 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 19:52:12 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <19980330123130.39177@caramba.cs.tu-berlin.de> Date: Mon, 30 Mar 1998 11:52:11 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: Wolfram Schneider Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: Amancio Hasty , Satoshi Asami , scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 Wolfram Schneider wrote: > On 1998-03-29 13:57:30 -0800, Simon Shapiro wrote: >> We have been playing with the idea of normalizing the archive into an >> RDBMS. Some of the benefits are: >> >> * no need to update the threads database. It will always be updated. >> * Users can create, easily, their own thread logic with no impact on >> system performance. >> * Searching on normalized fields are many times faster, and much less >> costly in system resources. > > Some figures ... 
> > The FreeBSD mailing list archive is 620MB large. There are currently > 270,000 messages. The archive grow with 100,000 messages/year. Excellent. How many years back do we want to keep? > If you plan to use a real SQL database, you should consider at least > 500,000 data sets, better 1 million. You need 2GB for the raw E-Mails > and 2-4GB for the index. I don't know if there are free available > databases which can handle this large data. Large? Assume 1 million messages in the ``current'' database. People can search the ``ancient'' database separately. Even if your dataset numbers are correct, this fits in two 4GB partitions in a RAID array. For 4 million records, an indexed search in PostgreSQL 6.2.1 took about 1-2 seconds on a busy system (make buildworld in the background). > That was the hardware part. You must hire a database expert, a Web > designer and a cgi script programmer. All people should be willing to > work for at least 2-3 years on this project. This is not an easy task. Using your logic, we should close the FreeBSD project, as maintaining an Operating system like this takes 200-300 kernel experts. The database expert is available and willing to do it for free. If not, there are other database experts among FreeBSD users. A CGI interface already exists for the database interface. The HTML interface can be written by people like those who did the excellent job on the FreeBSD web pages. In other words, if the FreeBSD project cannot find the people to do this, then no one can. BTW, your time estimate is good if you plan to be paid hourly for it. I built much, much more complex RDBMS based information systems in a fraction of that time. An email parser is no more than a week. The text search about the same. > A full update of the thread database took 6 min on hub (Pentium Pro), > that's 100MB/min ;-) An update for the last week took 3-6 seconds. Something is too good to be true here. How can you read Unix filesystems at 100 Megabytes per second? 
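The figures quoted in this message can be checked with a little arithmetic. Note that the quoted rate is per minute, not per second. The sketch below assumes that a full update reads the whole 620MB archive once, and that average message size and yearly growth stay constant; both are assumptions for illustration, not statements from the thread, and the variable names are hypothetical.

```python
ARCHIVE_MB = 620            # archive size quoted in the thread
MESSAGES = 270_000          # current message count quoted in the thread
GROWTH_PER_YEAR = 100_000   # growth quoted in the thread
FULL_UPDATE_MIN = 6         # full thread-database update time quoted in the thread

# Throughput of a full update, assuming it reads the whole archive once.
mb_per_min = ARCHIVE_MB / FULL_UPDATE_MIN
mb_per_sec = mb_per_min / 60

# Projection to one million messages, assuming constant size and growth.
years_to_million = (1_000_000 - MESSAGES) / GROWTH_PER_YEAR
raw_mb_at_million = ARCHIVE_MB / MESSAGES * 1_000_000

print(f"{mb_per_min:.0f} MB/min = {mb_per_sec:.1f} MB/s")
print(f"{years_to_million:.1f} years to reach 1 million messages")
print(f"about {raw_mb_at_million / 1000:.1f} GB of raw mail at 1 million messages")
```

Under these assumptions the update rate is roughly 1.7 MB/s, and a million messages come to a little over 2GB of raw mail, which is in line with the "2GB for the raw E-Mails" estimate quoted above.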
Also, if the current engine is so great, how come all these people are excited about replacing it? I have no opinion as my usage is too scarce and too superficial to voice any opinion. My position is that IF there is a desire to build an RDBMS based engine, I will be happy to contribute my modest knowledge in the matters and some of my time. ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 11:53:38 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id LAA05906 for freebsd-database-outgoing; Mon, 30 Mar 1998 11:53:38 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id LAA05893 for ; Mon, 30 Mar 1998 11:53:32 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 2922 invoked from network); 30 Mar 1998 20:02:45 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 20:02:45 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 30 Mar 1998 12:02:45 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: John Fieber Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: Amancio Hasty , Satoshi Asami , scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 John Fieber wrote: ... 
> It has been well established for many years by professionals in > database R&D that traditional RDBMSs are utterly and completely > the wrong tool for free text searching. This turns out to be > true even for some relatively structured data types like > bibliographic records. I made a decent career building systems that ``established professionals'' said could not be built. We can discuss some of these privately :-) Having said that, you are probably right, to a degree. The way around it is NOT to search free text in the database. > There *are* some tasks in real-world applications that are > RDBMS type things--a message-id based thread index is simple to > implement for instance--so I'm all for hybrid systems. The big > RDBMS vendors usually have some optional module optimized for > free-text searching and some SQL extensions to access it. > I've pondered writing such a module for postgres, but don't > really know enough about extending postgres to know how well it > would work. This is what I had in mind exactly. To normalize what can be normalized, and leave the rest of it as text. The problem, in a UFS, is that when the number of files in the filesystem grows, directory searches become very costly. The mail archive (as secondary as it may appear) is an opportunity to investigate these issues. With a million messages split across several dozen directories (unless you hash the message IDs into the lists, etc.), you should be seeing some performance degradation in open(2), which does directory scans. How about putting the message body as a TEXT datatype into the RDBMS? At least you can query it by some integer index. This means you can use a B-Tree to find the message, rather than dirscan. If the message is in a blob, applying regex to it, from within the database can be optimized. Another option you mention, and Postgres is IDEAL for that, is a new, native data type. Search logic can then be applied, and even integrated with the system. 
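The "normalize what can be normalized, leave the rest as text" idea, together with the message-id based thread index mentioned in the quoted text, might look roughly like the sketch below. It uses Python's built-in sqlite3 module purely as a convenient stand-in for PostgreSQL, and a recursive query that the PostgreSQL of this thread's era did not offer; the table name, column names and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (
    id          INTEGER PRIMARY KEY,   -- integer key, found via a B-tree
    message_id  TEXT UNIQUE NOT NULL,  -- Message-ID header
    in_reply_to TEXT,                  -- parent Message-ID, NULL for a thread root
    list_name   TEXT NOT NULL,
    sent        TEXT NOT NULL,         -- ISO 8601 date, sortable as text
    subject     TEXT NOT NULL,
    body        TEXT NOT NULL          -- message text, stored unparsed
);
CREATE INDEX message_parent ON message (in_reply_to);
CREATE INDEX message_sent   ON message (sent);
""")

# Hypothetical sample rows, for illustration only.
rows = [
    ("<a@example.org>", None, "freebsd-database", "1998-03-29", "Search interface", "first"),
    ("<b@example.org>", "<a@example.org>", "freebsd-database", "1998-03-30", "Re: Search interface", "reply"),
    ("<c@example.org>", "<b@example.org>", "freebsd-database", "1998-03-30", "Re: Search interface", "reply to reply"),
    ("<d@example.org>", None, "freebsd-database", "1998-03-30", "Unrelated", "other"),
]
conn.executemany(
    "INSERT INTO message (message_id, in_reply_to, list_name, sent, subject, body) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)


def thread(conn, root_message_id):
    """Return the Message-IDs of the thread rooted at root_message_id."""
    sql = """
    WITH RECURSIVE t(message_id) AS (
        SELECT message_id FROM message WHERE message_id = ?
        UNION ALL
        SELECT m.message_id FROM message m JOIN t ON m.in_reply_to = t.message_id
    )
    SELECT message_id FROM t
    """
    return [row[0] for row in conn.execute(sql, (root_message_id,))]


def sent_since(conn, date):
    """Count messages sent on or after the given ISO date (date scoping)."""
    return conn.execute(
        "SELECT COUNT(*) FROM message WHERE sent >= ?", (date,)
    ).fetchone()[0]


print(thread(conn, "<a@example.org>"))
print(sent_since(conn, "1998-03-30"))
```

Thread lookup and date scoping both go through indexes here, while the body column stays opaque text; free-text search over that column is the part this sketch deliberately leaves out.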
Something to think about. Simon To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 12:15:53 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id MAA10996 for freebsd-database-outgoing; Mon, 30 Mar 1998 12:15:53 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id MAA10988 for ; Mon, 30 Mar 1998 12:15:48 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 3345 invoked from network); 30 Mar 1998 20:25:00 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 20:25:00 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <19980330164024.47510@iii.co.uk> Date: Mon, 30 Mar 1998 12:25:00 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: nik@iii.co.uk Subject: Re: Mailing list search interface Cc: Amancio Hasty , Satoshi Asami , scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider , John Fieber Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 nik@iii.co.uk wrote: ... > My disk is single 2GB Atlas II, with tagged queuing turned *off* (because > of buggy firmware which I haven't updated yet). Ah! This is useful information. thanx! >> By quick back-of-an-envelope calculations, this is slower than >> the current indexing scheme on hub by at least a factor of 10. > > The time above was for creation of the HTML archives and for indexing, > not just indexing alone. This is something we need to keep in mind. 
Generating 100% output coverage for (probably) less than 10% need is wasteful. >> Indexing anything large is typically an I/O bound operation and >> when you start indexing much more than can fit in RAM, your >> performance will degrade dramatically, so it is probably slower >> by much more than a factor of 10. > > Don't know. I'll grab last year's archive of -hackers (or another one, > if there's another you think would be more representative) and try that. > I can bring back figures for the time to create the entire archive (and > index), the time just to index, and the time to add a new message and > then reindex. Listen to the man :-) It gets worse. Extrapolation on a non-linear function is called gambling :-) You will run into scaling problems at certain sizes. The worsening can be dramatic. > I'd try this with the whole of the archives, but I don't have the spare > disk space (yet). I have. Is there an efficient way to get the whole archive here? Downloading on a modem is NOT considered efficient. > Are those survey results available online somewhere? Please! > A hybrid system is on my list of things to build here (but it'll be > Oracle based). I haven't investigated Postgres enough to know if it's > up to the task. Oracle based is good. Now, please tell us how to run Oracle on FreeBSD, legally, and with source available. PostgreSQL is up to the task. This is not a dramatically complex database problem. Pretty much a linear table, with the text searching TBD. 
---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 12:20:54 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id MAA12016 for freebsd-database-outgoing; Mon, 30 Mar 1998 12:20:54 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id MAA12009 for ; Mon, 30 Mar 1998 12:20:50 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 3514 invoked from network); 30 Mar 1998 20:30:04 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 20:30:04 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 30 Mar 1998 12:30:04 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: The Hermit Hacker Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: freebsd-database@FreeBSD.ORG Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 The Hermit Hacker wrote: > On Mon, 30 Mar 1998, Simon Shapiro wrote: > >> >> On 30-Mar-98 Bob Bishop wrote: >> > At 10:57 pm +0100 29/3/98, Simon Shapiro wrote: >> >>.. >> >>We have been playing with the idea of normalizing the archive into an >> >>RDBMS. [etc] >> > >> > Great, but to be useful you need free-text search too. >> >> Yup. This is in the head-scratching stage still. I was thinking of >> either >> glimpse or maybe simply extracting the blob and applying regex to it. >> Comments? 
> > Just checked into it, and *supposedly* there is a free-text search > module for PostgreSQL available... Which? Where? This is good news. Saves some headaches. ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 12:24:10 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id MAA12356 for freebsd-database-outgoing; Mon, 30 Mar 1998 12:24:10 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id MAA12342 for ; Mon, 30 Mar 1998 12:24:05 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 3571 invoked from network); 30 Mar 1998 20:33:15 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 20:33:15 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 30 Mar 1998 12:33:15 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: Bob Bishop Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: Wolfram Schneider , freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, (Satoshi Asami) , Amancio Hasty Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 Bob Bishop wrote: > At 7:19 pm +0100 30/3/98, Simon Shapiro wrote: >>On 30-Mar-98 Bob Bishop wrote: >>>...free-text search too. >> >>Yup. This is in the head-scratching stage still. I was thinking of >>either >>glimpse or maybe simply extracting the blob and applying regex to it. 
>>Comments? > > Slooow. The best you can say is that they both (IIRC about glimpse) scale > linearly. You really need something dictionary-based. Somebody smart may want to build a new Postgres data type that does exactly that: grab a text array, parse into words, sort into a dictionary, etc. Sounds like a fun project. Should not take all year either. ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 12:28:05 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id MAA12784 for freebsd-database-outgoing; Mon, 30 Mar 1998 12:28:05 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id MAA12715 for ; Mon, 30 Mar 1998 12:27:58 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 3648 invoked from network); 30 Mar 1998 20:37:11 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 20:37:11 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 30 Mar 1998 12:37:11 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: John Fieber Subject: RE: Mailing list search interface Cc: freebsd-database@FreeBSD.ORG Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 John Fieber wrote: > On Mon, 30 Mar 1998, Simon Shapiro wrote: > >> Truth must be told, currently PostgreSQL uses Unix files to store its >> indices and tables, so performance is not all that it could be. 
I am > > A properly constructed index for a full text database (read: NOT > glimpse) requires very little disk i/o for most queries. Eg, > prefix trie hashing requires about two reads per search term in > the query. I just read a paper describing some optimization that > reduces that to one read about 50% of the time. A picture starts emerging here, folks. We normalize the normalizable and then build a datatype which knows to do dictionary based searches on the text. The excellent news here is that disk I/O per record can be reduced. This allows us to easily utilize more than one Unix instance/host per database. This gives us the memory and CPU bandwidth. This can turn really useful real fast. BTW, when considering text/scripts/database alternatives, think not only about generating the search indices, but query too. Decent RDBMS engines cache these things very well, in userspace. ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 13:03:52 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id NAA21921 for freebsd-database-outgoing; Mon, 30 Mar 1998 13:03:52 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA21906 for ; Mon, 30 Mar 1998 13:03:41 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id QAA08249; Mon, 30 Mar 1998 16:03:30 -0500 (EST) Date: Mon, 30 Mar 1998 16:03:30 -0500 (EST) From: John Fieber To: Simon Shapiro cc: freebsd-database@FreeBSD.ORG Subject: Mail indexing infrastructure In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: 
owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998, Simon Shapiro wrote: > > The FreeBSD mailing list archive is 620MB large. There are currently > > 270,000 messages. The archive grow with 100,000 messages/year. > > Excellent. How many years back do we want to keep? The current indexed archive goes back to 1994. > Also, if the current engine is so great, how come all these people are > excited about replacing it? Thread retrieval and date scoping. However, most proposed solutions involve a wholesale replacement rather than augmenting what we have, which works pretty well, all told. Basically, the vector-space ranked retrieval we already have, possibly scoped by date, is the best way to start a search, followed by thread retrieval once a promising message has been found. Wolfram's home-brew solution for threads is more along the lines of what we need. I have working date scoping in prototype, but there are performance problems--freeWAIS really doesn't handle that sort of thing very well and I'm a bit concerned about killing www.freebsd.org with it because I know it will be a popular feature. I also have half a mind to provide relevance feedback (a "find more like this..." link) but my free time is much smaller than the things I have to fill it with. 
:( -john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 13:27:43 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id NAA25537 for freebsd-database-outgoing; Mon, 30 Mar 1998 13:27:43 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id NAA25507 for ; Mon, 30 Mar 1998 13:27:39 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 4686 invoked from network); 30 Mar 1998 21:36:52 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 21:36:52 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 30 Mar 1998 13:36:52 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: John Fieber Subject: RE: Mail indexing infrastructure Cc: freebsd-database@FreeBSD.ORG Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 John Fieber wrote: > On Mon, 30 Mar 1998, Simon Shapiro wrote: > >> > The FreeBSD mailing list archive is 620MB large. There are currently >> > 270,000 messages. The archive grow with 100,000 messages/year. >> >> Excellent. How many years back do we want to keep? > > The current indexed archive goes back to 1994. This is not an answer to my question :-) Currently we are keeping 4 years. Do we want to keep 40? 10? 5? Some (theoretical) limit has to be put. >> Also, if the current engine is so great, how come all these people are >> excited about replacing it? > > Thread retrieval and date scoping. 
However, most proposed > solutions involve a wholesale replacement rather than augmenting > what we have, which works pretty well, all told. If thread retrieval is based on Subject: line, an RDBMS is a trivially good solution. One can even apply regex to the subject, limit dates, etc. I admit having an interest in this which goes beyond mail archives search. In this context here, though, my RDBMS tilt can be viewed as intellectually satisfying. If the current system is good and should only be augmented, rather than replaced, this is fine by me. > Basically, the vector-space ranked retrieval we already have, > possibly scoped by date, is the best way to start a search, > followed by thread retrieval once a promising message has been > found. Wolfram's home-brew solution for threads is more along the > lines of what we need. Don't confuse solutions and problems. You currently have a text searching system, which you are happy with. Aside from that, I am not replacing it, not augmenting it, just pondering the problems that exist and whether an RDBMS solution can be applied to such a problem. My take on it is that until we actually build the core RDBMS schema for it, load it and run some tests, we will really not know if it is worth it, in the performance department. For other instances there are some other considerations, of course. > I have working date scoping in prototype, but there are > performance problems--freeWAIS really doesn't handle that sort of > thing very well and I'm a bit concerned about killing > www.freebsd.org with it because I know it will be a popular > feature. Of course. You will be doing ``full table scan'' for date scoping. > I also have half a mind to provide relevance feedback (a "find > more like this..." link) but my free time is much smaller than > the things I have to fill it with. :( This is where RDBMS can help. You do not arrange the data for a query. You ``normalize'' the data. 
Queries come later, in unplanned for manner and are serviced with reasonable efficiency. ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 13:37:02 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id NAA27302 for freebsd-database-outgoing; Mon, 30 Mar 1998 13:37:02 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA27297 for ; Mon, 30 Mar 1998 13:36:51 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id QAA08329; Mon, 30 Mar 1998 16:36:25 -0500 (EST) Date: Mon, 30 Mar 1998 16:36:25 -0500 (EST) From: John Fieber To: Eivind Eklund cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface In-Reply-To: <19980330170411.35652@follo.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998, Eivind Eklund wrote: > We permit external archives, right? John, do you mind if I point them at > the exported raw data? No problem at all. :) I'm still interested in providing our own archive access regardless. 
-john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 13:49:16 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id NAA29198 for freebsd-database-outgoing; Mon, 30 Mar 1998 13:49:16 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from thelab.hub.org (tc-13.acadiau.ca [131.162.2.113]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA29172 for ; Mon, 30 Mar 1998 13:49:09 -0800 (PST) (envelope-from scrappy@hub.org) Received: from localhost (scrappy@localhost) by thelab.hub.org (8.8.8/8.8.2) with SMTP id RAA02021; Mon, 30 Mar 1998 17:48:45 -0400 (AST) X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs Date: Mon, 30 Mar 1998 17:48:45 -0400 (AST) From: The Hermit Hacker Reply-To: The Hermit Hacker To: Simon Shapiro cc: pgsql-hackers@postgresql.org, freebsd-database@FreeBSD.ORG Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998, Simon Shapiro wrote: > > On 30-Mar-98 The Hermit Hacker wrote: > > On Mon, 30 Mar 1998, Simon Shapiro wrote: > > > >> > >> On 30-Mar-98 Bob Bishop wrote: > >> > At 10:57 pm +0100 29/3/98, Simon Shapiro wrote: > >> >>.. > >> >>We have been playing with the idea of normalizing the archive into an > >> >>RDBMS. [etc] > >> > > >> > Great, but to be useful you need free-text search too. > >> > >> Yup. This is in the head-scratching stage still. I was thinking of > >> either > >> glimpse or maybe simply extracting the blob and applying regex to it. > >> Comments? > > > > Just checked into it, and *supposedly* there is a free-text search > > module for PostgreSQL available... > > Which? Where? This is good news. Saves some headaches. I've CC'd this into pgsql-hackers@postgresql.org ... 
Bruce pointed out that Maarten(?) was working on something like this... Marc G. Fournier Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 14:21:00 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id OAA04859 for freebsd-database-outgoing; Mon, 30 Mar 1998 14:21:00 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA04824 for ; Mon, 30 Mar 1998 14:20:53 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id RAA08505; Mon, 30 Mar 1998 17:20:33 -0500 (EST) Date: Mon, 30 Mar 1998 17:20:33 -0500 (EST) From: John Fieber To: Simon Shapiro cc: freebsd-database@FreeBSD.ORG Subject: RE: Mail indexing infrastructure In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, 30 Mar 1998, Simon Shapiro wrote: > > The current indexed archive goes back to 1994. > > This is not an answer to my question :-) Currently we are keeping 4 years. > Do we want to keep 40? 10? 5? Some (theoretical) limit has to be put. Oh, I would say indefinitely until there is a compelling reason to dump some. The more we have, however, the more essential date scoping becomes. I think it is already becoming a bit of a problem. > If thread retrieval is based on Subject: line, an RDBMS is a trivially good > solution. One can even apply regex to the subject, limit dates, etc. Good thread indexing is based on subjects, message-ids, dates and content. 
Quick-and-dirty thread retrieval is an easy RDBMS problem, good thread retrieval is rather more complex. For a nice summary outline of threading methods and their performance, see: Lewis, David; Knowles, Kimberly (1997). Threading Electronic Mail: A Preliminary Study. Information Processing & Management, 33(2):209-217. > If the current system is good and should only be augmented, > rather than replaced, this is fine by me. Let me re-phrase: most proposals to date do replacement without preservation of what is good with the current system. A wholesale replacement WITH preservation of what is good would be most welcome. I'd be the first to jump up and down with glee to find a viable alternative to freeWAIS for doing full text searches with stemming, soundex matching, automatic term weighting etc... freeWAIS is a festering heap of bugs, but it is the best the free software world has. Postgres with a module offering similar functionality would make me one happy camper. -john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 14:27:26 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id OAA06605 for freebsd-database-outgoing; Mon, 30 Mar 1998 14:27:26 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from phoenix.welearn.com.au (suebla.lnk.telstra.net [139.130.44.81]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA06507 for ; Mon, 30 Mar 1998 14:27:14 -0800 (PST) (envelope-from sue@phoenix.welearn.com.au) Received: (from sue@localhost) by phoenix.welearn.com.au (8.8.5/8.8.5) id IAA12783; Tue, 31 Mar 1998 08:27:06 +1000 (EST) Message-ID: <19980331082700.52299@welearn.com.au> Date: Tue, 31 Mar 1998 08:27:01 +1000 From: Sue Blake To: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980330164024.47510@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii 
X-Mailer: Mutt 0.88e In-Reply-To: ; from John Fieber on Mon, Mar 30, 1998 at 11:03:39AM -0500 Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 11:03:39AM -0500, John Fieber wrote: > On Mon, 30 Mar 1998 nik@iii.co.uk wrote: > > Are those survey results available online somewhere? > > No, I'll have to dig a bit and they are probably not in a very > useful form. I'll have to fire up SPSS and generate some > reports... There was a survey? Somebody wants to know? Please forgive this brief unlurk. I don't understand what you're doing but it looks like a tremendous amount of effort. As a user I only see one problem with the archive search: it doesn't find what I ask for. I'd even be happier if it held only recent material and took a few minutes to present the results as boring text files; if it found what I ask for it'd be worth using. Now it doesn't, and it's not. Any improvement would be wonderful! Example 1: Yesterday cron said "Cannot fork" which was meaningless, even after looking at the cron-related man pages and trying apropos fork. So I searched for "cannot and fork" and nothing came back. "cron and fork" came up with a bunch of stuff which didn't relate to cron at all but mentioned "fork" in entirely different contexts, often including the words "cannot fork" which the previous search had failed to see. I became very frustrated, started shutting things down, cron sprang to life and the penny dropped :-) Example 2: In December I posted a question and received about 6 good replies, which I promptly lost. In January I tried to search for them, over and over, and could only find my original and one reply. Often searches reveal the question but no answers can be found by any method, answers that I know have been posted to -questions and contain the searched words. How you want to make it work, how fast, how much disk space and memory, how cool the method is, how it looks, matters more to you people than to me. 
To be successful it needs to reliably do what users expect, and users need to be made very clear about how to use it and what to expect from it. Forget the latter and your efforts will never get the appreciation they obviously deserve. Thanks everyone for trying to work out a solution! -- Regards, -*Sue*- find / -name "*.conf" |more To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 14:49:15 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id OAA12170 for freebsd-database-outgoing; Mon, 30 Mar 1998 14:49:15 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from thumper.tmisnet.com (root@thumper.tmisnet.com [204.212.149.7]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA12151 for ; Mon, 30 Mar 1998 14:49:09 -0800 (PST) (envelope-from gryphon@healer.com) Received: from healer.com (195.west-palm-beach-01.fl.dial-access.att.net [12.70.38.195]) by thumper.tmisnet.com (8.8.5/8.8.5) with ESMTP id OAA23387; Mon, 30 Mar 1998 14:48:48 -0800 (PST) Message-ID: <352021B3.C910AFC7@healer.com> Date: Mon, 30 Mar 1998 17:50:27 -0500 From: Coranth Gryphon X-Mailer: Mozilla 4.04 [en] (Win95; U) MIME-Version: 1.0 To: Sue Blake CC: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980330164024.47510@iii.co.uk> <19980331082700.52299@welearn.com.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk Sue Blake wrote: > To be successful it needs to reliably do what users expect, and users > need to be made very clear about how to use it and what to expect Or put another way, a tool is not useful if it's not used. 
-coranth ------------------------------------------+---------------------------- Coranth Gryphon | #include http://www.healer.com | #include ------------------------------------------+---------------------------- Fear will lead us to places that only Love can get us out of. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 15:13:43 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id PAA18005 for freebsd-database-outgoing; Mon, 30 Mar 1998 15:13:43 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id PAA17939 for ; Mon, 30 Mar 1998 15:13:33 -0800 (PST) (envelope-from jfieber@indiana.edu) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id SAA08605; Mon, 30 Mar 1998 18:13:05 -0500 (EST) Date: Mon, 30 Mar 1998 18:13:05 -0500 (EST) From: John Fieber To: Sue Blake cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface In-Reply-To: <19980331082700.52299@welearn.com.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Tue, 31 Mar 1998, Sue Blake wrote: > There was a survey? Somebody wants to know? Done quite some time ago. Yes, I want to know. If a system design isn't ultimately rooted in what real users actually need, then what good is it? > As a user I only see one > problem with the archive search: it doesn't find what I ask for. Well, you hit the perennial information retrieval problem smack on the head. To paraphrase an article from Wired Magazine some time ago, it is the type of problem many in the field of computer science feel could be solved over lunch if they put their mind to it. 
Well, all of us over in information retrieval (as a distinct discipline) eagerly await the solutions from those computer scientists! Artificial intelligence seems to be the only CS branch with a realistic understanding of the problem difficulty. > Example 1: Yesterday cron said "Cannot fork" which was meaningless, even > after looking at the cron-related man pages and trying apropos fork. So I > searched for "cannot and fork" and nothing came back. "cron and fork" came > up with a bunch of stuff which didn't relate to cron at all but mentioned > "fork" in entirely different contexts, often including the words "cannot > fork" which the previous search had failed to see. I became very frustrated, > started shutting things down, cron sprang to life and the penny dropped :-) Two possible things here. One, you may have hit some bugs in the search engine with the "cannot and fork" query. The search engine uses a vector space model with some boolean extensions crudely patched in and the two mechanisms don't mesh that well. Second is a problem which I personally think is substantial but generally disregarded by IR researchers. The more advanced and better performing search mechanisms involve complex algorithms which are opaque to the user. Thus, users can be puzzled and surprised by the results of a seemingly straightforward query. Furthermore, users can have considerable difficulty repairing a failed query when their experiments with repair lead to seemingly unpredictable results. Contrast this with a simple boolean mechanism. A shockingly large proportion of the general public cannot assemble a boolean query correctly, but for those that can, the justification for each document's presence in the result set is clear. When a query goes wrong, it is fairly straightforward to fix it and the fixes have predictable results. 
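The contrast between the two retrieval styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-message corpus, the function names, and the raw term-frequency cosine ranking are all invented here, and none of it describes how the FreeBSD archive search or freeWAIS is actually implemented.

```python
import math
from collections import Counter

# Hypothetical three-message corpus, invented purely for illustration.
DOCS = {
    1: "cron cannot fork new process",
    2: "fork a child process in the shell",
    3: "cron job schedule syntax",
}


def tokens(text):
    """Split text into lowercase whitespace-separated terms."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokens(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def cosine_rank(docs, query):
    """Rank documents by cosine similarity of raw term-frequency vectors."""
    q = Counter(tokens(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scored = []
    for doc_id, text in docs.items():
        d = Counter(tokens(text))
        dot = sum(q[t] * d[t] for t in q)
        if dot:  # skip documents sharing no term with the query
            d_norm = math.sqrt(sum(v * v for v in d.values()))
            scored.append((dot / (q_norm * d_norm), doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


INDEX = build_index(DOCS)
```

With the boolean query, a document is in the result set for exactly one reason: it contains every term. The ranked query returns partial matches as well, and their order depends on document length and term counts, which is the kind of behaviour that is harder for a user to predict or repair.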
Figuring out some good solutions to this is one of my on-going research interests, though currently in the context of information filtering where the fine tuning of queries is even more critical than in retrieval. > Example 2: In December I posted a question and received about 6 good replies, > which I promptly lost. In January I tried to search for them, over and over, > and could only find my original and one reply. Often searches reveal the > question but no answers can be found by any method, answers that I know have > been posted to -questions and contain the searched words. This is a deep problem in IR: by definition you cannot accurately describe what you are looking for. If you could, then you wouldn't need to look for it! Thus, a system based on calculating similarity between query and document is doomed. As you experienced, you can describe and thus retrieve what you already know, but what you want is to describe the perimeter that surrounds what you don't know and have the system find what is in the middle that is missing from your query. For this *particular* application, a thread index is exactly what you needed: you could find your original posting because you knew what was in it, then you trace the followups which you couldn't find by a keyword search. Nicholas Belkin and friends have published a number of interesting papers on the topic, with some proposed solutions. 
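The "find the original, then trace the followups" idea can be sketched as a small thread index. This is a minimal sketch, assuming each message carries a Message-ID and an optional In-Reply-To value; the sample messages and the names `build_thread_index` and `followups` are hypothetical and not taken from any existing archive software.

```python
from collections import defaultdict, deque

# Hypothetical header data: (message_id, in_reply_to, subject).
MESSAGES = [
    ("<q1@example.org>", None, "cron says Cannot fork"),
    ("<r1@example.org>", "<q1@example.org>", "Re: cron says Cannot fork"),
    ("<r2@example.org>", "<q1@example.org>", "Re: cron says Cannot fork"),
    ("<r3@example.org>", "<r1@example.org>", "Re: cron says Cannot fork"),
    ("<x1@example.org>", None, "unrelated thread"),
]


def build_thread_index(messages):
    """Map each message id to the ids of its direct replies."""
    children = defaultdict(list)
    for message_id, in_reply_to, _subject in messages:
        if in_reply_to is not None:
            children[in_reply_to].append(message_id)
    return children


def followups(children, root_id):
    """Return every reply below root_id, breadth first."""
    found = []
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            found.append(child)
            queue.append(child)
    return found
```

A keyword search only has to locate the original posting; the replies are then reached through the index, whether or not they contain the searched words.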
-john To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Mon Mar 30 16:18:28 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id QAA06954 for freebsd-database-outgoing; Mon, 30 Mar 1998 16:18:28 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from phoenix.welearn.com.au (suebla.lnk.telstra.net [139.130.44.81]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id QAA06932 for ; Mon, 30 Mar 1998 16:18:13 -0800 (PST) (envelope-from sue@phoenix.welearn.com.au) Received: (from sue@localhost) by phoenix.welearn.com.au (8.8.5/8.8.5) id KAA13231; Tue, 31 Mar 1998 10:15:40 +1000 (EST) Message-ID: <19980331101537.35326@welearn.com.au> Date: Tue, 31 Mar 1998 10:15:37 +1000 From: Sue Blake To: John Fieber Cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980331082700.52299@welearn.com.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.88e In-Reply-To: ; from John Fieber on Mon, Mar 30, 1998 at 06:13:05PM -0500 Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 06:13:05PM -0500, John Fieber wrote: > On Tue, 31 Mar 1998, Sue Blake wrote: > > Example 2: In December I posted a question and received about 6 good replies, > > which I promptly lost. In January I tried to search for them, over and over, > > and could only find my original and one reply. Often searches reveal the > > question but no answers can be found by any method, answers that I know have > > been posted to -questions and contain the searched words. > > This is a deep problem in IR: by definition you cannot accurately > describe what you are looking for. If you could, then you > wouldn't need to look for it! Thus, a system based on > calculating similarity between query and document is doomed. 
As > you experienced, you can describe and thus retrieve what you > already know, but what you want is to describe the perimeter that > surrounds what you don't know and have the system find what is in > the middle that is missing from your query. > > For this *particular* application, a thread index is exactly what > you needed: you could find your original posting because you knew > what was in it, then you trace the followups which you couldn't > find by a keyword search. The problem you describe is the one I met in the first example when trying to use the man pages: I didn't know what it was about so I couldn't look for it. With threading I would still need to know the unknown to find an entry point to the thread, then from that point on it might be easier. But usually I can't get that far, unless there's an error message to search. You can't do much about that. In addition I would need to know damn well that what I search for will be found if it is there. You can do a lot about that. For the second problem (above) that was not the case at all. My post to -questions had one or two uncommon words in the subject, as did the replies. The replies quoted, at minimum, a particular couple of lines of my original post. Searching for words appearing in either of these places should have produced some result, I thought. Searching for all or part of my own name failed too. The problem in the second example is quite different to the one you mention. I don't think it's my problem, and even if threads were available I would have expected the techniques I had used to have worked in this case. If something is in there and I can name it, I do expect to get it. Whether that's a problem of design, lost data, bugs, or education makes not a scrap of difference out here. If the basic search part cannot be made to work nothing else will help much. 
-- Regards, -*Sue*- find / -name "*.conf" |more To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Tue Mar 31 00:31:48 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id AAA12338 for freebsd-database-outgoing; Tue, 31 Mar 1998 00:31:48 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from ns1.yes.no (ns1.yes.no [195.119.24.10]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id AAA12301 for ; Tue, 31 Mar 1998 00:31:38 -0800 (PST) (envelope-from eivind@bitbox.follo.net) Received: from bitbox.follo.net (bitbox.follo.net [194.198.43.36]) by ns1.yes.no (8.8.7/8.8.7) with ESMTP id JAA00814; Tue, 31 Mar 1998 09:30:46 GMT Received: (from eivind@localhost) by bitbox.follo.net (8.8.8/8.8.6) id KAA07899; Tue, 31 Mar 1998 10:31:18 +0200 (MET DST) Message-ID: <19980331102843.63722@follo.net> Date: Tue, 31 Mar 1998 10:28:43 +0200 From: Eivind Eklund To: John Fieber , Sue Blake Cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980331082700.52299@welearn.com.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.89.1i In-Reply-To: ; from John Fieber on Mon, Mar 30, 1998 at 06:13:05PM -0500 Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 06:13:05PM -0500, John Fieber wrote: > Two possible things here. One, you may have hit some bugs in the > search engine with the "cannot and fork" query. The search > engine uses a vector space model with some boolean extensions > crudely patched in and the two mechanisms don't mesh that well. Does this mean my impression that my boolean queries "just don't work" is correct? I seem to always be retrieving half of what I thought I should. 
Oh, BTW: Your discipline is trivial - given unlimited CPU-power, unlimited bandwidth, and unlimited manpower to implement the solution, I'm quite certain I could make something much better than what is available today ;-) (The IR-model I hope will be able to interface-wise cope with large information spaces is a projection of a multi-variate analysis space, where you initially go by categories/keywords, and then do space-manipulations when you're somewhere in the vicinity of your target. Did I remember to repeat the 'unlimited resources' part?) Eivind. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Tue Mar 31 04:49:50 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id EAA08645 for freebsd-database-outgoing; Tue, 31 Mar 1998 04:49:50 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id EAA08639 for ; Tue, 31 Mar 1998 04:49:47 -0800 (PST) (envelope-from nik@iii.co.uk) From: nik@iii.co.uk Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25]) by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id NAA10043; Tue, 31 Mar 1998 13:49:45 +0100 (BST) Received: (from nik@localhost) by carrig.strand.iii.co.uk (8.8.8/8.8.7) id NAA08533; Tue, 31 Mar 1998 13:49:29 +0100 (BST) Message-ID: <19980331134928.03855@iii.co.uk> Date: Tue, 31 Mar 1998 13:49:28 +0100 To: shimon@simon-shapiro.org Cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980330164024.47510@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.85e In-Reply-To: ; from Simon Shapiro on Mon, Mar 30, 1998 at 12:25:00PM -0800 Organization: interactive investor Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Mon, Mar 30, 1998 at 12:25:00PM -0800, Simon Shapiro 
wrote: > > The time above was for creation of the HTML archives and for indexing, > > not just indexing alone. > > This is something we need to keep in mind. Generating 100% output coverage > for (probably) less than 10% need is wasteful. True enough. I'm probably going to shut up in a second, because this is starting to get out of my depth. My background in this is as follows; at work I run a number of mailing lists (~1000 subscribers across all of them, and ~150 messages a day tops, so this is not close to the level of the FreeBSD lists). Getting a publicly available archive has been on my to-do list for a while. The quick solution[1] has involved MHonArc, Glimpse and Wilma (as the glue between the two of them). As a part of that implementation, I've had to make a bunch of changes to Wilma to make it more generic and remove some of the assumptions in the code. At the back of my mind was the thought that this would be quite useful for the FreeBSD site, which is about where I started paying attention to the messages in the pgaccess thread (!) about the mailing list search. It certainly looks like Wolfram and co. have been making a number of changes to the list archive software to make it more useful (and as John says, things like MHonArc and Glimpse will not scale well, although they're fine for my current problem) > > I'd try this with the whole of the archives, but I don't have the spare > > disk space (yet). > > I have. Is there an efficient way to get the whole archive here? > Downloading on a modem is NOT considered efficient. Don't know. I FTP it down at work, drop it on a Zip disk and take it home. > > A hybrid system is on my list of things to build here (but it'll be > > Oracle based). I haven't investigated Postgres enough to know if it's > > up to the task. > > Oracle based is good. Now, please tell us how to run Oracle on FreeBSD, > legally, and with source available. You can't (yet). 
But should I get the chance to implement something Oracle based at work, I should be able to at least apply that learning to reimplementing it in Postgres.

> PostgreSQL is up to the task. This is not a dramatically complex database
> problem. Pretty much a linear table, with the text searching TBD.

My initial thoughts on this (and I have done *very* little reading on the subject at the moment) is to use the database only for the threading information. Store all the message information somewhere else, with an efficient way to

  a) retrieve a single message from the message archive (based on a unique key, possibly the message-id (although I'm not 100% certain that that's guaranteed to be unique)).

  b) Search the entire message archive, returning the message keys that match the search.

In the (Oracle|Postgres) database, have one table per archive that has the columns:

  ID      - Unique ID for this message
  PARENT  - ID of the parent message in this thread, NULL if this is the first message in the thread.
  STRICT  - Is this in the thread because of strict threading (a references or in-reply-to line was present in the original message) or because the subject line matched?
  AUTHOR
  SUBJECT

Given a particular message ID, you could then construct the thread tree from that message down (in Oracle) with SQL similar to:

  select ID, AUTHOR, SUBJECT
  from ARCHIVE_HACKERS
  connect by PARENT = prior ID
  start with ID = 'message_id';

I haven't tested that. As I say, I'm still in the very early planning stages of this, and have a lot of reading (and thinking) to do first.

N

[1] I don't like implementing 'quick solutions' any more than the rest of you, but I was outvoted on this one.

--
Work: nik@iii.co.uk                    | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need
Play: nik@freebsd.org                  | Microsoft?
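The table and tree query outlined in this message can be tried out with a small sketch. It rests on several assumptions: SQLite (through Python's standard `sqlite3` module) stands in for Oracle or Postgres, a recursive common table expression replaces Oracle's `connect by`, the column types are guesses, and the sample rows are invented. The `STRICT` column name is quoted only to keep it from being read as a keyword.

```python
import sqlite3


def make_archive():
    """Create an in-memory table shaped like the one proposed above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE ARCHIVE_HACKERS (
            ID       TEXT PRIMARY KEY,
            PARENT   TEXT REFERENCES ARCHIVE_HACKERS(ID),
            "STRICT" INTEGER NOT NULL,  -- 1 = header match, 0 = subject match
            AUTHOR   TEXT,
            SUBJECT  TEXT
        )
        """
    )
    # Hypothetical sample rows, invented for illustration only.
    rows = [
        ("m1", None, 1, "sue", "Cannot fork"),
        ("m2", "m1", 1, "john", "Re: Cannot fork"),
        ("m3", "m2", 1, "sue", "Re: Cannot fork"),
        ("m4", "m1", 0, "nik", "Cannot fork"),
        ("m5", None, 1, "simon", "Another thread"),
    ]
    conn.executemany("INSERT INTO ARCHIVE_HACKERS VALUES (?, ?, ?, ?, ?)", rows)
    return conn


# Walk from the starting message down through PARENT links.
THREAD_SQL = """
WITH RECURSIVE thread(ID, AUTHOR, SUBJECT, depth) AS (
    SELECT ID, AUTHOR, SUBJECT, 0 FROM ARCHIVE_HACKERS WHERE ID = ?
    UNION ALL
    SELECT a.ID, a.AUTHOR, a.SUBJECT, t.depth + 1
    FROM ARCHIVE_HACKERS AS a JOIN thread AS t ON a.PARENT = t.ID
)
SELECT ID, AUTHOR, SUBJECT, depth FROM thread ORDER BY depth, ID
"""


def thread_from(conn, message_id):
    """Return (ID, AUTHOR, SUBJECT, depth) rows for the thread below message_id."""
    return conn.execute(THREAD_SQL, (message_id,)).fetchall()
```

Calling `thread_from(make_archive(), "m1")` returns the starting message followed by its descendants, ordered by depth. Message bodies stay outside the table, as the outline suggests; only the threading metadata is stored.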
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Tue Mar 31 08:30:25 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id IAA14886 for freebsd-database-outgoing; Tue, 31 Mar 1998 08:30:25 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id IAA14881 for ; Tue, 31 Mar 1998 08:30:21 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 11305 invoked from network); 31 Mar 1998 16:39:41 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 31 Mar 1998 16:39:41 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <19980331134928.03855@iii.co.uk> Date: Tue, 31 Mar 1998 08:39:41 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: nik@iii.co.uk Subject: Re: Mailing list search interface Cc: freebsd-database@FreeBSD.ORG Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 31-Mar-98 nik@iii.co.uk wrote: ... > I'm probably going to shut up in a second, because this is starting to > get out of my depth. Nah. Your input IS valuable. > It certainly looks like Wolfram and co. have been making a number of > changes to the list archive software to make it more useful (and as John > says, things like MHonArc and Glimpse will not scale well, although > they're fine for my current problem) There are several issues for you to consider: a. Scalability; Rarely linear. Typically exponentially degrading. 
A solution that is lightning fast, economical and altogether wonderful at one scale will fail miserably at another. b. Personal Taste; Some people love Oracle, some hate it. Some love Informix, some hate it. No way around that one. Arguing and demonstrating can help, sometimes. c. Familiarity; Most Unix types are very familiar with text processing. Very few are truly versed in DBMS (Relational or otherwise). Also, if you have a system which produces comfortable results on 5 out of 6 issues, it is difficult to agree to switch the whole system. And rightly so. The missing functionality may not be critical, the fear of losing the first five, etc. are all valid considerations. As I said, my take on this issue is that I would like to try and prototype it into a database, a SQL-based one. I am NOT saying it will be better, nor faster, as I simply do not know yet. I like the RDBMS option as it potentially has many runtime advantages. RDBMS tend to pay back on LARGE databases of complex nature. This is NOT always true, as a poor implementation can screw up beyond comprehension. >> Oracle based is good. Now, please tell us how to run Oracle on FreeBSD, >> legally, and with source available. > > You can't (yet). But should I get the chance to implement something > Oracle based at work, I should be able to at least apply that learning > to reimplementing it in Postgres. Almost True. But, maybe I get the time and do it in Postgres. I think I know how to do it in Oracle already. Then you can learn from my experience :-) ... > My initial thoughts on this (and I have done *very* little reading on the > subject at the moment) is to use the database only for the threading > information. Store all the message information somewhere else, with an > efficient way to This is not necessarily true. .. I quite disagree with the design as you outlined it. It would have been OK for DBM or some such. I'll get back to you with my humble opinion, once I complete it. 
> [1] I don't like implementing 'quick solutions' any more than the rest of > you, but I was outvoted on this one. You were not outvoted, as there was no vote. ---------- Sincerely Yours, Simon Shapiro Shimon@Simon-Shapiro.ORG Voice: 503.799.2313 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Tue Mar 31 08:49:06 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id IAA18716 for freebsd-database-outgoing; Tue, 31 Mar 1998 08:49:06 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from tyree.iii.co.uk (tyree.iii.co.uk [195.89.149.230]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id IAA18683 for ; Tue, 31 Mar 1998 08:48:58 -0800 (PST) (envelope-from nik@iii.co.uk) From: nik@iii.co.uk Received: from carrig.strand.iii.co.uk (carrig.strand.iii.co.uk [192.168.7.25]) by tyree.iii.co.uk (8.8.8/8.8.8) with ESMTP id RAA01346; Tue, 31 Mar 1998 17:48:44 +0100 (BST) Received: (from nik@localhost) by carrig.strand.iii.co.uk (8.8.8/8.8.7) id RAA08984; Tue, 31 Mar 1998 17:48:28 +0100 (BST) Message-ID: <19980331174827.40133@iii.co.uk> Date: Tue, 31 Mar 1998 17:48:27 +0100 To: shimon@simon-shapiro.org Cc: freebsd-database@FreeBSD.ORG Subject: Re: Mailing list search interface References: <19980331134928.03855@iii.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.85e In-Reply-To: ; from Simon Shapiro on Tue, Mar 31, 1998 at 08:39:41AM -0800 Organization: interactive investor Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On Tue, Mar 31, 1998 at 08:39:41AM -0800, Simon Shapiro wrote: > b. Personal Taste; Some people love Oracle, some hate it. I'm ambivalent about it. It is, however, what I get to use at work. > I quite disagree with the design as you outlined it. It would have been OK > for DBM or some such. I'll get back to you with my humble opinion, once I > complete it. Great. 
It's appreciated. > > [1] I don't like implementing 'quick solutions' any more than the rest of > > you, but I was outvoted on this one. > > You were not outvoted, as there was no vote. Sorry, I should have been clearer. I meant I was outvoted at work. I've had nothing to do with the design and implementation of FreeBSD's mailing list archive search. And kudos to those who have. N -- Work: nik@iii.co.uk | FreeBSD + Perl + Apache Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need Play: nik@freebsd.org | Microsoft? To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Tue Mar 31 09:03:57 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id JAA21609 for freebsd-database-outgoing; Tue, 31 Mar 1998 09:03:57 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id JAA21604 for ; Tue, 31 Mar 1998 09:03:54 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 21526 invoked from network); 31 Mar 1998 17:13:15 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 31 Mar 1998 17:13:15 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <19980331174827.40133@iii.co.uk> Date: Tue, 31 Mar 1998 09:13:15 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: nik@iii.co.uk Subject: Re: Mailing list search interface Cc: freebsd-database@FreeBSD.ORG Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 31-Mar-98 nik@iii.co.uk wrote: > On Tue, Mar 31, 1998 at 08:39:41AM -0800, Simon Shapiro 
wrote: >> b. Personal Taste; Some people love Oracle, some hate it. > > I'm ambivalent about it. It is, however, what I get to use at work. Don't misunderstand me. It is a GOOD RDBMS. Sort of scary when you look at the source, but amazingly functional. >> I quite disagree with the design as you outlined it. It would have been >> OK >> for DBM or some such. I'll get back to you with my humble opinion, once >> I >> complete it. > > Great. It's appreciated. I'll be looking at it in the next few weeks. Too much going on right now. >> You were not outvoted, as there was no vote. > > Sorry, I should have been clearer. I meant I was outvoted at work. I've > had nothing to do with the design and implementation of FreeBSD's mailing > list archive search. And kudos to those who have. Do not feel bad. Happens to us all. Simon To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Wed Apr 1 05:59:58 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id FAA01923 for freebsd-database-outgoing; Wed, 1 Apr 1998 05:59:58 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from mail.actrix.gen.nz (root@mail.actrix.gen.nz [203.96.16.37]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id FAA01914 for ; Wed, 1 Apr 1998 05:59:54 -0800 (PST) (envelope-from andrew@squiz.co.nz) Received: from [192.168.1.1] (aniwa.actrix.gen.nz [203.96.56.186]) by mail.actrix.gen.nz (8.8.8/8.8.5) with SMTP id BAA06318 for ; Thu, 2 Apr 1998 01:59:44 +1200 (NZST) X-Sender: squiz1@mail.actrix.gen.nz Message-Id: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Thu, 2 Apr 1998 02:01:41 +1200 To: freebsd-database@FreeBSD.ORG From: andrew@squiz.co.nz (Andrew McNaughton) Subject: Re: Mailing list search interface Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk I'm currently re-building a current affairs news web site to use a mysql database. 
The kinds of search issues discussed regarding the freebsd mailing list archives are among the issues I'm facing, and if the project goes ahead as discussed here, I'd be interested in taking part (though I suspect I'm probably outclassed by other programmers here). As I say, so far I'm using mysql. It seems the discussion here has been about postgreSQL with references to oracle. Can someone give me or point me to a comparison between postgreSQL and mysql? Is the emphasis on postgreSQL due to licensing differences or functional ones? A quick look over the postgreSQL suggests that it has several query language features that mysql lacks (eg sub queries), and the docs place more emphasis on extension. It appears it does not do multithreading. How do they compare on speed? Postgres used to be slow, but I gather that postgreSQL is better. Andrew McNaughton DISCLAIMER: The Entire Physical Universe, Including Andrew McNaughton This Message, May One Day Collapse Back into an ++64 4 389 6891 Infinitesimally Small Space. Should Another Universe andrew@squiz.co.nz Subsequently Re-emerge, the Validity of Statements http://www.squiz.co.nz in This Message Cannot Be Guaranteed. 
http://www.newsroom.co.nz To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message From owner-freebsd-database Wed Apr 1 09:16:52 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id JAA08706 for freebsd-database-outgoing; Wed, 1 Apr 1998 09:16:52 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id JAA08477 for ; Wed, 1 Apr 1998 09:15:38 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 743 invoked from network); 1 Apr 1998 17:22:52 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 1 Apr 1998 17:22:52 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032998 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Wed, 01 Apr 1998 09:22:51 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: (Andrew McNaughton) Subject: Re: Mailing list search interface Cc: freebsd-database@FreeBSD.ORG Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 01-Apr-98 Andrew McNaughton wrote: > As I say, so far I'm using mysql. It seems the discussion here has been > about postgreSQL with references to oracle. Can someone give me or point > me to a comparison between postgreSQL and mysql. Print the docs for both, take a day off and read :-) Build both, install both, and run both. > Is the emphasis on postgreSQL due to licensing differences or functional > ones? For me, it is only one of the issues. The others are that Postgres is a much easier candidate for enhancements towards High Availability Server. 
> A quick look over the postgreSQL suggests that it has several query > language features that mysql lacks (eg sub queries), and the docs place > more emphasis on extension. It appears it does not do multithreading. Very true. Postgres is very extendable, both in the upper layer (new types, indices, methods, rules, etc.) and the logic layer (completeness of the relational model and SQL compliance). It does not do multithreading, but it does multi-processing. Each database connection is a process. A front-end (application) can maintain multiple database connections. Since the interaction between database access threads is via shared memory, this is not a problem. The lack of threads (probably historical) forced the definition of an independent lock manager, and an independent and modular storage manager. This allows people like myself to easily develop different locking and storage facilities, which, again, allow me to contemplate clustered, multi-host access. > How do they compare on speed? Postgres used to be slow, but I gather > that postgreSQL is better. If your functionality can be served by mysql, you probably should use it. It is smaller, simpler, and thus probably faster. If you need the functionality or features that Postgres offers, then sheer speed is not the only consideration. I suspect most of the speed issue will be resolved with the new lock manager I wrote, and the new storage manager I am writing. 
Simon

From owner-freebsd-database Fri Apr 3 21:18:46 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id VAA27047 for freebsd-database-outgoing; Fri, 3 Apr 1998 21:18:46 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from relay.nuxi.com (nuxi.cs.ucdavis.edu [128.120.56.38]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id VAA27039; Fri, 3 Apr 1998 21:18:43 -0800 (PST) (envelope-from obrien@dragon.nuxi.com) Received: from dragon.nuxi.com (d60-090.leach.ucdavis.edu [169.237.60.90]) by relay.nuxi.com (8.8.7/8.6.12) with ESMTP id VAA12960; Fri, 3 Apr 1998 21:18:37 -0800 (PST) Received: (from obrien@localhost) by dragon.nuxi.com (8.8.8/8.7.3) id FAA05989; Sat, 4 Apr 1998 05:18:35 GMT Message-ID: <19980403211835.48113@nuxi.com> Date: Fri, 3 Apr 1998 21:18:35 -0800 From: "David O'Brien" To: Eivind Eklund Cc: John Fieber , Amancio Hasty , ports@FreeBSD.ORG, freebsd-database@FreeBSD.ORG Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Reply-To: obrien@NUXI.com References: <199803250450.UAA27558@rah.star-gate.com> <19980325162504.05717@follo.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.88 In-Reply-To: <19980325162504.05717@follo.net>; from Eivind Eklund on Wed, Mar 25, 1998 at 04:25:04PM +0100 X-Warning: Mutt Bites! X-Operating-System: FreeBSD 2.2.6-STABLE Organization: The NUXI *BSD group X-PGP-Fingerprint: B7 4D 3E E9 11 39 5F A3 90 76 5D 69 58 D9 98 7A X-Pgp-Keyid: 34F9F9D5 Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk

> www.findmail.com has offered to archive the lists. They have all the databases and stuff needed, and I'd guess they also could take the old message log.

Yes, we have all the old mailing list traffic archived on hub:/home/mail/archive. That should be all they would need.
--
-- David (obrien@NUXI.com -or- obrien@FreeBSD.org)