Date: Fri, 29 Aug 1997 08:52:38 -0500 (EST)
From: John Fieber
To: Stefan Bethke
cc: "Jordan K. Hubbard", www@FreeBSD.ORG
Subject: Re: Something I've always wanted to see with the mailing list search
Sender: owner-freebsd-www@FreeBSD.ORG

On Fri, 29 Aug 1997, Stefan Bethke wrote:

> At 12:00 +0200 on 29.08.1997, Jordan K. Hubbard wrote:
> >Is thread preservation, e.g. when I do a search for "laptop AND IBM"

"Preservation" isn't quite the right term here...it is more like
thread discovery. A thread is primarily a human construct that is
only loosely represented in the actual messages by the In-Reply-To
and References headers and the subject line. DejaNews does the best
job I've seen, but it still leaves something to be desired. For a
good discussion of the issues, see:

  David D. Lewis and Kimberly A. Knowles (1997). Threading
  electronic mail: a preliminary study. Information Processing &
  Management, 33(2):209-217.

It turns out that breaking messages down into quoted and unquoted
chunks, indexing them separately, and using vector space similarity
measures (what freeWAIS uses) for retrieval is more accurate at
retrieving what a human would consider to be a message thread than
following subject lines, In-Reply-To or References fields. In the
absence of those fields, it is really the only way to discover a
thread.

> Yes, definitely. Something like DejaNews would be nice. I just haven't found
> the time to do anything, but my idea is like this:
>
> To find articles fast, you need to keep an index (into the raw files).
> Probably by message-id. This index would also contain the basic info about
> an article, such as subject, author, date, thread-id.

I've been puttering around with using postgres for thread indexing
on these fields, but have not spent much time on it. Some
preliminary experiments with freeWAIS (used for keyword searching)
show some possibility, but even simple date restrictions in searches
really stretch the boundaries of what freeWAIS was designed to do
efficiently.

As for constructing threads at index time, this may be best for
efficiency, but extra care must be given to how threads are
represented. For example, an In-Reply-To linked tree may contain
several distinct but related threads. It should be possible to get
at the sub-threads individually, as well as at the larger thread.
This means that any message may belong to multiple threads, either
directly or indirectly via some thread record with pointers to
parent/child threads.

Ultimately, I would hope for thread discovery at search time rather
than at indexing time, because it offers much more flexibility in
tweaking various dimensions of the thread concept: broadening or
narrowing the boundaries, building threads that cross the boundaries
between In-Reply-To message trees, and so on.

What makes this problem difficult is the lack of off-the-shelf
software. (And things like hypermail that don't scale are
disqualified.) A relational sort of database is necessary for some
parts of the problem and a text database for others, but neither one
is sufficient on its own.

The world would benefit greatly from good message threading
software. Today the individual message is the unit of retrieval,
but from a human point of view the thread should be.

-john
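
The quoted/unquoted idea mentioned above can be illustrated with a
rough sketch. This is not the method from the Lewis and Knowles paper
and not how freeWAIS works internally; it is a minimal example that
assumes ">"-style quoting, raw term counts instead of proper term
weighting, and cosine similarity as the vector space measure. The
function names and the `candidates` mapping (message id to body text)
are made up for illustration.

```python
import math
import re
from collections import Counter


def split_chunks(body):
    """Split a message body into (unquoted, quoted) text by leading '>' marks."""
    unquoted, quoted = [], []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            # Drop every level of quote marker, e.g. "> >text" -> "text".
            quoted.append(re.sub(r"^(\s*>)+\s?", "", line))
        else:
            unquoted.append(line)
    return "\n".join(unquoted), "\n".join(quoted)


def term_vector(text):
    """Raw term-frequency vector (assumption: no stemming, no idf weights)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likely_parent(reply_body, candidates):
    """Guess which candidate message a reply answers, without using headers.

    The reply's quoted text is used as the query and is compared against
    the unquoted text of each candidate. Returns a message id, or None if
    the reply quotes nothing or nothing overlaps.
    """
    _, quoted = split_chunks(reply_body)
    query = term_vector(quoted)
    if not query:
        return None
    scores = {
        msg_id: cosine(query, term_vector(split_chunks(body)[0]))
        for msg_id, body in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling likely_parent() with a reply body and a dictionary of earlier
messages returns the id of the best-matching earlier message. A real
index would want weighting, stop words and a similarity threshold, all
of which are left out here.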