From owner-freebsd-www  Fri Aug 29 06:53:03 1997
Return-Path: <owner-freebsd-www>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.7/8.8.7) id GAA04346
          for www-outgoing; Fri, 29 Aug 1997 06:53:03 -0700 (PDT)
Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1])
          by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id GAA04341
          for <www@FreeBSD.ORG>; Fri, 29 Aug 1997 06:52:58 -0700 (PDT)
Received: from localhost (jfieber@localhost)
	by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id IAA13436;
	Fri, 29 Aug 1997 08:52:38 -0500 (EST)
Date: Fri, 29 Aug 1997 08:52:38 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
To: Stefan Bethke <stefan@promo.de>
cc: "Jordan K. Hubbard" <jkh@time.cdrom.com>, www@FreeBSD.ORG
Subject: Re: Something I've always wanted to see with the mailing list search
In-Reply-To: <l03102801b02c54a7fe9c@[194.45.188.81]>
Message-ID: <Pine.BSF.3.96.970829075012.341E-100000@fallout.campusview.indiana.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-www@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

On Fri, 29 Aug 1997, Stefan Bethke wrote:

> At 12:00 +0200 on 29.08.1997, Jordan K. Hubbard wrote:

> >Is thread preservation, e.g. when I do a search for "laptop AND IBM"

"Preservation" isn't quite the right term here...it is more like
thread discovery.  A thread is primarily a human construct that
is only loosely represented in the actual messages by the
In-Reply-To, References and Subject headers. DejaNews does the
best job that I've seen, but it still leaves something to be desired.

For a good discussion of the issues, see:

  David D. Lewis and Kimberly A. Knowles (1997) Threading
  electronic mail: a preliminary study.  Information Processing &
  Management, 33(2):209-217.

It turns out that breaking messages down into quoted and
unquoted chunks, indexing them separately, and using vector space
similarity measures (what freeWAIS uses) for retrieval is more
accurate in retrieving what a human would consider to be a
message thread than following subject lines, in-reply-to or
references fields. In the absence of those fields, it is really
the only way to discover a thread. 
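
A minimal sketch of that idea, under some loud assumptions: quoted
text is whatever starts with ">", terms are weighted by raw counts
rather than whatever weighting freeWAIS or the paper applies, and
the function names (split_quoted, reply_score) are made up for
illustration.  It scores a candidate reply against a candidate
parent by comparing the reply's quoted chunk with the parent's
unquoted chunk using cosine similarity:

```python
import math
import re
from collections import Counter

def split_quoted(body):
    """Split a body into (unquoted, quoted) text using the '>' convention."""
    unquoted, quoted = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Drop the quote markers (including nested ones like ">> >").
            quoted.append(stripped.lstrip("> "))
        else:
            unquoted.append(line)
    return "\n".join(unquoted), "\n".join(quoted)

def term_vector(text):
    """Bag-of-words vector: lowercase alphanumeric terms with raw counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def reply_score(candidate_reply, candidate_parent):
    """How well the reply's quoted text matches the parent's own text."""
    _, reply_quoted = split_quoted(candidate_reply)
    parent_unquoted, _ = split_quoted(candidate_parent)
    return cosine(term_vector(reply_quoted), term_vector(parent_unquoted))
```

Picking the highest-scoring earlier message as the parent gives a
thread link without looking at any header.  A real index would
want inverse document frequency weighting and a score threshold.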

> Yes, definitely. Something like DejaNews would be nice. I just haven't found
> the time to do anything, but my idea is like this:
> 
> To find articles fast, you need to keep an index (into the raw files).
> Probably by message-id. This index would also contain the basic info about
> an article, such as subject, author, date, thread-id.

I've been puttering around with using postgres for thread
indexing using these fields, but have not spent much time on it.
Some preliminary experiments with freeWAIS (used for keyword
searching) show some possibility, but even simple date
restrictions in searches really stretch the boundaries of what
freeWAIS was designed to do efficiently.
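
To make the division of labor concrete, here is a rough sketch of
the relational half.  It is not the postgres experiment itself:
it uses Python's bundled sqlite3 module as a stand-in, and the
table layout, column names and function names are all invented
for illustration.  The point is only that a date restriction is
trivial for a relational index keyed by message-id, and the result
can then be intersected with keyword hits from the text engine:

```python
import sqlite3

def build_index(messages):
    """Create an in-memory message index.

    Each entry of `messages` is a tuple:
    (message_id, in_reply_to, subject, author, sent, file_offset)
    where `sent` is an ISO 8601 string, so string order is date order.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE message ("
        " message_id TEXT PRIMARY KEY,"
        " in_reply_to TEXT,"
        " subject TEXT,"
        " author TEXT,"
        " sent TEXT,"
        " file_offset INTEGER)"
    )
    db.execute("CREATE INDEX message_sent ON message (sent)")
    db.executemany("INSERT INTO message VALUES (?, ?, ?, ?, ?, ?)", messages)
    return db

def ids_between(db, start, end):
    """Message-ids sent in the half-open interval [start, end)."""
    rows = db.execute(
        "SELECT message_id FROM message"
        " WHERE sent >= ? AND sent < ? ORDER BY sent",
        (start, end),
    )
    return [row[0] for row in rows]

def restrict_by_date(keyword_hits, db, start, end):
    """Keep only the keyword hits (message-ids) that fall in the date range."""
    allowed = set(ids_between(db, start, end))
    return [mid for mid in keyword_hits if mid in allowed]
```

Here keyword_hits stands for whatever list of message-ids the
keyword search returns, in ranked order; the ranking is preserved.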

As for constructing threads at index time, this may be best for
efficiency but extra care must be given to how threads are
represented.  For example, an "in-reply-to" linked tree may
contain several distinct, but related threads.  It should be
possible to get at the sub-threads individually, as well as the
larger thread.  This means that any message may have multiple
thread membership, either directly or indirectly via some thread
record with pointers to parent/child threads.  Ultimately, I
would hope for thread discovery at search time rather than
indexing time because it offers much more flexibility in tweaking
various dimensions of the thread concept--broadening or narrowing
the boundaries, building threads that cross boundaries between
in-reply-to message trees, etc....
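
As a small illustration of getting at sub-threads as well as the
larger thread, the sketch below treats the in-reply-to links as a
tree and lets any message serve as the root of its own sub-thread.
The data layout (a dict from message-id to parent id) and the
function names are assumptions for the example, and it covers only
the in-reply-to dimension, not the content-based one:

```python
from collections import defaultdict

def build_children(parents):
    """Invert a {message_id: in_reply_to or None} map into parent -> replies."""
    children = defaultdict(list)
    for msg_id, parent in parents.items():
        if parent is not None:
            children[parent].append(msg_id)
    return children

def subthread(children, root):
    """All message-ids in the tree rooted at `root`, depth first."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        # Reverse so replies are visited in their original order.
        stack.extend(reversed(children.get(node, [])))
    return out

def root_of(parents, msg_id):
    """Follow in-reply-to links upward to the top of the larger thread."""
    seen = set()
    while parents.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)  # guard against malformed, cyclic headers
        msg_id = parents[msg_id]
    return msg_id
```

With this shape a message belongs to every sub-thread rooted at
one of its ancestors, so multiple membership falls out of the
structure without storing a separate thread record per message.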

What makes this problem difficult is the lack of off-the-shelf
software.  (And things like hypermail that don't scale are
disqualified.) A relational sort of database is necessary for
some parts of the problem and a text database for others, but
neither one is sufficient on its own.

The world would benefit greatly from good message threading
software.  Today, the individual message is the unit of
retrieval, but from a human point of view, the thread should be. 

-john