Date:      Mon, 30 Mar 1998 18:13:05 -0500 (EST)
From:      John Fieber <jfieber@indiana.edu>
To:        Sue Blake <sue@welearn.com.au>
Cc:        freebsd-database@FreeBSD.ORG
Subject:   Re: Mailing list search interface
Message-ID:  <Pine.BSF.3.96.980330173718.8294D-100000@fallout.campusview.indiana.edu>
In-Reply-To: <19980331082700.52299@welearn.com.au>

On Tue, 31 Mar 1998, Sue Blake wrote:

> There was a survey? Somebody wants to know?

Done quite some time ago.  Yes, I want to know.  If a system
design isn't ultimately rooted in what real users actually need,
then what good is it? 

> As a user I only see one
> problem with the archive search: it doesn't find what I ask for.

You hit the perennial information retrieval problem smack on the
head.  To paraphrase an article from Wired Magazine some time
ago, it is the type of problem many in the field of computer
science feel could be solved over lunch if they put their minds
to it.  Well, all of us over in information retrieval (as a
distinct discipline) eagerly await the solutions from those
computer scientists!  Artificial intelligence seems to be the
only CS branch with a realistic understanding of how difficult
the problem is.

> Example 1: Yesterday cron said "Cannot fork" which was meaningless, even
> after looking at the cron-related man pages and trying apropos fork. So I
> searched for "cannot and fork" and nothing came back. "cron and fork" came
> up with a bunch of stuff which didn't relate to cron at all but mentioned
> "fork" in entirely different contexts, often including the words "cannot
> fork" which the previous search had failed to see. I became very frustrated,
> started shutting things down, cron sprang to life and the penny dropped :-)

Two things may be going on here.  First, you may have hit a bug
in the search engine with the "cannot and fork" query.  The
search engine uses a vector space model with some boolean
extensions crudely patched in, and the two mechanisms don't mesh
that well.
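
To make the vector space half of that concrete, here is a minimal
sketch in Python.  It is not the archive's search engine: the toy
messages, the raw term-count weighting, and the names tokenize,
cosine, and rank are all assumptions made for illustration.  The
point is that similarity ranking has no notion of a required
term, so an "and" has to be bolted on separately.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real engine would also stem and drop stopwords.
    return text.lower().split()

def cosine(query_terms, doc_terms):
    # Cosine similarity between raw term-count vectors.
    q = Counter(query_terms)
    d = Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    # Return names of documents with a non-zero score, best match first.
    q = tokenize(query)
    scored = [(cosine(q, tokenize(text)), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

# Hypothetical messages, invented for this example.
DOCS = {
    "cron-report": "cron cannot fork",
    "kernel-note": "fork fork fork in the kernel",
    "cron-syntax": "cron schedule syntax",
}

results = rank("cron fork", DOCS)
```

In this toy collection "kernel-note" is returned for the query
"cron fork" even though it never mentions cron, because any
overlap at all with the query earns a score.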

Second is a problem which I personally think is substantial but
generally disregarded by IR researchers. The more advanced and
better performing search mechanisms involve complex
algorithms which are opaque to the user.  Thus, users can be
puzzled and surprised by the results of a seemingly
straightforward query.  Furthermore, users can have considerable
difficulty repairing a failed query when their experiments with
repair lead to seemingly unpredictable results.

Contrast this with a simple boolean mechanism.  A shockingly
large proportion of the general public cannot assemble a boolean
query correctly, but for those who can, the justification for
each document's presence in the result set is clear.  When a
query goes wrong, it is fairly straightforward to fix it and the
fixes have predictable results.
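
For comparison, a minimal sketch of a purely boolean "and" over
an inverted index, again in Python and again with invented
messages and hypothetical names (build_index, boolean_and).  A
document is in the result set only if it contains every query
term, so the reason for each hit can be read straight off the
query.

```python
def build_index(docs):
    # Map each term to the set of document names that contain it.
    index = {}
    for name, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(name)
    return index

def boolean_and(index, terms):
    # A document matches only if it appears in the posting set of every term.
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical messages, invented for this example.
MESSAGES = {
    "msg1": "cron says cannot fork",
    "msg2": "fork bomb in the kernel",
    "msg3": "cron schedule syntax",
}

INDEX = build_index(MESSAGES)
hits = boolean_and(INDEX, ["cannot", "fork"])
```

Dropping "cannot" from the query can only widen the result set,
and adding a term can only narrow it, which is what makes repairs
predictable.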

Figuring out some good solutions to this is one of my ongoing
research interests, though currently in the context of
information filtering where the fine tuning of queries is even
more critical than in retrieval.

> Example 2: In December I posted a question and received about 6 good replies,
> which I promptly lost. In January I tried to search for them, over and over,
> and could only find my original and one reply. Often searches reveal the
> question but no answers can be found by any method, answers that I know have
> been posted to -questions and contain the searched words.

This is a deep problem in IR: by definition you cannot accurately
describe what you are looking for.  If you could, then you
wouldn't need to look for it!  Thus, a system based on
calculating similarity between query and document is doomed.  As
you experienced, you can describe and thus retrieve what you
already know, but what you want is to describe the perimeter that
surrounds what you don't know and have the system find what is in
the middle that is missing from your query.

For this *particular* application, a thread index is exactly what
you needed: you could have found your original posting because
you knew what was in it, then traced the followups that a keyword
search couldn't find.
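
A thread index of that sort can be sketched from nothing more
than the Message-ID and In-Reply-To headers.  The following
Python is an illustration under that assumption, not a
description of the archive's code; the function names and the
sample ids are hypothetical.

```python
from collections import defaultdict, deque

def build_thread_index(messages):
    # messages: iterable of (message_id, in_reply_to) pairs, where
    # in_reply_to is None for a message that starts a thread.
    children = defaultdict(list)
    for message_id, in_reply_to in messages:
        if in_reply_to is not None:
            children[in_reply_to].append(message_id)
    return children

def followups(children, root):
    # Breadth-first walk from one message to every reply beneath it.
    found = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for reply in children.get(current, []):
            found.append(reply)
            queue.append(reply)
    return found

# Hypothetical thread: one question, two direct replies, one reply to a reply.
THREAD = [
    ("question", None),
    ("reply-1", "question"),
    ("reply-2", "question"),
    ("reply-3", "reply-1"),
]

replies = followups(build_thread_index(THREAD), "question")
```

Once a keyword search finds the original posting, the walk
recovers the replies without needing a single word from them.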

Nicholas Belkin and friends have published a number of
interesting papers on the topic, with some proposed solutions. 

-john




