Date: Mon, 30 Mar 1998 18:13:05 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
To: Sue Blake <sue@welearn.com.au>
Cc: freebsd-database@FreeBSD.ORG
Subject: Re: Mailing list search interface
Message-ID: <Pine.BSF.3.96.980330173718.8294D-100000@fallout.campusview.indiana.edu>
In-Reply-To: <19980331082700.52299@welearn.com.au>

On Tue, 31 Mar 1998, Sue Blake wrote:

> There was a survey? Somebody wants to know?

Done quite some time ago. Yes, I want to know. If a system design
isn't ultimately rooted in what real users actually need, then what
good is it?

> As a user I only see one
> problem with the archive search: it doesn't find what I ask for.

Well, you hit the perennial information retrieval problem smack on the
head. To paraphrase an article from Wired Magazine some time ago, it
is the type of problem many in the field of computer science feel
could be solved over lunch if they put their minds to it. Well, all of
us over in information retrieval (as a distinct discipline) eagerly
await the solutions from those computer scientists! Artificial
intelligence seems to be the only CS branch with a realistic
understanding of the problem's difficulty.

> Example 1: Yesterday cron said "Cannot fork" which was meaningless, even
> after looking at the cron-related man pages and trying apropos fork. So I
> searched for "cannot and fork" and nothing came back. "cron and fork" came
> up with a bunch of stuff which didn't relate to cron at all but mentioned
> "fork" in entirely different contexts, often including the words "cannot
> fork" which the previous search had failed to see. I became very frustrated,
> started shutting things down, cron sprang to life and the penny dropped :-)

Two possible things here. First, you may have hit some bugs in the
search engine with the "cannot and fork" query. The search engine uses
a vector space model with some boolean extensions crudely patched in,
and the two mechanisms don't mesh that well.

Second is a problem which I personally think is substantial but
generally disregarded by IR researchers. The more advanced and better
performing search mechanisms involve complex algorithms which are
opaque to the user. Thus, users can be puzzled and surprised by the
results of a seemingly straightforward query.
Furthermore, users can have considerable difficulty repairing a failed
query when their experiments with repair lead to seemingly
unpredictable results.

Contrast this with a simple boolean mechanism. A shockingly large
proportion of the general public cannot assemble a boolean query
correctly, but for those who can, the justification for each
document's presence in the result set is clear. When a query goes
wrong, it is fairly straightforward to fix it, and the fixes have
predictable results.

Figuring out some good solutions to this is one of my ongoing
research interests, though currently in the context of information
filtering, where the fine tuning of queries is even more critical than
in retrieval.

> Example 2: In December I posted a question and received about 6 good replies,
> which I promptly lost. In January I tried to search for them, over and over,
> and could only find my original and one reply. Often searches reveal the
> question but no answers can be found by any method, answers that I know have
> been posted to -questions and contain the searched words.

This is a deep problem in IR: by definition you cannot accurately
describe what you are looking for. If you could, then you wouldn't
need to look for it! Thus, a system based on calculating similarity
between query and document is doomed. As you experienced, you can
describe and thus retrieve what you already know, but what you want is
to describe the perimeter that surrounds what you don't know and have
the system find what is in the middle that is missing from your query.

For this *particular* application, a thread index is exactly what you
needed: you could find your original posting because you knew what was
in it, then trace the followups which you couldn't find by a keyword
search.

Nicholas Belkin and friends have published a number of interesting
papers on the topic, with some proposed solutions.
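
As a rough illustration of the thread index idea (not a description of
how the archive's search engine is built), a minimal Python sketch
could follow In-Reply-To headers from a posting you can find to the
followups you cannot. The function names, the sample messages, and the
example.org Message-IDs are all hypothetical, and the sketch assumes
each In-Reply-To header holds exactly one Message-ID.

```python
from collections import defaultdict
from email import message_from_string


def build_thread_index(raw_messages):
    """Map each Message-ID to the Message-IDs of its direct replies."""
    replies = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = msg.get("Message-ID", "").strip()
        parent = msg.get("In-Reply-To", "").strip()
        # Assumption: In-Reply-To carries a single Message-ID and nothing else.
        if msg_id and parent:
            replies[parent].append(msg_id)
    return replies


def followups(replies, root_id):
    """Return every message below root_id in the thread, depth-first."""
    found, seen = [], set()
    stack = list(reversed(replies.get(root_id, [])))
    while stack:
        current = stack.pop()
        if current in seen:  # guard against malformed, cyclic headers
            continue
        seen.add(current)
        found.append(current)
        stack.extend(reversed(replies.get(current, [])))
    return found


# Hypothetical sample: one question, two direct replies, one nested reply.
SAMPLE = [
    "Message-ID: <q1@example.org>\nSubject: cron says Cannot fork\n\nWhy?\n",
    "Message-ID: <a1@example.org>\nIn-Reply-To: <q1@example.org>\n\nReply one\n",
    "Message-ID: <a2@example.org>\nIn-Reply-To: <a1@example.org>\n\nThanks\n",
    "Message-ID: <a3@example.org>\nIn-Reply-To: <q1@example.org>\n\nReply two\n",
]

index = build_thread_index(SAMPLE)
thread = followups(index, "<q1@example.org>")
print(thread)
```

The point of the sketch is that none of the replies need to share a
single keyword with the original question: once the question is found,
the followups come from the header links alone.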

-john