Date:      Mon, 30 Mar 1998 11:02:00 +0100
From:      nik@iii.co.uk
To:        shimon@simon-shapiro.org
Cc:        Wolfram Schneider <wosch@cs.tu-berlin.de>, freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami <asami@FreeBSD.ORG>, Amancio Hasty <hasty@rah.star-gate.com>
Subject:   Mailing list search interface
Message-ID:  <19980330110200.17368@iii.co.uk>
In-Reply-To: <XFMail.980329135730.shimon@simon-shapiro.org>; from Simon Shapiro on Sun, Mar 29, 1998 at 01:57:30PM -0800
References:  <p1i3eg5jdbb.fsf@panke.panke.de> <XFMail.980329135730.shimon@simon-shapiro.org>

Gents,

On Sun, Mar 29, 1998 at 01:57:30PM -0800, Simon Shapiro wrote:
> On 26-Mar-98 Wolfram Schneider wrote:
> > The FreeBSD mailing list search interface support threads. The
> > thread database will be updated hourly. Of course there are
> > many things to do to make the threads more user friendly.
> 
> We have been playing with the idea of normalizing the archive into an
> RDBMS.  Some of the benefits are:

<snip>

Could we coordinate on some of this? I've been working on a system (at
work) for making some of our mailing list archives visible and searchable
on our internal site. I'm using MHonArc, Glimpse (both of which are in 
the ports tree) and a customised version of Wilma 

    <URL:http://www.hpc.uh.edu/majordomo/#wilma>

and it's almost at the point where this would be useful for the project.

I mentioned MHonArc to Jordan, and his first response was 

> Eeek!  The evil MHonArc resurfaces! ;-)
>
> It doesn't scale at all well - just try MHonArc'ing a really big mailing
> list archive.  You soon get a set of monster html files that are
> essentially unusable - I know, I did the short-lived "FreeBSD Docs"
> CD for awhile using MHonArc.

I think he's been using an older version of MHonArc. I did some tests
late last week, archiving and indexing the archives for -hackers from
the beginning of 1998. That's 11,265K or thereabouts.

At the end of the conversion (which consisted of running MHonArc 2.2.0          
over the files, and then using Glimpse 4.1 to index them) I had a total         
of 32,910K HTML and index files.                                                
                                                                                
The output of 'time -l' on the conversion process was:                          
                                                                                
      626.11 real       438.83 user        93.13 sys                            
      8572  maximum resident set size                                           
       390  average shared memory size                                          
      4311  average unshared data size                                          
       128  average unshared stack size                                         
   1054806  page reclaims                                                       
        68  page faults                                                         
         0  swaps                                                               
      9725  block input operations                                              
      6115  block output operations 
         0  messages sent                                                       
         0  messages received                                                   
         0  signals received                                                    
     18065  voluntary context switches                                          
     26547  involuntary context switches                                        
                                                                                
That's an exceptional case, though, because it had to build the archive
for the year to date, and you only take this hit once. Once the archive
is up and running, you're only building HTML files for new messages since
the last update, which is (or should be) considerably faster.
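
To illustrate the incremental case, here is a minimal sketch in Python. It
is not how MHonArc does it internally; the function names, the idea of
keeping a list of already-archived message IDs, and the convert callback
are all assumptions made for the example.

```python
def new_messages(message_ids, already_archived):
    """Return the IDs that have not been converted yet, in arrival order."""
    seen = set(already_archived)
    return [mid for mid in message_ids if mid not in seen]


def update_archive(message_ids, already_archived, convert):
    """Convert only the messages that arrived since the last update.

    `convert` is a hypothetical callback that writes the HTML file for one
    message. `already_archived` is updated in place so that the next run
    skips everything handled here.
    """
    pending = new_messages(message_ids, already_archived)
    for mid in pending:
        convert(mid)
        already_archived.append(mid)
    return pending
```

The first run over an empty archive converts everything (the expensive
case timed above); later runs only touch what is new.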
                                                                                
Regrettably, there's currently a bug in Glimpse 4.1 which means that
you need to reindex the entire archive, rather than just those bits that
change. Fortunately, there are command line switches to tell the                
glimpseindex program how much memory to use.                                    
                                                                                
That 8572 max. resident size figure is from MHonArc rather than glimpse,        
since it reads in (as far as I can tell) the whole of the mail archive          
file before processing it. 

While the conversion was happening the load on my machine hovered around
the 0.9-1.1 mark, with X, Netscape, XEmacs and a bunch of xterms open.
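
For a rough feel for those numbers, the figures quoted in this message can
be boiled down to three ratios. This is only back-of-the-envelope
arithmetic in Python; the function and key names are made up for the
example, and it assumes the 626.11 seconds covers the whole conversion.

```python
# Figures quoted earlier in this message.
INPUT_KB = 11_265      # size of the -hackers mail archive
OUTPUT_KB = 32_910     # HTML plus index files after conversion
REAL_S, USER_S, SYS_S = 626.11, 438.83, 93.13   # from 'time -l'


def summarise(input_kb, output_kb, real_s, user_s, sys_s):
    """Derive simple ratios from the sizes and timings."""
    return {
        "growth_factor": output_kb / input_kb,    # output size / input size
        "kb_per_second": input_kb / real_s,       # mail processed per wall-clock second
        "cpu_share": (user_s + sys_s) / real_s,   # fraction of wall-clock time on CPU
    }


stats = summarise(INPUT_KB, OUTPUT_KB, REAL_S, USER_S, SYS_S)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

That works out to roughly 2.9 times the original size, about 18K of mail
per second, and the process on the CPU for about 85% of the elapsed time.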
                                                                                
At the end of the conversion process I had a threaded copy of the -hackers      
mail archives going back almost three months.                                   
                                                                                
Each month has two indices -- a date index where you see all the messages       
in the order they came in, and a threaded index.                                
                                                                                
Each index shows (at most) 200 messages (that's a configurable number).         
This is so the size of the index files doesn't grow without end. Each           
index shows a "This is page x of y of the threaded index" comment, with         
navigation text to go backwards and forwards in the index.                      
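
The paging scheme is simple enough to sketch in a few lines of Python.
This is an illustration only, not the code MHonArc or Wilma actually use;
paginate, page_banner and nav_links are hypothetical names.

```python
def paginate(messages, page_size=200):
    """Split an index into pages of at most `page_size` messages."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    pages = [messages[i:i + page_size]
             for i in range(0, len(messages), page_size)]
    return pages or [[]]   # an empty index still gets one (empty) page


def page_banner(page_no, total, index_name="threaded"):
    """Build the 'This is page x of y' comment for one index page."""
    return f"This is page {page_no} of {total} of the {index_name} index"


def nav_links(page_no, total):
    """Return (label, target page) pairs for moving through the index."""
    links = []
    if page_no > 1:
        links.append(("previous", page_no - 1))
    if page_no < total:
        links.append(("next", page_no + 1))
    return links
```

With 450 messages and the default of 200 per page, this gives three pages,
the last one holding the remaining 50 messages.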
                                                                                
This whole thing is searchable, allowing searches by combination of             
keywords. You can specify the number of misspellings to allow, the
number of hits to return, case sensitivity, and which months to restrict        
your search to.
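
To make those search options concrete, here is a toy version in Python.
It is not Glimpse's algorithm: it uses a plain edit distance per word in
place of whatever Glimpse does internally, and the message layout (dicts
with "month" and "text" keys) and all the parameter names are assumptions
for the example.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def search(messages, keywords, max_errors=0, max_hits=10,
           case_sensitive=False, months=None):
    """Return messages in which every keyword matches some word.

    A keyword matches a word if their edit distance is at most
    `max_errors`. `months`, if given, restricts the search to those months.
    """
    if not case_sensitive:
        keywords = [k.lower() for k in keywords]
    hits = []
    for msg in messages:
        if months is not None and msg["month"] not in months:
            continue
        text = msg["text"] if case_sensitive else msg["text"].lower()
        words = re.findall(r"\w+", text)
        if all(any(edit_distance(k, w) <= max_errors for w in words)
               for k in keywords):
            hits.append(msg)
            if len(hits) >= max_hits:
                break
    return hits
```

For example, search(msgs, ["glimpes"], max_errors=2) would still find
messages containing "glimpse", since swapping two letters counts as two
substitutions here.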

The only thing you can't do (at the moment) is search across more than one      
mailing list. It shouldn't be too hard to add. Right now, I don't have a        
URL I can give to show you the results, since I ran out of time last night      
(I must be getting old, I used to be able to do 72 hour coding runs and not     
really feel it <sigh>). I should be able to get something demonstrable          
up on my freefall account by the middle of next week.                           
                                                                                
In light of all that, do you think this is worth pursuing further?

Thoughts?

N
-- 
Work: nik@iii.co.uk                       | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk    | Remind me again why we need
Play: nik@freebsd.org                     | Microsoft?

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message


