Date:      Mon, 30 Mar 1998 10:26:29 -0800 (PST)
From:      Simon Shapiro <shimon@simon-shapiro.org>
To:        nik@iii.co.uk
Cc:        Amancio Hasty <hasty@rah.star-gate.com>, Satoshi Asami <asami@FreeBSD.ORG>, scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider <wosch@cs.tu-berlin.de>
Subject:   RE: Mailing list search interface
Message-ID:  <XFMail.980330102629.shimon@simon-shapiro.org>
In-Reply-To: <19980330110200.17368@iii.co.uk>

I have no strong opinions in this matter.  My experience with indexing
methods residing in (essentially) flat Unix files is that they do not scale
well.  This is what database engines are for.
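
To illustrate the point, here is a minimal sketch of header data held in an indexed table rather than scanned out of flat files. It uses Python's built-in sqlite3 module purely as a stand-in for a real engine such as PostgreSQL. The table name, column names and sample rows are all made up for the example and do not describe any existing archive schema.

```python
import sqlite3


def build_header_index(messages):
    """Load (message_id, sender, subject, sent_date) tuples into a table.

    The schema is hypothetical; an in-memory SQLite database stands in
    for a real RDBMS.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE headers ("
        " message_id TEXT PRIMARY KEY,"
        " sender TEXT NOT NULL,"
        " subject TEXT NOT NULL,"
        " sent_date TEXT NOT NULL)"
    )
    # The index lets a lookup by sender avoid scanning every row.
    conn.execute("CREATE INDEX headers_sender ON headers (sender)")
    conn.executemany("INSERT INTO headers VALUES (?, ?, ?, ?)", messages)
    conn.commit()
    return conn


def subjects_from(conn, sender):
    """Return the subjects of all messages from one sender, oldest first."""
    rows = conn.execute(
        "SELECT subject FROM headers WHERE sender = ? ORDER BY sent_date",
        (sender,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Made-up sample data, ISO dates so that text ordering matches time order.
    sample = [
        ("<1@example.org>", "a@example.org", "first", "1998-03-01"),
        ("<2@example.org>", "b@example.org", "second", "1998-03-02"),
        ("<3@example.org>", "a@example.org", "third", "1998-03-03"),
    ]
    conn = build_header_index(sample)
    print(subjects_from(conn, "a@example.org"))
```

Header fields suit this treatment because they are short and structured. Searching message bodies is a separate problem that this sketch does not address.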

Truth be told, PostgreSQL currently uses Unix files to store its
indices and tables, so performance is not all that it could be.  I am
working on building a raw device storage manager for PostgreSQL, which will
allow shared access (cluster like) and much faster speed.  The only issue I
have not settled on is how to search the message bodies.  Maybe I will get
some free time soon and try a few things.

What is an acceptable search rate?  For header type data?  For body regex?

BTW, if your project is almost ready, go ahead with it.  It does not
conflict at all with what I am thinking of.

On 30-Mar-98 nik@iii.co.uk wrote:
> Gents,
> 
> On Sun, Mar 29, 1998 at 01:57:30PM -0800, Simon Shapiro wrote:
>> On 26-Mar-98 Wolfram Schneider wrote:
>> > The FreeBSD mailing list search interface supports threads. The
>> > thread database will be updated hourly. Of course there are
>> > many things to do to make the threads more user friendly.
>> 
>> We have been playing with the idea of normalizing the archive into an
>> RDBMS.  Some of the benefits are:
> 
> <snip>
> 
> Could we coordinate on some of this? I've been working on a system (at
> work) for making some of our mailing list archives visible and searchable
> on our internal site. I'm using MHonArc, Glimpse (both of which are in 
> the ports tree) and a customised version of Wilma 
> 
>     <URL:http://www.hpc.uh.edu/majordomo/#wilma>
> 
> and it's almost at the point where this would be useful for the project.
> 
> I mentioned MHonArc to Jordan, and his first response was 
> 
>> Eeek!  The evil MHonArc resurfaces! ;-)
>>
>> It doesn't scale at all well - just try MHonArc'ing a really big mailing
>> list archive.  You soon get a set of monster html files that are
>> essentially unusable - I know, I did the short-lived "FreeBSD Docs"
>> CD for awhile using MHonArc.
> 
> I think he's been using an older version of MHonArc. I did some tests
> late last week, archiving and indexing the archives for -hackers from
> the beginning of 1998. That's 11,265K or thereabouts.
> 
> At the end of the conversion (which consisted of running MHonArc 2.2.0   
> over the files, and then using Glimpse 4.1 to index them) I had a total  
> of 32,910K HTML and index files.                                         
>                                                                          
> The output of 'time -l' on the conversion process was:                   
>                                                                          
>       626.11 real       438.83 user        93.13 sys                     
>       8572  maximum resident set size                                    
>        390  average shared memory size                                   
>       4311  average unshared data size                                   
>        128  average unshared stack size                                  
>    1054806  page reclaims                                                
>         68  page faults                                                  
>          0  swaps                                                        
>       9725  block input operations                                       
>       6115  block output operations 
>          0  messages sent                                                
>          0  messages received                                            
>          0  signals received                                             
>      18065  voluntary context switches                                   
>      26547  involuntary context switches                                 
>                                                                          
> That's a reasonably exceptional time, because it had to build the archive
> for the year to date, and you only take this hit once. Once the archive  
> is up and running, you're only building HTML files for new messages since
> the last update, which is (or should be) considerably faster.            
>                                                                          
> Regrettably at the moment, there's a bug in Glimpse 4.1, which means that
> you need to reindex the entire archive, rather than just those bits that 
> change. Fortunately, there are command line switches to tell the         
> glimpseindex program how much memory to use.                             
>                                                                          
> That 8572 max. resident size figure is from MHonArc rather than glimpse, 
> since it reads in (as far as I can tell) the whole of the mail archive   
> file before processing it. 
> 
> While the conversion was happening the load on my machine hovered around 
> the .9-1.1 mark. With X, Netscape, XEmacs and a bunch of xterms open.    
>                                                                          
> At the end of the conversion process I had a threaded copy of the -hackers
> mail archives going back almost three months.
>                                                                          
> Each month has two indices -- a date index where you see all the messages
> in the order they came in, and a threaded index.                         
>                                                                          
> Each index shows (at most) 200 messages (that's a configurable number).  
> This is so the size of the index files doesn't grow without end. Each    
> index shows a "This is page x of y of the threaded index" comment, with  
> navigation text to go backwards and forwards in the index.               
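
A minimal sketch of the paging scheme described above, assuming nothing about how MHonArc implements it internally. The function names are hypothetical; only the 200-message default and the wording of the banner come from the description.

```python
def paginate(message_ids, page_size=200):
    """Split an ordered list of message ids into index pages.

    page_size is the configurable cap on messages per index page.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [
        message_ids[start:start + page_size]
        for start in range(0, len(message_ids), page_size)
    ]


def page_banner(page_number, page_count, index_name="threaded"):
    """Build the 'page x of y' comment shown on each index page."""
    return f"This is page {page_number} of {page_count} of the {index_name} index"


if __name__ == "__main__":
    # 450 made-up message ids spread over pages of at most 200 each.
    pages = paginate(list(range(450)))
    for number, page in enumerate(pages, start=1):
        print(page_banner(number, len(pages)), "-", len(page), "messages")
```

Capping the page size keeps each index file bounded, whatever the size of the month's traffic.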
>                                                                          
> This whole thing is searchable, allowing searches by combination of
> keywords. You can specify the number of misspellings to allow, the
> number of hits to return, case sensitivity, and which months to restrict
> your search to.
> 
> The only thing you can't do (at the moment) is search across more than one
> mailing list. It shouldn't be too hard to add. Right now, I don't have a
> URL I can give to show you the results, since I ran out of time last night
> (I must be getting old, I used to be able to do 72 hour coding runs and not
> really feel it <sigh>). I should be able to get something demonstrable
> up on my freefall account by the middle of next week.
>                                                                          
> In light of all that, do you think this is worth pursuing further?
> 
> Thoughts?
> 
> N
> -- 
> Work: nik@iii.co.uk                       | FreeBSD + Perl + Apache
> Rest: nik@nothing-going-on.demon.co.uk    | Remind me again why we need
> Play: nik@freebsd.org                     | Microsoft?

----------


Sincerely Yours, 

Simon Shapiro
Shimon@Simon-Shapiro.ORG                      Voice:   503.799.2313

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message


