Date:      Mon, 30 Mar 1998 12:02:45 -0800 (PST)
From:      Simon Shapiro <shimon@simon-shapiro.org>
To:        John Fieber <jfieber@indiana.edu>
Cc:        Amancio Hasty <hasty@rah.star-gate.com>, Satoshi Asami <asami@FreeBSD.ORG>, scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider <wosch@cs.tu-berlin.de>
Subject:   Re: [PORTS] Pgaccess doesn't run on -current anymore, Update
Message-ID:  <XFMail.980330120245.shimon@simon-shapiro.org>
In-Reply-To: <Pine.BSF.3.96.980330084814.485S-100000@fallout.campusview.indiana.edu>

On 30-Mar-98 John Fieber wrote:
 ...

> It has been well established for many years by professionals in
> database R&D that a traditional RDBMS is utterly and completely
> the wrong tool for free text searching.  This turns out to be
> true even for some relatively structured data types like
> bibliographic records.

I made a decent career building systems that "established professionals"
said could not be built.  We can discuss some of these privately :-)
Having said that, you are probably right, to a degree.
The way around it is NOT to search free text in the database.

> There *are* some tasks in real-world applications that are
> RDBMS type things--a message-id based thread index is simple to
> implement for instance--so I'm all for hybrid systems.  The big
> RDBMS vendors usually have some optional module optimized for
> free-text searching and some SQL extensions to access it.
> I've pondered writing such a module for postgres, but don't
> really know enough about extending postgres to know how well it
> would work. 

This is what I had in mind exactly.  To normalize what can be normalized,
and leave the rest of it as text.  The problem, in a UFS, is that when the
number of files in the filesystem grows, directory searches become very
costly.  The mail archive (as secondary as it may appear) is an opportunity
to investigate these issues.  With a million messages split across several
dozen directories (unless you hash the message IDs into the lists, etc.),
you should be seeing some performance degradation in open(2), which does
directory scans.
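
As a rough illustration of hashing the message IDs, here is a minimal
sketch in Python.  The function name, the two-level layout, and the use of
SHA-1 are assumptions made for the example, not anything the archive
actually does.  With two levels of two hex digits there are 65536 leaf
directories, so a million messages average about 15 files each and every
directory scan stays short.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root: str, message_id: str, levels: int = 2) -> PurePosixPath:
    """Map a Message-ID to root/ab/cd/<digest> (hypothetical layout).

    Each level uses two hex digits of the digest, so no directory
    holds more than 256 subdirectories.
    """
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    # Take the first `levels` pairs of hex digits as directory names.
    parts = [digest[2 * i : 2 * i + 2] for i in range(levels)]
    return PurePosixPath(root, *parts, digest)
```

The same Message-ID always maps to the same path, so no separate lookup
table is needed to find a message again.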

How about putting the message body into the RDBMS as a TEXT datatype?  At
least you can query it by some integer index.  This means you can use a
B-Tree to find the message, rather than a directory scan.
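
A minimal sketch of that idea, using Python's bundled sqlite3 module as a
stand-in for Postgres (an assumption made only so the example is
self-contained).  The table and function names are hypothetical.  In SQLite
an INTEGER PRIMARY KEY is the key of the table's own B-tree, so the lookup
by number never touches a directory.

```python
import sqlite3


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical message store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS message ("
        " id INTEGER PRIMARY KEY,"          # integer index, B-tree key
        " message_id TEXT UNIQUE NOT NULL,"  # normalized header field
        " body TEXT NOT NULL)"               # the rest, left as text
    )
    return conn


def store(conn: sqlite3.Connection, message_id: str, body: str) -> int:
    """Insert one message and return its integer index."""
    cur = conn.execute(
        "INSERT INTO message (message_id, body) VALUES (?, ?)",
        (message_id, body),
    )
    conn.commit()
    return cur.lastrowid


def fetch(conn: sqlite3.Connection, msg_no: int):
    """Return the body for an integer index, or None if absent."""
    row = conn.execute(
        "SELECT body FROM message WHERE id = ?", (msg_no,)
    ).fetchone()
    return row[0] if row else None
```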

If the message is in a blob, applying a regex to it from within the
database can be optimized.
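
To show what applying a regex from within the database can look like, here
is a small sketch, again with Python's sqlite3 standing in for Postgres.
It assumes a table named message with id and body columns, and the helper
names are hypothetical.  SQLite evaluates "X REGEXP Y" by calling a
user-supplied regexp(Y, X), so the pattern arrives as the first argument.
Note that this still visits every row; it shows only where the matching
runs, not any of the optimization.

```python
import re
import sqlite3


def _regexp(pattern, text):
    """Backs the SQL REGEXP operator; called as regexp(pattern, value)."""
    if text is None:
        return 0
    return int(re.search(pattern, text) is not None)


def connect_with_regexp(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection with REGEXP registered as a 2-argument function."""
    conn = sqlite3.connect(path)
    conn.create_function("REGEXP", 2, _regexp)
    return conn


def search(conn: sqlite3.Connection, pattern: str) -> list:
    """Return ids of messages whose body matches the pattern."""
    rows = conn.execute(
        "SELECT id FROM message WHERE body REGEXP ? ORDER BY id",
        (pattern,),
    )
    return [r[0] for r in rows]
```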

Another option you mention, and Postgres is IDEAL for that, is a new,
native data type.  Search logic can then be applied, and even integrated
with the system.  Something to think about.

Simon

