Date:      Mon, 8 Jan 1996 20:05:41 +0200 (EET)
From:      "Andrew V. Stesin" <stesin@elvisti.kiev.ua>
To:        chuckr@glue.umd.edu (Chuck Robey)
Cc:        ports@freebsd.org
Subject:   Re: making ports
Message-ID:  <199601081805.UAA12117@office.elvisti.kiev.ua>
In-Reply-To: <Pine.SUN.3.91.960107102808.693C-100000@cappuccino.eng.umd.edu> from "Chuck Robey" at Jan 7, 96 10:29:44 am

Hello again,

# > 	Isn't Glimpse only a part of a whole lot bigger Harvest-1.4pl1
# > 	distribution now?
# > 	I'm just compiling Harvest in order to learn this pretty complex
# > 	thing. Its companion, cached-1.4pl0  (proxy HTTP caching daemon)
# > 	is waiting here, too.
# > 
# > 	(ftp://ftp.cs.colorado.edu/pub/distribs/harvest)
# > 
# > -- 
# 
# It may be, but that's not immediately obvious from the glimpse side, at 
# least to me.  If you're doing harvest, you'll probably find that out for 
# us.  Glimpse seems to be a general purpose text search engine ... is 
# Harvest something that has been specialized for web stuff?

	Not only for web -- just for everything!

	Harvest overall is pretty complex, but (in my opinion)
	has a really powerful high-level design, based on good ideas.
	Right now I'm printing and reading a couple of techreports on
	Harvest and its User Manual (all in .ps, that's sloooow on a 9-pin
	Epson :) in order to catch more details. Here is what I've figured
	out so far, very briefly -- if anyone is interested, all this info
	is available electronically.

	The design is 2-level:

	1. "Gatherer"-like tools. Their purpose is to extract relevant
	   information from different sources with a tunable level
	   of detail. "Different" here means that files to be
	   processed may be 

	   a)  accessed locally or remotely from a set of servers,
	       public or private ones, via FTP, HTTP, NNTP and whatever else;
	   b)  of different formats (including .ps, SGML, HTML as an SGML subset,
	       RCS/CVS, .o, netnews, e-mail archives, even .gif,
	       and _many_ others); one can
	       add new custom file types. E.g. a tool to convert
	       WordPerfect files to something like RTF, then to SGML, and then
	       make "juice" from it is available, too.

	   In brief: Gatherer makes X litres of juice from Y tons of oranges :-)
	   Y is much greater than X.

	   The Gatherer which comes with Harvest is based on a tool named
	   Essence, developed by Hardy and Schwartz at cs.colorado.edu.
	   It is capable of making "juice" even from "nested" things,
	   like .tgz files containing any of the above formats!

	   How high is the percentage of juice?
	   It's tunable; you may set it to "full" mode, where nothing is
	   lost, or to some less detailed mode where it drops
	   some words. The techreport on Essence has comparisons with WAIS's
	   content capturing efficiency and index sizes; Essence is better
	   (I'm inclined to believe this without precise testing -- I tried
	   WAIS and wasn't happy with it).

	   All this makes the idea of installing a Gatherer on a _big_ FTP server
	   (like ftp.cdrom.com :) pretty attractive; see below.

	   I'm going to look deeper into Gatherer/Essence soon
	   and try to figure out --
	   why do they use the GNU dbm library (and package it with Harvest)
	   and GNU malloc? I'm not too happy with this idea;
	   I suspect that using Berkeley DB and PHK malloc (thanks, Paul!
	   that's the very best malloc() I've used!) will give some performance
	   benefit.

	2. "Brokers". A Broker can collect "juice" from one or more local
	   or remote "gatherers"; information is retrieved from them
	   via some conventional protocol (FTP?). It then makes an
	   index of what it got, and provides an out-of-the-box WWW
	   binding for searches. Damn cool!

	   Harvest's Broker can use different search engines. Glimpse and WAIS
	   bindings are already present, and there are hooks for others.
	   There is some support for using the commercial Topic search engine
	   from Verity Inc. (see http://www.verity.com). I have already
	   asked them about the details, and hope they will send me an answer.

	   (!) My impression so far is that there aren't too many search
	   engines around, either free or commercial. If someone
	   can point me at something other than Glimpse, WAIS and Topic,
	   I'd be very grateful, and will test every one which is
	   freely available.

	   Glimpse is the "default" engine for Harvest. It was written recently
	   by the author(s) of the 'agrep' tool; it's agrep-based and
	   includes the agrep distribution. The Harvest distribution includes
	   a lemon-fresh Glimpse distribution in it.

	   So, as for ftp.cdrom.com: once you've got a Gatherer working,
	   you may launch a Broker, hook it to www.cdrom.com and voila!
	   one can even find the name of the author of /pub/msdos/virus.exe,
	   hidden somewhere in the executable! :-)
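
	To make the two-level idea concrete, here is a minimal sketch in
	Python. It is only an illustration under loud assumptions: all
	names (extract_words, gather, build_index, search, make_tgz) are
	hypothetical, the "juice" is reduced to plain lowercase words, and
	nothing here reproduces Essence's real summarizers, Harvest's
	summary format, or Glimpse's index. It shows a gatherer that
	unpacks nested .tgz objects and has a tunable level of detail,
	and a broker that merges the output of several gatherers into one
	searchable index.

```python
import io
import re
import tarfile
from collections import Counter, defaultdict

# Assumption for this sketch: a "word" is 3 or more ASCII letters.
WORD = re.compile(r"[A-Za-z]{3,}")


def extract_words(name, data):
    """Count the words in one object, descending into nested .tgz archives."""
    counts = Counter()
    if name.endswith((".tgz", ".tar.gz")):
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
            for member in tar:
                if member.isfile():
                    inner = tar.extractfile(member).read()
                    counts += extract_words(member.name, inner)
    else:
        # latin-1 never fails to decode, so binary objects still yield words.
        text = data.decode("latin-1")
        counts.update(w.lower() for w in WORD.findall(text))
    return counts


def gather(objects, max_words=None):
    """Gatherer: map each object name to its keyword summary ("juice").

    max_words=None is the "full" mode (every word is kept); a number keeps
    only that many of the most frequent words.
    """
    summaries = {}
    for name, data in objects.items():
        counts = extract_words(name, data)
        summaries[name] = {w for w, _ in counts.most_common(max_words)}
    return summaries


def build_index(*gatherer_outputs):
    """Broker: merge summaries from several gatherers into an inverted index."""
    index = defaultdict(set)
    for summaries in gatherer_outputs:
        for name, words in summaries.items():
            for w in words:
                index[w].add(name)
    return index


def search(index, word):
    """Return the sorted names of all objects whose summary contains word."""
    return sorted(index.get(word.lower(), ()))


def make_tgz(files):
    """Helper for experiments: build an in-memory .tgz from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

	To try it, pass gather() a dict of {name: bytes} (one of them can
	be built with make_tgz), feed one or more results to build_index(),
	and query with search(index, "glimpse"). The broker only ever sees
	the small summaries, never the original objects, which is the point
	of the juice-from-oranges split.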

	Another great tool (especially for those who have a LAN with a slow
	link to the world) is Cached -- I'm planning to put it on our
	firewall-gateway which I'm building now step by step.

	So people, I believe it's cool! and let's make FreeBSD
	Harvest's platform of choice! :-)

	I'll inform people about my experience with Harvest later, when
	I get it running. Yes, I'm considering making a port, too;
	but only after I become a Harvest expert and
	have answers to all of my current questions.

	BTW: what is the preferred way to handle GNU autoconf when
	making a port for FreeBSD?


-- 

	With best regards -- Andrew Stesin.

	+380 (44) 2760188	+380 (44) 2713457	+380 (44) 2713560

	An undocumented feature is a coding error.


