Date:      Tue, 17 Sep 2002 23:11:09 +0100
From:      Ceri Davies <ceri@FreeBSD.org>
To:        Mike Thompson <mike@atomz.com>
Cc:        Michael Lucas <mwlucas@FreeBSD.org>, Wolfram Schneider <wosch@FreeBSD.ORG>, freebsd-doc@FreeBSD.ORG, fran@atomz.com
Subject:   Re: Improved searching for FreeBSD.org web site
Message-ID:  <20020917221109.GA13675@submonkey.net>
In-Reply-To: <4.3.2.7.2.20020917131047.00ad6ee0@pop.atomz.com>
References:  <4.3.2.7.2.20020906155438.00e4a2d0@pop.atomz.com> <4.3.2.7.2.20020822220010.00ab2540@pop.atomz.com> <4.3.2.7.2.20020822220010.00ab2540@pop.atomz.com> <20020905123434.A15857@blackhelicopters.org> <4.3.2.7.2.20020906155438.00e4a2d0@pop.atomz.com> <4.3.2.7.2.20020917131047.00ad6ee0@pop.atomz.com>

On Tue, Sep 17, 2002 at 02:28:57PM -0700, Mike Thompson wrote:
> Hi Michael,

[wosch: please skip to item #3]

Hi Mike,

> 1. Categorized searches across FreeBSD web site

I'll let someone else answer this as I consider myself shortsighted in
this area.

> 2. Searching the list archives
> 
> The mailing lists will prove to be a bit of a challenge to properly 
> index.  The reason is that it will likely take our search engine several 
> days to crawl and index the full email archives of nearly a million email 
> documents.  It may make more sense to store a local copy of the email 
> archives at Atomz and then only crawl the latest archives from the actual 
> FreeBSD.org web servers.  Our search engine will properly redirect the 
> search result URLs to the proper content on the real FreeBSD web 
> site.  This way we can crawl and index the local copy of the email archives 
> at fast network speeds and lessen the drain our crawler would place on the 
> FreeBSD servers and the getmsg.cgi script.  I'll need to investigate this a 
> little further.

If you decide a local copy is the way to go then I'm sure all we'll need is
an FTP server and an account to upload these files to. There are just over
920MB in the archive at the moment.

> Also, it would be very beneficial if the getmsg.cgi script for viewing 
> email messages were modified to put the following meta tags into the HTML 
> header for each email message:
> 
> <meta name="list" content="freebsd-questions">
> <meta name="subject" content="4.5 install problem">
> <meta name="date" content="Wed, 6 Feb 2002 10:09:51 -0500">
> <meta name="from" content="Jeff Aitken <jaitken@aitken.com>">

Sounds fine - I can see this being of general use as well.
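
For illustration only, here is a minimal sketch of how the four meta tags
could be built from a message's headers.  It is written in Python rather
than as a patch to getmsg.cgi, and the function name meta_tags and the way
the list name is passed in are assumptions, not part of the real script.
Header values are HTML-escaped, so the address in the From header comes out
as &lt;...&gt; instead of the raw angle brackets shown in the example above.

```python
from email import message_from_string
from html import escape


def meta_tags(list_name, raw_message):
    """Return the four proposed <meta> lines for one archived message.

    Hypothetical helper: the real getmsg.cgi may obtain the list name
    and the headers in a different way.
    """
    msg = message_from_string(raw_message)
    fields = [
        ("list", list_name),
        ("subject", msg.get("Subject", "")),
        ("date", msg.get("Date", "")),
        ("from", msg.get("From", "")),
    ]
    # Escape &, <, > and quotes so a header cannot break out of the attribute.
    return "\n".join(
        '<meta name="%s" content="%s">' % (name, escape(value, quote=True))
        for name, value in fields
    )


# Sample message built from the header values quoted in the proposal.
raw = (
    "From: Jeff Aitken <jaitken@aitken.com>\n"
    "Subject: 4.5 install problem\n"
    "Date: Wed, 6 Feb 2002 10:09:51 -0500\n"
    "\n"
    "message body\n"
)
tags = meta_tags("freebsd-questions", raw)
print(tags)
```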

> If I made the changes to the getmsg.cgi script, is this something that I 
> would be able to get checked into the CVS tree so this meta information 
> appears in future versions of the email archives?

For sure.

> 3. HTTP Error 403 - Forbidden on certain FreeBSD.org URLs
> 
> Our search engine is having problems crawling URLs such as the following 
> description of the ncftpd port -- an HTTP 403 error of forbidden is 
> returned.
> 
> http://www.FreeBSD.org/cgi/url.cgi?ports/ftp/ncftpd/pkg-descr
> 
> Are there known aspects of the FreeBSD.org web site that prevent robots 
> from crawling certain portions of the web site?

Yes, there are.  This is a result of us using robots.txt to disallow robots
from indexing anything under /cgi and using redirects in httpd.conf to enforce
this.  Unfortunately, crawling those URLs is exactly what you need to do.

wosch, could we take a look at explicitly allowing Atomz to access this part
of the site?
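
As a rough sketch of the robots.txt side of this, the Python below checks
the ncftpd URL against two rule sets using the standard urllib.robotparser
module.  Both rule sets are assumptions written for this example (the real
FreeBSD.org robots.txt is not quoted in this thread), and "Atomz" is a
guessed user-agent token.  It does not model the httpd.conf redirects, which
would also need an exception before the 403 goes away.

```python
from urllib.robotparser import RobotFileParser

URL = "http://www.FreeBSD.org/cgi/url.cgi?ports/ftp/ncftpd/pkg-descr"

# Assumed shape of the current rules: every robot is kept out of /cgi.
CURRENT_RULES = """\
User-agent: *
Disallow: /cgi
"""

# Assumed exception: an empty Disallow for one named robot allows it
# everywhere, while all other robots stay blocked from /cgi.
RULES_WITH_EXCEPTION = """\
User-agent: Atomz
Disallow:

User-agent: *
Disallow: /cgi
"""


def allowed(rules, agent, url=URL):
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(CURRENT_RULES, "Atomz"))
print(allowed(RULES_WITH_EXCEPTION, "Atomz"))
print(allowed(RULES_WITH_EXCEPTION, "OtherBot"))
```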

> That's where I'm at for now.  Hopefully I'll have something to show pretty 
> quickly.

Excellent - thanks again.

Ceri

-- 
you can't see when light's so strong
you can't see when light is gone

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-doc" in the body of the message



