From: Ceri Davies
To: Mike Thompson
Cc: Michael Lucas, Wolfram Schneider, freebsd-doc@FreeBSD.ORG, fran@atomz.com
Date: Tue, 17 Sep 2002 23:11:09 +0100
Subject: Re: Improved searching for FreeBSD.org web site
Message-ID: <20020917221109.GA13675@submonkey.net>

On Tue, Sep 17, 2002 at 02:28:57PM -0700, Mike Thompson wrote:
> Hi Michael,  [wosch: please skip to item #3]

Hi Mike,

> 1. Categorized searches across FreeBSD web site

I'll let someone else answer this as I consider myself shortsighted in
this area.

> 2. Searching the list archives
>
> The mailing lists will prove to be a bit of a challenge to properly
> index.  The reason is that it will likely take our search engine several
> days to crawl and index the full email archives of nearly a million email
> documents.  It may make more sense to store a local copy of the email
> archives at Atomz and then only crawl the latest archives from the actual
> FreeBSD.org web servers.  Our search engine will properly redirect the
> search result URLs to the proper content on the real FreeBSD web
> site.  This way we can crawl and index the local copy of the email archives
> at fast network speeds and lessen the drain our crawler would place on the
> FreeBSD servers and the getmsg.cgi script.  I'll need to investigate this a
> little further.

If you decide a local copy is the way to go then I'm sure all we'll need
is an FTP server and an account to upload these files to.
There is just over 920MB in the archive at the moment.

> Also, it would be very beneficial if the getmsg.cgi script for viewing
> email messages were modified to put the following meta tags into the HTML
> header for each email message:
>
>
>
>
>

Sounds fine - I can see this being of general use as well.

> If I made the changes to the getmsg.cgi script, is this something that I
> would be able to get checked into the CVS tree so this meta information
> appears in future versions of the email archives?

For sure.

> 3. HTTP Error 403 - Forbidden on certain FreeBSD.org URLs
>
> Our search engine is having problems crawling URLs such as the following
> description of the ncftpd port -- an HTTP 403 error of forbidden is
> returned.
>
>   http://www.FreeBSD.org/cgi/url.cgi?ports/ftp/ncftpd/pkg-descr
>
> Are there known aspects of the FreeBSD.org web site that prevent robots
> from crawling certain portions of the web site?

Yes, there are.  This is a result of us using robots.txt to disallow robots
indexing anything under /cgi and using redirects in httpd.conf to enforce
this.  Unfortunately, crawling those URLs is exactly what you need to do.

wosch, could we take a look at explicitly allowing Atomz to access this
part of the site?

> That's where I'm at for now.  Hopefully I'll have something to show pretty
> quickly.

Excellent - thanks again.

Ceri

-- 
you can't see when light's so strong
you can't see when light is gone
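
For illustration only, here is a rough sketch of how a robots.txt rule on
/cgi leads a well-behaved crawler to skip the ncftpd URL quoted above, and
how a per-crawler exception might look. The robots.txt contents and the
agent names "ExampleSearchBot" and "OtherBot" are assumptions made up for
this example; the real FreeBSD.org robots.txt and the Atomz crawler's
user-agent string are not shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks /cgi for every robot.
BLOCK_ALL = """\
User-agent: *
Disallow: /cgi
"""

# Hypothetical variant that exempts one named crawler from the /cgi rule.
# An empty Disallow line means "nothing is disallowed" for that agent.
ALLOW_ONE = """\
User-agent: ExampleSearchBot
Disallow:

User-agent: *
Disallow: /cgi
"""

URL = "http://www.FreeBSD.org/cgi/url.cgi?ports/ftp/ncftpd/pkg-descr"


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, rules in (("BLOCK_ALL", BLOCK_ALL), ("ALLOW_ONE", ALLOW_ONE)):
        for agent in ("ExampleSearchBot", "OtherBot"):
            print(name, agent, allowed(rules, agent, URL))
```

Note that robots.txt is only advisory. Since the 403 comes from the
redirects in httpd.conf, an exception like the one sketched here would
presumably need a matching change on the server side as well.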