From: Jeremy Chadwick
To: Ian Smith
Cc: questions@FreeBSD.org
Date: Wed, 5 Nov 2008 02:16:25 -0800
Subject: Re: Apache environment variables - logical AND
Message-ID: <20081105101625.GA6494@icarus.home.lan>
In-Reply-To: <20081105194002.N70117@sola.nimnet.asn.au>

On Wed, Nov 05, 2008 at 08:24:16PM +1100, Ian Smith wrote:
> On Tue, 4 Nov 2008, Jeremy Chadwick wrote:
> > On Wed, Nov 05, 2008 at 05:33:45PM +1100, Ian Smith wrote:
> > > I know this isn't FreeBSD specific - but I am, so crave your indulgence.
> > >
> > > Running Apache 1.3.27, using a fairly extensive access.conf to beat off
> > > the most rapacious robots and such, using mostly BrowserMatch[NoCase]
> > > and SetEnvIf to moderate access to several virtual hosts. No problem.
> > >
> > > OR conditions are of course straightforward:
> > >
> > >   SetEnvIf somevar
> > >   SetEnvIf somevar
> > >   SetEnvIf !somevar
> > >
> > > What I can't figure out is how to set a variable3 if and only if both
> > > variable1 AND variable2 are set. Eg:
> > >
> > >   SetEnvIf Referer "^$" no_referer
> > >   SetEnvIf User-Agent "^$" no_browser
> > >
> > > I want the equivalent for this (invalid and totally fanciful) match:
> > >
> > >   SetEnvIf (no_browser AND no_referer) go_away
> >
> > Sounds like a job for mod_rewrite. The SetEnvIf stuff is such a hack.
>
> It may be a hack, but I've found it an extremely useful one so far.
>
> > This is what we use on our production servers (snipped to keep it
> > short):
> >
> >   RewriteEngine on
> >   RewriteCond %{HTTP_REFERER} ^XXXX: [OR]
> >   RewriteCond %{HTTP_REFERER} ^http://forums.somethingawful.com/ [OR]
> >   RewriteCond %{HTTP_REFERER} ^http://forums.fark.com/ [OR]
> >   RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
> >   RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
> >   RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
> >   RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [NC,OR]
> >   RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
> >   RewriteCond %{HTTP_USER_AGENT} ^Xaldon.WebSpider
> >   RewriteRule ^.* - [F,L]
> >
> > You need to keep something in mind however: blocking by user agent is
> > basically worthless these days.
> > Most "leeching" tools now let you
> > spoof the user agent to show up as Internet Explorer, essentially
> > defeating the checks.
>
> While that's true, I've found most of the more troublesome robots are
> too proud of their 'brand' to spoof user agent, and those that do are a)
> often consistent enough in their Remote_Addr to exclude by subnet and/or
> b) often make obvious errors in spoofed User_Agent strings .. especially
> those pretending to be some variant of MSIE :)

I haven't found this to be true at all, and I've been doing web hosting
since 1993. In the past 2-3 years, the number of leeching tools which
spoof their User-Agent has increased dramatically.

But step back for a moment and look at it from a usability perspective,
because this is what really happens.

A user tries to leech a site you host, using FruitBatLeecher, which your
Apache server blocks based on User-Agent. The user has no idea why the
leech program doesn't work. Does the user simply give up his quest?
Absolutely not -- the user then goes and finds BobsBandwidthZilla which
pretends to be Internet Explorer, Firefox, or lynx, and downloads the
site.

Now, if you're trying to block robots/scrapers which aren't honouring
robots.txt, oh yes, that almost always works, because those rarely spoof
their User-Agent (I think to date I've only seen one site which did
that, and it was some Russian search engine).

If you feel I'm just doing burn-outs arguing, a la "BSD style", let me
give you some insight into how often I deal with this problem: daily.

We host a very specific/niche site that contains over 20 years of
technical information on the Famicom / Nintendo Entertainment System.
The site has hundreds of megabytes of information, and a very active
forum.

Some jackass comes along and decides "Wow, this has all the info I
want!" and fires off a leeching program against the entire domain/vhost.
Let's say the program he's using is blocked by our User-Agent blocks;
there is a 6-7 minute delay as the user goes off to find another program
to leech with, installs it, and attempts it again. Pow, it works, and we
find nice huge spikes in our logs for the vhost indicating someone got
around it. I later dig through our access_log and find that he tried to
use FruitBatLeecher, which got blocked, but then 6-7 minutes later came
back with a leeching client that spoofs itself as IE.

And it gets worse.

Many of these leeching programs get stuck in infinite loops when it
comes to forum software, so they sit there pounding on the webserver
indefinitely. It requires administrator intervention to stop it; in my
case, I don't even bother with Apache ACLs, because ~70% of the time the
client ignores 403s and keeps bashing away (yes really!) -- I go
straight for a pf-based block in a table called . These guys will hit
that block for *days* -- that should give you some idea how long they'll
let that program run.

But it gets worse -- again.

Recently, I found two examples of very dedicated leechers. One was an
individual out of China (or using Chinese IPs -- take your pick), and
another was at an Italian university. These individuals got past the
User-Agent blocks, and I caught their leeching software stuck in a loop
on the site forum. I blocked their IPs with pf, thinking it would be
enough, then went to sleep. I woke up the following evening to find they
were back at it again.

How? The Chinese individual literally got another IP somehow, in a
completely different netblock; possibly a DHCP release/renew, possibly
some friend of his, whatever. The Italian university individual was
successful in his leech attempts exactly 50% of the time -- because
their university used a transparent HTTP proxy that was balanced between
two IPs. I had only blocked one of them.

Starting to get the picture now? :-)

The only effective way to deal with all of this is rate-limiting.
I do not advocate "queues" or "buckets", or "dynamic buckets" where each
IP is allocated X number of simultaneous sockets, and if they exceed
that, they get rate-limited. I also do not advocate "shared queues",
where if there are X number of sockets, allow Z amount of bandwidth, but
if X is more than, say, 200 sockets, allow Z/2 amount of bandwidth.

The tuning is simply not worth it -- people will go to great lengths to
screw you. And if your stuff is in a 95th-percentile billing
environment, believe me, you DO NOT want to wake up one morning to find
that someone has cost you thousands of dollars.

Also, I recommend using ipfw dummynet or pf ALTQ for rate-limiting. The
few Apache bandwidth-limiting modules I've tried have bizarre side
effects. Here's a forum post of mine (on the above site) explaining why
we moved away from mod_cband and went with pf ALTQ.

http://nesdev.parodius.com/bbs/viewtopic.php?t=4184

> > If you're that concerned about bandwidth (which is why a lot of people
> > do the above), consider rate-limiting. It's really, quite honestly, the
> > only method that is fail-safe.
>
> Thanks Jeremy. Certainly time to take the time to have another look at
> mod_rewrite, especially regarding redirection, alternative pages etc,
> but I still tend to glaze over about halfway through all that section.

Yeah, I agree, the mod_rewrite documentation is overwhelming, and that
turns a lot of people off. The examples I gave you should allow you to
look up each piece of the directive at a time, and once you do that,
it'll all make sense.

> And unless I've completely missed it, your examples don't address my
> question, being how to AND two or more conditions in a particular test?
>
> If I really can't do this with mod_setenvif I'll have to take that time.

You can't do it with mod_setenvif. You can do it with mod_rewrite,
because all mod_rewrite rules default to an operator type of "AND". The
[OR] you see in my rules is an explicit override for obvious reasons.
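
A minimal sketch of the AND case from the original question, assuming
the rules live in the server or virtual-host configuration rather than
in .htaccess. The variable name go_away comes from the question; the
rule set itself is illustrative only and has not been tested against
Apache 1.3.27. Neither RewriteCond carries an [OR] flag, so both must
match before the RewriteRule applies:

```apache
RewriteEngine on
# Consecutive RewriteCond lines without [OR] are ANDed together, so
# the rule below fires only when BOTH request headers are empty.
RewriteCond %{HTTP_REFERER}    ^$
RewriteCond %{HTTP_USER_AGENT} ^$
# "-" leaves the URL untouched; the E flag only sets the variable.
RewriteRule ^.* - [E=go_away:1]
```

Swapping [E=go_away:1] for [F,L] would refuse the request outright, as
the longer rule set quoted earlier does.
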
Open the Apache 1.3 mod_rewrite docs and search for "implicit AND".
It'll all make sense then. :-)

I hope some of what I've said above gives you something to think about.
Hosting environments are a real pain in the ass; when it's "just you and
your own personal box" it's easy, but when it's larger scale and
involves users (customers or friends, doesn't matter), it's a totally
different game.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |