Protect Your Blog Against Misbehaving Bots With Apache

I recently glanced over the Apache httpd logs for this blog. Among other things, I discovered several bots making quite a few useless requests and driving up load on the machine. It wasn’t a big deal, but it is a matter of principle: if all of us webmasters start paying attention to misbehaving bots and blocking them, their authors or maintainers might finally learn to play by the rules - it really isn’t difficult.

Let’s say you are running a blog and hosting your own feed. Regular crawler bots from search engines like Google, Yahoo, and MSN will always check robots.txt before sending the next batch of requests, and they won’t ask for the same URL very frequently.
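Because these crawlers honor robots.txt, you can use it to throttle them rather than block them. A minimal sketch - note that Crawl-delay is a non-standard extension that some crawlers (historically Yahoo’s and MSN’s, but not Google’s) honor:

# robots.txt - a sketch; Crawl-delay is non-standard
User-agent: *
# ask crawlers to wait at least 30 seconds between requests
Crawl-delay: 30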

The second type of bot you are going to deal with is the RSS reader bot. There is no reason for it to hit anything but the URL of your feed. Such bots usually will not check robots.txt. Their user agent string (which you can see if you enable CustomLog /path/to/log combined) will often contain “N subscribers,” which gives you a rough idea of how many people subscribe to your feed via that service. These bots may hit you with varying regularity, anywhere from half an hour to several hours between requests. Additionally, well-behaved RSS reader bots will always include If-Modified-Since in their requests - in your logs, the usual response to these queries should be 304 Not Modified, and only once, after a new post is published, should you see a 200 OK response.
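For reference, the combined log format is enabled with the standard Apache directives below (the log path is just an example):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/httpd/access_log combined

With that in place, a well-behaved reader bot’s entries would look something like this (IP, timestamps, and user agent are made up):

203.0.113.5 - - [27/Feb/2009:04:10:22 -0500] "GET /feed/ HTTP/1.1" 304 - "-" "SomeReader/1.0 (+http://reader.example.com; 12 subscribers)"
203.0.113.5 - - [27/Feb/2009:06:10:31 -0500] "GET /feed/ HTTP/1.1" 200 18230 "-" "SomeReader/1.0 (+http://reader.example.com; 12 subscribers)"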

So what did I see in my logs? First, I noticed a bot requesting / from 2 different IPs every hour, without including If-Modified-Since. Wasteful and negligent. If this bot does not know how to appreciate my server resources, I am sure it won’t miss my content - block!

<Location />
    Order allow,deny
    Allow from all
    # BadBot1 is a substring of the offending bot's User-Agent string;
    # mark matching requests with the DenyBot environment variable
    SetEnvIf User-Agent BadBot1 DenyBot=1
    # with Order allow,deny, the Deny directives are evaluated last and win
    Deny from env=DenyBot
</Location>
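You can verify the block without waiting for the bot to come back, by faking its user agent with curl (hostname made up):

$ curl -I -A "BadBot1" http://blog.example.com/
HTTP/1.1 403 Forbidden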

Then I noticed several RSS reader bots that were requesting /feed/ way too frequently (one every 10 minutes and another every 60 minutes, each on behalf of a single subscriber) and were very inconsistent with If-Modified-Since - I couldn’t detect any logic in their requests: sometimes I saw a 304 and sometimes a 200 response, even when there was no new content on my site.

I didn’t feel right blocking these altogether, so instead I opened a one-hour window during the night when such bots are allowed to fetch the feed’s content - at all other times, their requests are blocked.

RewriteEngine On
# the user agent matches either bad bot (RewriteCond patterns are
# unanchored, so a plain substring works - no need for .*BadBot.*)
RewriteCond %{HTTP_USER_AGENT} BadBot2 [OR]
RewriteCond %{HTTP_USER_AGENT} BadBot3
# ...and the request is for the feed...
RewriteCond %{REQUEST_URI} ^/feed/
# ...and the "window open" flag file does not exist
RewriteCond /etc/allow_bad_bots !-f
RewriteRule . - [forbidden]
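Again, curl makes it easy to check the rule from the command line; outside the window you should get a 403 back (hostname made up):

$ curl -s -o /dev/null -w '%{http_code}\n' -A "BadBot2" http://blog.example.com/feed/
403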

Voila! If the file /etc/allow_bad_bots exists (I create and delete it from cron; it exists on my system between 1am and 2am), requests from these bots will succeed. During the rest of the day, these rude bots get a 403 Forbidden.
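The cron side is just a pair of one-liners - something along these lines in root’s crontab (the path matches the RewriteCond above):

# open the window at 1am, close it at 2am
0 1 * * * touch /etc/allow_bad_bots
0 2 * * * rm -f /etc/allow_bad_bots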

Categories: linux

Comments (1)

Derek // 27 Feb 2009

Thanks so much for this post, this helped me out.