Catching Spiders with Bottrap
March 4th, 2002 by Andrew Moore
Spiders of all kinds often crawl my web sites, and not all of them obey the Robots Exclusion Protocol. Some of them are poorly implemented and are simply ignorant of the robots.txt file. Others are inherently evil and look at robots.txt to find places to crawl and look for valuable information. Often they are crawling for e-mail addresses to which they can send spam. After reading an interesting article at evolt.org about a CGI to keep out bad spiders, I got the idea to write a mod_perl module to do this job. The following is a description of that module, Bottrap.
Bottrap allows you to set up a honeypot directory in your web server. This is a directory that doesn't contain any web pages and that no valued user would need to access. If a client requests a page from that directory, access is denied. Moreover, that client is forbidden from retrieving other documents on your web server for a period of time. This block should be enough to make it difficult, if not impossible, to crawl your web site looking for e-mail addresses to harvest or other valuable information.
Since Bottrap is a mod_perl module, it requires that you have mod_perl built into your Apache installation. You include it in your web server as you would any other Perl module, with something like:
PerlModule Bottrap
in your httpd.conf or .htaccess file. You also need to specify a honeypot directory, optionally set a timeout that controls how long clients stay banned, and set Bottrap to be your PerlAccessHandler.
<Location>
    PerlSetVar BotTrapDir /bottrap
    PerlSetVar BotTrapTimeout 600
    PerlAccessHandler Bottrap
</Location>
Reload your web server and Bottrap should start working. To test it, request a page in your honeypot with a browser, then try to access another part of your site. If you are denied access to both pages, it is working. In a few minutes, you should be allowed back in. (You can also restart your web server to clear the ban list.)
Bottrap identifies a single client by the combination of its IP address and user-agent string. This should reduce the chance that an entire proxy or cache machine is banned, which would keep out innocent users. If bad spiders start changing their user-agents with each request, this method will have to be revisited.
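The behavior described above can be sketched in a few lines. The following is not Bottrap's code (which is a mod_perl module) but a minimal Python illustration of the same idea. The class name BanList, its method names, and the in-memory dictionary are all hypothetical, and it assumes the timeout is measured in seconds.

```python
import time


class BanList:
    """Hypothetical in-memory ban table illustrating the honeypot idea."""

    def __init__(self, trap_dir="/bottrap", timeout=600, clock=time.time):
        self.trap_dir = trap_dir.rstrip("/")
        self.timeout = timeout   # assumption: seconds
        self.clock = clock       # injectable so the logic can be tested
        self.banned = {}         # (ip, user_agent) -> time the ban started

    def allow(self, ip, user_agent, path):
        """Return True if the request may proceed, False if it is forbidden."""
        key = (ip, user_agent)
        now = self.clock()

        # Any request inside the honeypot is refused and bans the client.
        if path == self.trap_dir or path.startswith(self.trap_dir + "/"):
            self.banned[key] = now
            return False

        banned_at = self.banned.get(key)
        if banned_at is None:
            return True          # never seen in the honeypot
        if now - banned_at < self.timeout:
            return False         # still serving the ban

        del self.banned[key]     # ban expired; forget the client
        return True


if __name__ == "__main__":
    bans = BanList()
    print(bans.allow("192.0.2.7", "EvilBot/1.0", "/bottrap/index.html"))
    print(bans.allow("192.0.2.7", "EvilBot/1.0", "/index.html"))
    print(bans.allow("192.0.2.7", "Mozilla/5.0", "/index.html"))
```

Because the key is the (IP address, user-agent) pair, a second client behind the same proxy with a different user-agent is not affected by the ban. A table kept in process memory like this one is also emptied by a restart.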
To draw spiders into the honeypot, you can list the honeypot in your robots.txt file as a place not to go. Exceptionally bad spiders read that entry as an invitation and go there looking for e-mail addresses to harvest. Here is what a sample robots.txt may look like:
User-agent: *
Disallow: /bottrap
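For comparison, here is roughly how a well-behaved spider would interpret that file. This sketch uses the robots.txt parser from Python's standard library. The helper name may_fetch and the user-agent string ExampleBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bottrap
"""


def may_fetch(path, user_agent="ExampleBot"):
    """Return True if a polite crawler would be allowed to request path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/bottrap/index.html"):
        print(path, may_fetch(path))
```

A spider that performs this check never requests anything under /bottrap, so only clients that ignore or abuse robots.txt end up in the honeypot.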
You can also make an invisible link to it in your pages, like this:
<a href="/bottrap/index.html"></a>
Notice that there is no place to click on that link. A normal browser will never find that link, but a spider might visit and get caught in the honeypot, even though robots.txt tells it to keep out.
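To see why a spider finds a link that a person never will, consider how a simple crawler collects links: it reads the href attributes and does not care whether the anchor has any visible text. The sketch below uses Python's standard HTML parser. The names LinkCollector and find_links, and the sample page, are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_links(html):
    """Return every link in html, including ones with no clickable text."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = ('<p><a href="/about.html">About</a>'
            '<a href="/bottrap/index.html"></a></p>')
    for href, text in find_links(page):
        print(repr(href), repr(text))
```

The honeypot link comes back with an empty text string: there is nothing for a person to click, but a crawler that follows every href will request it anyway.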
The code for Bottrap is available for download and is also shown in Listing 1. While it should work in a wide variety of environments, since it deals with restricting access to your site, I recommend that you test your installation before relying on it in important production environments.
I encourage you to let me know if you use the module. I intend to continue development on it if there is interest, so feature suggestions and bug reports are welcome. Some possible improvements include returning a more informative page to deny access, logging any banned clients and other means of notifying the administrator.
A similar implementation in PHP
On December 2nd, 2006 Daniel M. Webb (not verified) says:
I have written something similar in PHP. The main differences are that the ban is permanent (using .htaccess) but has a form for unbanning in the case of innocent humans. It's available at http://danielwebb.us/software/bot-trap/.
Your implementation would probably be easier to harden against a DOS attack by an angry spammer who had access to a large zombie network.
Re: Catching Spiders with Bottrap
On March 6th, 2002 Anonymous says:
I ran a similar program, called sugarplum. One industrious client got very excited, and crawled through the fake pages at top speed, filling the /var partition with the log file and its dbm database used for tracking clients...