Catching Spiders with Bottrap
Spiders of all kinds often crawl my web sites, and not all of them obey the Robots Exclusion Protocol. Some are poorly implemented and simply ignorant of the robots.txt file. Others are inherently evil and read robots.txt precisely to find places to crawl for valuable information, often e-mail addresses to which they can send spam. After reading an interesting article at evolt.org about a CGI script to keep out bad spiders, I got the idea to write a mod_perl module to do this job. The following is a description of that module, Bottrap.
Bottrap allows you to set up a honeypot directory in your web server. This is a directory that doesn't contain any web pages and that no valued user would need to access. If a client requests a page from that directory, access is denied. Moreover, that client is forbidden from retrieving other documents on your web server for a period of time. This block should be enough to make it difficult, if not impossible, to crawl your web site looking for e-mail addresses to harvest or other valuable information.
Since Bottrap is a mod_perl module, it requires that you have mod_perl built into your Apache installation. You load it into your web server as you would any other Perl module, typically with a PerlModule Bottrap directive in your httpd.conf or .htaccess file.
You also need to specify a honeypot directory, optionally set a timeout that controls how long clients stay banned, and make Bottrap your PerlAccessHandler:
<Location />
    PerlSetVar BotTrapDir /bottrap
    PerlSetVar BotTrapTimeout 600
    PerlAccessHandler Bottrap
</Location>
Reload your web server and Bottrap should start working. To test it, request a page in your honeypot with a browser, then try to access another part of your site. If you are denied access to both pages, it is working. Once the timeout expires, you should be allowed back in. (You can also restart your web server to clear the ban list.)
Bottrap identifies a single client by the combination of its IP address and user-agent string. This should reduce the chance that an entire proxy or cache machine gets banned, which would keep out innocent users. If spiders start changing their user-agents with each request, this method will have to be revisited.
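The banning behavior described above can be sketched roughly as follows. This is an illustration in Python, not Bottrap's actual Perl code: the class name BanList, its method names, the assumption that the timeout is in seconds, and the choice to restart the ban on every honeypot request are all hypothetical.

```python
import time


class BanList:
    """Hypothetical in-memory ban list keyed by (IP address, user-agent)."""

    def __init__(self, trap_dir="/bottrap", timeout=600, clock=time.time):
        self.trap_dir = trap_dir.rstrip("/")
        self.timeout = timeout   # assumed to be seconds
        self.clock = clock       # injectable so expiry can be exercised
        self.banned = {}         # (ip, user_agent) -> time the ban started

    def allow(self, ip, user_agent, path):
        """Return True if the request may proceed, False if it is denied."""
        key = (ip, user_agent)
        now = self.clock()

        # Any request into the honeypot is denied and (re)starts the ban.
        if path == self.trap_dir or path.startswith(self.trap_dir + "/"):
            self.banned[key] = now
            return False

        banned_at = self.banned.get(key)
        if banned_at is None:
            return True              # never seen in the honeypot
        if now - banned_at < self.timeout:
            return False             # still inside the ban window
        del self.banned[key]         # ban expired; forget the client
        return True


fake_now = [0.0]
bans = BanList(clock=lambda: fake_now[0])
bans.allow("10.0.0.1", "EvilBot/1.0", "/bottrap/list.html")  # False: trapped
bans.allow("10.0.0.1", "EvilBot/1.0", "/index.html")         # False: banned
bans.allow("10.0.0.1", "Mozilla/5.0", "/index.html")         # True: other UA
```

Keying on the pair rather than on the IP address alone is what lets a second user behind the same proxy keep browsing while the spider is locked out.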
To draw spiders into the honeypot, you can list it in your robots.txt file as a place not to go. Well-behaved spiders will stay out, while the exceptionally bad ones will go there precisely because it is listed, hoping to harvest e-mail addresses. Here is what a sample robots.txt may look like:
User-agent: *
Disallow: /bottrap
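To see how a well-behaved spider would read those rules, here is a small Python sketch using the standard library's robots.txt parser. The helper name polite_spider_may_fetch, the agent name ExampleBot and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /bottrap
"""


def polite_spider_may_fetch(url, agent="ExampleBot"):
    """Return True if a robots.txt-respecting client may fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(polite_spider_may_fetch("http://www.example.com/index.html"))
print(polite_spider_may_fetch("http://www.example.com/bottrap/list.html"))
```

A client that honors the file never requests anything under /bottrap, so only spiders that ignore or abuse robots.txt end up banned.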
You can also put an invisible link to it in your pages, that is, an anchor that points into the honeypot but has no link text, for example <a href="/bottrap/"></a>.
Notice that there is no place to click on that link. A normal browser will never find that link, but a spider might visit and get caught in the honeypot, even though robots.txt tells it to keep out.
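The following Python sketch shows why such a link catches spiders: a naive crawler collects every href it sees, whether or not a browser would render anything clickable. The page fragment, the LinkCollector class and the extract_links helper are hypothetical, and the empty anchor is only an assumed form of the invisible link.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the second anchor has no text, so a browser
# shows nothing to click on.
PAGE = """
<p>Read the <a href="/articles/">articles</a>.</p>
<a href="/bottrap/"></a>
"""


class LinkCollector(HTMLParser):
    """Collect every href in a page, the way a naive spider would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all anchor targets found in html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(extract_links(PAGE))  # ['/articles/', '/bottrap/']
```

A spider that follows everything this returns walks straight into the honeypot, even if it never looked at robots.txt.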
The code for Bottrap is available for download and is also shown in Listing 1. It should work in a wide variety of environments, but because it restricts access to your site, I recommend that you test your installation before relying on it in important production environments.
I encourage you to let me know if you use the module. I intend to continue development on it if there is interest, so feature suggestions and bug reports are welcome. Some possible improvements include returning a more informative page to deny access, logging any banned clients and other means of notifying the administrator.