Catching Spiders with Bottrap

March 4th, 2002 by Andrew Moore

The author introduces Bottrap, a mod_perl module designed to cut off crawlers and bots.

Spiders of all kinds often crawl my web sites, and not all of them obey the Robots Exclusion Protocol. Some of them are poorly implemented and are simply ignorant of the robots.txt file. Others are inherently evil and look at robots.txt to find places to crawl and look for valuable information. Often they are crawling for e-mail addresses to which they can send spam. After reading an interesting article at evolt.org about a CGI to keep out bad spiders, I got the idea to write a mod_perl module to do this job. The following is a description of that module, Bottrap.

Bottrap allows you to set up a honeypot directory in your web server. This is a directory that doesn't contain any web pages and that no valued user would need to access. If a client requests a page from that directory, access is denied. Moreover, that client is forbidden from retrieving other documents on your web server for a period of time. This block should be enough to make it difficult, if not impossible, to crawl your web site looking for e-mail addresses to harvest or other valuable information.

Since Bottrap is a mod_perl module, it requires that you have mod_perl built into your Apache installation. You load it into your web server as you would any other Perl module, with something like:

PerlModule Bottrap

in your httpd.conf or .htaccess file. You also need to specify a honeypot directory, optionally set a timeout that controls how long clients stay banned, and make Bottrap your PerlAccessHandler:

<Location />
   PerlSetVar BotTrapDir /bottrap
   PerlSetVar BotTrapTimeout 600
   PerlAccessHandler Bottrap
</Location>

Reload your web server and Bottrap should start working. To test it, request a page in your honeypot with a browser, then try to access another part of your site. If you are denied access to both pages, it is working. In a few minutes, you should be allowed back in. (You can also restart your web server to clear the ban list.) Bottrap identifies a single client by the combination of its IP address and user-agent string. This should reduce the chance that an entire proxy or cache machine is banned, which would keep out innocent users. If spiders start changing their user-agent with each request, this method will have to be revisited.
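The banning behavior described above can be sketched in a few lines of Python. This is not Bottrap's actual code (the module is written in Perl); it is a minimal illustration that assumes an in-memory table and a timeout measured in seconds. The class name BanList and its allow method are hypothetical names chosen for this example.

```python
import time


class BanList:
    """Hypothetical in-memory ban list keyed by (IP address, user-agent)."""

    def __init__(self, trap_dir="/bottrap", timeout=600):
        self.trap_dir = trap_dir.rstrip("/")
        self.timeout = timeout   # assumption: the timeout is in seconds
        self.banned = {}         # (ip, user_agent) -> time the ban started

    def allow(self, ip, user_agent, path, now=None):
        """Return True if the request may proceed, False if it is forbidden."""
        now = time.time() if now is None else now
        key = (ip, user_agent)

        # Any request for the honeypot is refused and (re)starts the ban.
        if path == self.trap_dir or path.startswith(self.trap_dir + "/"):
            self.banned[key] = now
            return False

        banned_at = self.banned.get(key)
        if banned_at is None:
            return True          # client has never touched the honeypot
        if now - banned_at >= self.timeout:
            del self.banned[key] # ban has expired, forget the client
            return True
        return False             # still inside the ban period
```

Because the key is the pair of address and user-agent, a second client behind the same proxy address but with a different user-agent string is not affected by the ban. The table lives in memory, which is consistent with a restart clearing it.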

To draw spiders into the honeypot, you can list it in your robots.txt file as a place not to go, which all but guarantees that the exceptionally bad spiders will go there to harvest e-mail addresses. Here is what a sample robots.txt may look like:

User-agent: *
Disallow: /bottrap
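For comparison, here is how a well-behaved crawler would read that file. The sketch uses Python's standard urllib.robotparser module; the helper name may_fetch and the user-agent ExampleBot are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bottrap
"""


def may_fetch(path, user_agent="ExampleBot"):
    """Return True if robots.txt permits user_agent to request path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

A polite spider that performs this check before each request never asks for anything under /bottrap, so it never trips the trap. Only clients that ignore the file, or mine it for targets, end up banned.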

You can also make an invisible link to it in your pages, like this:

<a href="/bottrap/index.html"></a>

Notice that there is no place to click on that link. A normal browser will never find that link, but a spider might visit and get caught in the honeypot, even though robots.txt tells it to keep out.
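To see why the empty link still attracts spiders, consider how a simple crawler extracts links. The following Python sketch uses the standard html.parser module; LinkCollector and find_links are hypothetical names, and real spiders will differ in the details.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs the way a simple spider might."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_links(html):
    """Return every (href, anchor text) pair found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Fed a page that contains the invisible link, find_links returns its href paired with an empty string as the anchor text. A browser renders nothing for it, but a crawler working from the href list queues the honeypot URL like any other.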

The code for Bottrap is available for download and is also shown in Listing 1. While it should work in a wide variety of environments, since it deals with restricting access to your site, I recommend that you test your installation before relying on it in important production environments.

Listing 1. Bottrap mod_perl

I encourage you to let me know if you use the module. I intend to continue development on it if there is interest, so feature suggestions and bug reports are welcome. Some possible improvements include returning a more informative page to deny access, logging any banned clients and other means of notifying the administrator.

A similar implementation in PHP

On December 2nd, 2006 Daniel M. Webb (not verified) says:

I have written something similar in PHP. The main differences are that the ban is permanent (using .htaccess) but has a form for unbanning in the case of innocent humans. It's available at http://danielwebb.us/software/bot-trap/.
Your implementation would probably be easier to harden against a DoS attack by an angry spammer with access to a large zombie network.

Re: Catching Spiders with Bottrap

On March 6th, 2002 Anonymous says:

I ran a similar program, called sugarplum. One industrious client got very excited, and crawled through the fake pages at top speed, filling the /var partition with the log file and its dbm database used for tracking clients...
