
Catching Spiders with Bottrap

The author introduces Bottrap, a mod_perl module designed to cut off crawlers and bots.

Spiders of all kinds often crawl my web sites, and not all of them obey the Robots Exclusion Protocol. Some of them are poorly implemented and are simply ignorant of the robots.txt file. Others are inherently evil and look at robots.txt to find places to crawl and look for valuable information. Often they are crawling for e-mail addresses to which they can send spam. After reading an interesting article at evolt.org about a CGI to keep out bad spiders, I got the idea to write a mod_perl module to do this job. The following is a description of that module, Bottrap.

Bottrap allows you to set up a honeypot directory on your web server. This is a directory that doesn't contain any web pages and that no valued user would need to access. If a client requests a page from that directory, access is denied. Moreover, that client is forbidden from retrieving other documents on your web server for a period of time. This block should be enough to make it difficult, if not impossible, to crawl your web site looking for e-mail addresses to harvest or other valuable information.

Since Bottrap is a mod_perl module, it requires that you have mod_perl built into your Apache installation. You include it in your web server as you would any other Perl module, with something like:

PerlModule Bottrap

in your httpd.conf or .htaccess file. Also, you need to specify a honeypot directory, set a timeout to keep clients banned (optional) and set Bottrap to be your PerlAccessHandler.

<Location />
   PerlSetVar BotTrapDir /bottrap
   PerlSetVar BotTrapTimeout 600
   PerlAccessHandler Bottrap
</Location>
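
The behavior those settings control can be sketched in a few lines. The following is a rough illustration in Python, not the Bottrap source: the class and method names are hypothetical, the timeout is assumed to be in seconds, and the ban list is assumed to live in memory (consistent with a server restart clearing it). The client key is the IP address and user-agent pair that Bottrap uses to identify a client.

```python
import time


class BanList:
    """Hypothetical in-memory ban list keyed by (IP address, user-agent)."""

    def __init__(self, trap_dir="/bottrap", timeout=600, clock=time.time):
        self.trap_dir = trap_dir.rstrip("/")
        self.timeout = timeout  # assumed to be seconds
        self.clock = clock      # injectable so the logic can be tested
        self.banned = {}        # (ip, user_agent) -> time the ban started

    def in_trap(self, path):
        # True for the honeypot directory itself and anything below it.
        return path == self.trap_dir or path.startswith(self.trap_dir + "/")

    def allow(self, ip, user_agent, path):
        """Return True if the request may proceed, False if it is denied."""
        key = (ip, user_agent)
        now = self.clock()

        # A request into the honeypot is denied and starts (or renews) a ban.
        if self.in_trap(path):
            self.banned[key] = now
            return False

        banned_at = self.banned.get(key)
        if banned_at is None:
            return True

        # Once the timeout has passed, forget the client and let it back in.
        if now - banned_at >= self.timeout:
            del self.banned[key]
            return True
        return False
```

In a real access handler, a False result would translate into a "forbidden" response for that request. Because the key includes the user-agent, a second client behind the same proxy address is unaffected by the first client's ban.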

Reload your web server and Bottrap should start working. To test it out, you can try to access a page in your honeypot with a browser, then try to access another part of your site. If you are denied access to both pages, it is working. In a few minutes, you should be allowed back in. (You can restart your web server to clear the banlist.)

Bottrap identifies a single client by a combination of IP address and user-agent id. This should reduce the chance that an entire proxy or cache machine gets banned, which would keep out innocent users. Once spiders start changing their user-agent with each request, this method will have to be revisited.

To draw spiders into the honeypot, you can list the honeypot in your robots.txt file as a place not to go, making sure the exceptionally bad spiders go there to harvest e-mail addresses. Here is what a sample robots.txt may look like:

User-agent: *
Disallow: /bottrap
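
To see why only misbehaving spiders should end up in the honeypot, here is a small Python sketch of how a compliant crawler would read that file, using the standard library's urllib.robotparser. The bot name ExampleBot and the example.com URLs are placeholders, not anything from Bottrap.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bottrap
"""


def may_fetch(url, user_agent="ExampleBot"):
    """Return True if a well-behaved crawler would consider url allowed."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("http://example.com/index.html"))
print(may_fetch("http://example.com/bottrap/index.html"))
```

A crawler that performs this check never requests anything under /bottrap, so any client that shows up there has either ignored robots.txt or used it as a map.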

You can also make an invisible link to it in your pages, like this:

<a href="/bottrap/index.html"></a>

Notice that there is no place to click on that link. Someone using a normal browser will never find that link, but a spider might follow it and get caught in the honeypot, even though robots.txt tells it to keep out.

The code for Bottrap is available for download and is also shown in Listing 1. While it should work in a wide variety of environments, since it deals with restricting access to your site, I recommend that you test your installation before relying on it in important production environments.

Listing 1. Bottrap mod_perl

I encourage you to let me know if you use the module. I intend to continue development on it if there is interest, so feature suggestions and bug reports are welcome. Some possible improvements include returning a more informative page to deny access, logging any banned clients and other means of notifying the administrator.

email: amoore@gotany.com

______________________

Comments

A similar implementation in PHP

Daniel M. Webb

I have written something similar in PHP. The main differences are that the ban is permanent (using .htaccess) but has a form for unbanning in the case of innocent humans. It's available at http://danielwebb.us/software/bot-trap/.
Your implementation would probably be easier to harden against a DoS attack by an angry spammer with access to a large zombie network.

Re: Catching Spiders with Bottrap

Anonymous

I ran a similar program, called sugarplum. One industrious client got very excited, and crawled through the fake pages at top speed, filling the /var partition with the log file and its dbm database used for tracking clients...
