I Didn't Touch this Perl Script, but It Broke

It worked in 1997, but it's not good enough for 2002. The answer to a reader's question about a script from an old issue of Linux Journal.

You're trying to use a Perl script from
an old issue of Linux Journal, and it isn't
working out. Well, this is as good an opportunity as any to see how
things you don't touch can still break.

Here's the original script,
which comes from an article in issue 40 titled "A Web Crawler in
Perl". Don't forget that code from old issues of Linux
Journal is also available from the
SSC FTP site. The listings from issue 40 are
in their own directory,
and you can save time by getting the code from
there, instead of removing the PHP-generated trimmings from the
HTML version.

The script has two other problems. The first is that it
exits silently if the URL argument--$ARGV[1]--is
anything other than a correctly formatted URL. For people used to
telling their browsers to go to linuxjournal.com instead of
http://linuxjournal.com/, this can be puzzling--especially because,
if you give the wrong number of arguments, the script does
helpfully prompt you with a usage message. That inconsistency is
the second problem.

Offering help if the user makes one mistake and silently
exiting if the user makes a different mistake are not good ideas in
production code. But this is an example from the back of a
magazine, and nobody actually runs back-of-the-magazine code
without putting in a bunch of sanity checks that would be boring in
a magazine but booty-saving in real use. Right?

Now for the interesting problem. Look at this line:

print S "GET /$document HTTP/1.0\n\n";
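
On the wire, that single print sends nothing but a request line and a blank line. Assuming the crawler is fetching a document named index.html (a made-up example name), the server sees:

```
GET /index.html HTTP/1.0

```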
We're sending a one-line HTTP 1.0 request. How well does this
work? It will do fine if there's only one web site on the server.
But today, many sites merge a lot of virtual hosts into one IP
address. As an example, let's try a one-line GET request like this
on the server
dmarti.livejournal.com,
which is one virtual host on a site that hosts many:

$ telnet dmarti.livejournal.com 80
Trying 66.150.15.150...
Connected to livejournal.com.
Escape character is '^]'.
GET /
Connection closed by foreign host.
dmarti@zingiber:~$ GET / HTTP/1.0
<HTML>
<HEAD>
<TITLE>Directory /</TITLE>
<BASE HREF="file:/">
</HEAD>
<BODY>
<H1>Directory listing of /</H1>

And so on. Oops. Looks like we need a Host: header, which
came in officially with HTTP 1.1 but will work in HTTP 1.0
requests. Change that GET line above to:

print S "GET /$document HTTP/1.0\n";
print S "Host: $server_host\n\n";
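
For comparison, here's a quick sketch of the same fixed request in a modern scripting language--Python, using only the standard socket module; the host and document names are just examples, not anything from the original script. Note that the sketch uses the \r\n line endings the HTTP spec calls for, though most servers also tolerate the bare \n the Perl example sends:

```python
import socket

def build_request(host, document):
    """Build an HTTP/1.0 request with a Host: header,
    mirroring the two Perl print statements above."""
    return (f"GET /{document} HTTP/1.0\r\n"
            f"Host: {host}\r\n\r\n").encode("ascii")

def fetch(host, document, port=80):
    # Connect, send the request, and read the whole response.
    with socket.create_connection((host, port)) as s:
        s.sendall(build_request(host, document))
        chunks = []
        while data := s.recv(4096):
            chunks.append(data)
    return b"".join(chunks)

# Example (requires network access):
# print(fetch("dmarti.livejournal.com", "index.html").decode("latin-1"))
```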

And voilà! It works. The Host:
header tells the server from which virtual host to get the
page.

The Perl script broke for some sites because people's
assumptions about the web changed, and the HTTP protocol was
updated to reflect that. Through no fault of your own, you had to
go back and change it. That's life on the Internet.

Don Marti is editor in chief
of Linux Journal.

email: dmarti@ssc.com
