I Didn't Touch this Perl Script, but It Broke

It worked in 1997, but it's not good enough for 2002. The answer to a reader's question about a script from an old issue of Linux Journal.

You're trying to use a Perl script from an old issue of Linux
Journal, and it isn't working out. Well, this is as good an
opportunity as any to see how things you don't touch still can
break.

Here's the original script, which comes from an article in issue 40
titled "A Web Crawler in Perl". Don't forget that code from old
issues of Linux Journal also is available from the SSC FTP site. The
listings from issue 40 are in their own directory, and you can save
time by getting the code from there, instead of removing the
PHP-generated trimmings from the HTML version.

The script has two other problems. The first problem is the
script will exit silently if the first argument--$ARGV[1]--is
anything other than a correctly formatted URL. For people used to
telling their browsers to go to linuxjournal.com instead of
http://linuxjournal.com/, this could be puzzling, especially because
the script does speak up when you give the wrong number of
arguments: in that case it is helpful and prompts you with a usage
message. That inconsistency is the second problem.

Offering help if the user makes one mistake and silently exiting if
the user makes a different mistake are not good ideas in production
code. But this is an example from the back of a magazine, and nobody
actually runs back-of-the-magazine code without putting in a bunch
of sanity checks that would be boring in a magazine but booty-saving
in real use. Right?

Now for the interesting problem. Look at this line:

print S "GET /$document HTTP/1.0\n\n";

We're sending a one-line HTTP 1.0 request. How well does this
work? It will do fine if there's only one web site on the server.
But today, many sites merge a lot of virtual hosts into one IP
address. As an example, let's try a one-line GET request like this
on the server
dmarti.weblogs.com,
which is one virtual host on a site that hosts many:

$ telnet dmarti.livejournal.com 80
Trying 66.150.15.150...
Connected to livejournal.com.
Escape character is '^]'.
GET /Connection closed by foreign host.
dmarti@zingiber:~$ GET / HTTP/1.0
<HTML>
<HEAD>
<TITLE>Directory /</TITLE>
<BASE HREF="file:/">
</HEAD>
<BODY>
<H1>Directory listing of /</H1>

And so on. Oops. Looks like we need a Host: header, which
came in officially with HTTP 1.1 but will work in HTTP 1.0
requests. Change that GET line above to:

print S "GET /$document HTTP/1.0\n";
print S "Host: $server_host\n\n";
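
For readers who would rather try the two-line request outside the
crawler, here is a minimal sketch. The build_request helper is
hypothetical and not part of the original script. It assumes plain
http:// URLs, and it ends lines with CRLF, which is what HTTP
specifies; the lines above use a bare \n, which many servers
tolerate. Unlike the original, it complains about a malformed URL
instead of exiting silently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original crawler): turn a URL into
# an HTTP/1.0 request string that carries a Host: header.
sub build_request {
    my ($url) = @_;

    # Fail loudly on anything that is not a plain http:// URL,
    # rather than exiting without a word.
    my ($host, $port, $path) =
        $url =~ m{^http://([^/:\s]+)(?::(\d+))?(/\S*)?$}i
        or die "Not a usable URL (expected http://host/path): $url\n";

    # A URL with no path at all means the top-level document.
    $path = '/' unless defined $path;

    # Mention the port in the Host: header only if the URL gave one.
    my $host_header = defined $port ? "${host}:${port}" : $host;

    # CR LF written as octal escapes so it is the same on every platform.
    my $crlf = "\015\012";

    return "GET $path HTTP/1.0$crlf"
         . "Host: $host_header$crlf"
         . $crlf;
}

# Show the request for the URL on the command line, or for an
# example URL if none was given. Nothing is sent over the network.
print build_request(@ARGV ? $ARGV[0] : 'http://linuxjournal.com/');
```

Run with linuxjournal.com as the argument instead of
http://linuxjournal.com/, it dies with a message saying what it
expected, which is the kind of sanity check the magazine listing
left out.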

And voilà! It works. The Host: header tells the server from which
virtual host to get the page.

The Perl script broke for some sites because people's assumptions
about the web changed, and the HTTP protocol was updated to reflect
that. Through no fault of your own, you had to go back and change
it. That's life on the Internet.

Don Marti is editor in chief of Linux Journal.

email: dmarti@ssc.com
