I Didn't Touch this Perl Script, but It Broke

It worked in 1997, but it's not good enough for 2002. The answer to a reader's question about a script from an old issue of Linux Journal.

You're trying to use a Perl script from
an old issue of Linux Journal, and it isn't
working out. Well, this is as good an opportunity as any to see how
things you don't touch still can break.Here's the original script,
which comes from an article in issue 40 titled "A Web Crawler in
Perl". Don't forget that code from old issues of Linux
Journal
also is available from the
SSC FTP
site
. The listings from issue 40 are
in their own
directory
, and you can save time by getting the code from
there, instead of removing the PHP-generated trimmings from the
HTML version.The script has two other problems. The first problem is the
script will exit silently if the first argument--$ARGV[1]--is
anything other than a correctly formatted URL. For people used to
telling their browsers to go to linuxjournal.com instead of
http://linuxjournal.com/, this could be puzzling, especially if you
give the wrong number of arguments. If you do, the script will be
helpful and prompt you with a usage message, the second
problem.Offering help if the user makes one mistake and silently
exiting if the user makes a different mistake are not good ideas in
production code. But this is an example from the back of a
magazine, and nobody actually runs back-of-the-magazine code
without putting in a bunch of sanity checks that would be boring in
a magazine but booty-saving in real use. Right?Now for the interesting problem. Look at this line:print S "GET /$document
HTTP/1.0\n\n";
We're sending a one-line HTTP 1.0 request. How well does this
work? It will do fine if there's only one web site on the server.
But today, many sites merge a lot of virtual hosts into one IP
address. As an example, let's try a one-line GET request like this
on the server
dmarti.weblogs.com,
which is one virtual host on a site that hosts many:

$ telnet dmarti.livejournal.com 80
Trying 66.150.15.150...
Connected to livejournal.com.
Escape character is '^]'.
GET /Connection closed by foreign host.
dmarti@zingiber:~$ GET / HTTP/1.0
<HTML>
<HEAD>
<TITLE>Directory /</TITLE>
<BASE HREF="file:/">
</HEAD>
<BODY>
<H1>Directory listing of /</H1>

And so on. Oops. Looks like we need a Host: header, which
came in officially with HTTP 1.1 but will work in HTTP 1.0
requests. Change that GET line above to:

print S "GET /$document HTTP/1.0\n";
print S "Host: $server_host\n\n";

And voilà! It works. The Host:
header tells the server from which virtual host to get the
page.The Perl script broke for some sites because people's
assumptions about the web changed, and the HTTP protocol was
updated to reflect that. Through no fault of your own, you had to
go back and change it. That's life on the Internet.Don Marti is editor in chief
of Linux Journal.

email: dmarti@ssc.com

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState