Working with LWP

This month Mr. Lerner takes a look at the library for web programming and its associated modules.

Examining the response

After typing return a second time, you should see the contents of http://www.lerner.co.il/ returned to you. Once the document has been transferred to your terminal, the connection is terminated. If you want to connect to the same server again, you may do so. However, you will have to open a new connection and issue a new request.

Just as the client can send request headers before the request itself, the server can send response headers before the response. As in the case with request headers, there must be a blank line between the response headers and the body of the response.

Here are the headers I received after issuing the above GET request:

HTTP/1.1 200 OK
Date: Thu, 12 Aug 1999 19:36:44 GMT
Server: Apache/1.3.6 (UNIX) PHP/3.0.11 FrontPage/3.0.4.2 Rewrit/1.0a
Connection: close
Content-Type: text/html

The above lines are typical for a response.
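To make the blank-line rule concrete, here is a minimal sketch in Perl that splits a raw response into its headers and its body. The subroutine name split_response and the one-line HTML body are my own inventions for illustration; only the header lines come from the response shown above.

```perl
#!/usr/bin/perl -w
use strict;

# Split a raw HTTP response into its header block and its body.
# The two are separated by the first blank line (CRLF CRLF on the
# wire; a bare LF LF is tolerated here as well).
sub split_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;
    return ($head, $body);
}

# Sample data: headers from the response above, plus a made-up body.
my $raw = join "\r\n",
    'HTTP/1.1 200 OK',
    'Date: Thu, 12 Aug 1999 19:36:44 GMT',
    'Connection: close',
    'Content-Type: text/html',
    '',
    '<html><body>Hello</body></html>';

my ($head, $body) = split_response($raw);
my @header_lines  = split /\r?\n/, $head;

print "Header lines: ", scalar(@header_lines), "\n";
print "Body: $body\n";
```

Because the split is limited to two pieces, any blank lines inside the body itself are left alone.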

The first line provides general information about the response, including an indication of what is yet to come. First, the server tells us it is capable of handling anything up to HTTP/1.1. If we ever want to send a request using HTTP/1.1, this server will allow it. After the HTTP version number comes a response code. This code can indicate a variety of possibilities, including whether everything went just fine (200), the file has moved permanently (301), the file was not found (404) or there was an error on the server side (500).

The numeric code is typically followed by a text message, which gives an indication of the meaning behind the numbers. Apache and other servers might allow us to customize the page displayed when an error occurs, but that customization does not extend to the response code itself, which is standard and fixed.
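As a rough illustration of how a client might take the first line apart, the following sketch pulls out the version, the numeric code and the text message, then classifies the code by its first digit. The subroutine names parse_status_line and status_class are hypothetical, not part of LWP.

```perl
#!/usr/bin/perl -w
use strict;

# Break a status line into HTTP version, numeric code and text message.
# Returns an empty list if the line does not look like a status line.
sub parse_status_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return unless $line =~ m{^HTTP/(\d+\.\d+) (\d{3})(?: (.*))?$};
    my $reason = defined $3 ? $3 : '';
    return ($1, $2, $reason);
}

# The first digit of the code gives the general class of response.
my %class = (
    1 => 'informational',
    2 => 'success',
    3 => 'redirection',
    4 => 'client error',
    5 => 'server error',
);

sub status_class {
    my ($code) = @_;
    return $class{ substr($code, 0, 1) } || 'unknown';
}

for my $line ('HTTP/1.1 200 OK',
              'HTTP/1.1 301 Moved Permanently',
              'HTTP/1.1 404 Not Found',
              'HTTP/1.1 500 Internal Server Error') {
    my ($version, $code, $reason) = parse_status_line($line);
    next unless defined $code;
    printf "%s %-22s => %s\n", $code, $reason, status_class($code);
}
```

A program should normally make decisions based on the numeric code rather than the text message, since the wording of the message can vary between servers.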

Following the status line comes the date on which the response was generated. This header is useful for proxies and caches, which can then store the date of a document along with its contents. The next time your browser tries to retrieve the file, it can use the stored date to ask whether the server's version is newer, retrieving the new version only if it is.

The server identifies itself in the Server: header. In this particular case, the server tells us not only that it is Apache 1.3.6 running under a form of UNIX (in this case, Linux), but also some modules that have been installed. My web-space provider has chosen to install PHP, FrontPage and Rewrit; as we have seen in previous months, mod_perl is another popular module for server-side programming, and one which advertises itself in this header.

As we have seen, an HTTP connection terminates after the server finishes sending its response. This can be extremely inefficient; consider a page of HTML that contains five IMG tags, indicating where images should be loaded. In order to download this page in its entirety, a web browser has to create six separate HTTP connections: one for the HTML and one for each of the images. To overcome this inefficiency, HTTP/1.1 allows for “persistent connections”, meaning that more than one document can be retrieved over a single HTTP connection. This is signaled with the Connection header; in the example above, the server indicated that it would close the connection after a single transaction.
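The decision a client has to make here can be sketched in a few lines of Perl. The subroutine name is_persistent is my own. The rule it encodes is that HTTP/1.1 connections stay open unless one side sends "Connection: close", while HTTP/1.0 connections close unless both sides agree on the informal "Connection: Keep-Alive" extension.

```perl
#!/usr/bin/perl -w
use strict;

# Decide whether the connection stays open after this response, given
# the HTTP version from the status line and the value of the Connection
# header (undef if the header was not sent).
sub is_persistent {
    my ($version, $connection) = @_;
    $connection = '' unless defined $connection;
    $connection =~ s/^\s+|\s+$//g;

    # The header may carry several comma-separated tokens.
    my %token = map { (lc($_) => 1) } split /\s*,\s*/, $connection;

    return 0 if $token{'close'};
    return 1 if $token{'keep-alive'};
    return $version >= 1.1 ? 1 : 0;    # default depends on the version
}

for my $case (['1.1', 'close'], ['1.1', undef],
              ['1.0', undef],   ['1.0', 'Keep-Alive']) {
    my ($version, $connection) = @$case;
    printf "HTTP/%s, Connection: %-10s => %s\n",
        $version,
        defined $connection ? $connection : '(none)',
        is_persistent($version, $connection) ? 'stays open' : 'closes';
}
```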

The final header in the above output is Content-type, well-known to CGI programmers. This header uses a MIME-style description to tell the browser what kind of content to expect. Should it expect HTML-formatted text (text/html)? Or a JPEG image (image/jpeg)? Or something that cannot be identified, which should be treated as binary data (application/octet-stream)? Without such a header, your browser will not know how to treat the data it receives, which is why servers often produce error messages when Content-type is missing.
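Here is a small sketch of how a client might interpret this header. The names parse_content_type and handling_for are hypothetical, and falling back to application/octet-stream when the header is missing is an assumption of this example rather than something every browser does.

```perl
#!/usr/bin/perl -w
use strict;

# Split a Content-Type value into the media type and any parameters,
# such as charset. A missing header is treated as unidentified binary
# data (an assumption made for this sketch).
sub parse_content_type {
    my ($value) = @_;
    return ('application/octet-stream', {})
        unless defined $value && $value =~ /\S/;

    $value =~ s/^\s+|\s+$//g;
    my ($type, @params) = split /\s*;\s*/, $value;

    my %param;
    for my $p (@params) {
        my ($name, $val) = split /\s*=\s*/, $p, 2;
        next unless defined $val;
        $val =~ s/^"(.*)"$/$1/;          # strip optional quotes
        $param{ lc($name) } = $val;
    }
    return (lc $type, \%param);
}

# Pick a way of handling the body based on the media type.
sub handling_for {
    my ($type) = @_;
    return 'render as HTML'   if $type eq 'text/html';
    return 'show as text'     if $type =~ m{^text/};
    return 'display as image' if $type =~ m{^image/};
    return 'treat as binary data';
}

for my $header ('text/html', 'image/jpeg',
                'text/html; charset=ISO-8859-1', undef) {
    my ($type, $param) = parse_content_type($header);
    print "$type: ", handling_for($type), "\n";
}
```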

HTTP/1.0 supports several methods, but the main ones are GET, HEAD and POST. GET, as its name implies, allows us to retrieve the contents of a link. This is the most common method, and is behind most of the simple retrievals your web browser performs. HEAD is the same as GET, except that the server stops after sending the response headers. Sending a request of

HEAD / HTTP/1.0

is a good way to test your web server and see if it is running correctly.
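The same test can be run from a program instead of by hand. The sketch below uses the standard IO::Socket::INET module to send a HEAD request and print the headers that come back. The subroutine names are my own, port 80 is assumed, and the program contacts a server only if you give it a host name on the command line.

```perl
#!/usr/bin/perl -w
use strict;
use IO::Socket::INET;

# Build the text of a HEAD request. The empty line ends the request.
sub head_request {
    my ($path) = @_;
    return "HEAD $path HTTP/1.0\r\n\r\n";
}

# Send the request to $host and return the response headers as a list
# of lines, stopping at the blank line that ends them.
sub fetch_headers {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 10,
    ) or die "Cannot connect to $host: $!\n";

    print {$sock} head_request($path);

    my @lines;
    while (my $line = <$sock>) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';
        push @lines, $line;
    }
    close $sock;
    return @lines;
}

# Only touch the network when a host name is given on the command line.
if (@ARGV) {
    print "$_\n" for fetch_headers($ARGV[0], '/');
}
```

Servers that host several sites on one address may need a Host: header to know which site is meant, so a bare HTTP/1.0 request like this one will not always reach the site you expect.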

POST not only names a path on the server's computer, but also sends input in name-value pairs. (GET can also submit information in name-value pairs, but it is considered less desirable in most situations.) POST is usually invoked when a user clicks on the “submit” button in an HTML form.
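To show what those name-value pairs look like on the wire, here is a sketch that encodes them the way an HTML form does and builds the text of a POST request. The subroutine names, the path /cgi-bin/register.pl and the form fields are all invented for illustration, and the encoding assumes plain byte strings.

```perl
#!/usr/bin/perl -w
use strict;

# Encode one name or value: anything other than letters, digits and a
# few punctuation marks becomes %XX, and spaces become plus signs.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/ge;
    $text =~ s/%20/+/g;
    return $text;
}

# Join name, value, name, value, ... into a single form body.
sub form_body {
    my @pairs = @_;
    my @parts;
    while (my ($name, $value) = splice @pairs, 0, 2) {
        push @parts, url_encode($name) . '=' . url_encode($value);
    }
    return join '&', @parts;
}

# Build a complete POST request. The body follows the blank line, and
# Content-Length tells the server how many bytes of body to read.
sub post_request {
    my ($path, @pairs) = @_;
    my $body = form_body(@pairs);
    return join "\r\n",
        "POST $path HTTP/1.0",
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . length($body),
        '',
        $body;
}

print post_request('/cgi-bin/register.pl',
                   name => 'Reuven Lerner',
                   lang => 'Perl & C'), "\n";
```

With GET, the same encoded string would instead be appended to the path after a question mark, which is why it ends up visible in the URL.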

LWP::Simple

Now that we have an understanding of the basics behind HTTP, let's see how we can handle requests and responses using Perl. Luckily, LWP contains objects for nearly everything we might want to do, with code tested by many people.

If we simply want to retrieve a document using HTTP, we can do so with the LWP::Simple module. Here, for instance, is a simple Perl program that retrieves the root document from my web site:

#!/usr/bin/perl -w
use strict;
use diagnostics;
use LWP::Simple;
# Get the contents
my $content = get "http://www.lerner.co.il/";
# Print the contents
print $content, "\n";

In this particular case, the startup and diagnostics code is longer than the program. Importing LWP::Simple into our program automatically brings the get function with it, which takes a URL, retrieves its contents with GET, and returns the body of the response. In this example, we print that output to the screen.
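One detail worth guarding against is that get returns undef when the document cannot be retrieved. The following sketch wraps get in a subroutine, which I have called fetch, that dies with a message in that case. It retrieves something only when a URL is given on the command line.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Return the body of the document at $url, or die with a message if
# get came back empty-handed (it returns undef on failure).
sub fetch {
    my ($url) = @_;
    my $content = get($url);
    die "Could not retrieve $url\n" unless defined $content;
    return $content;
}

# Only retrieve something when a URL is given on the command line.
if (@ARGV) {
    my $content = fetch($ARGV[0]);
    print length($content), " bytes retrieved\n";
}
```

Note that get does not say why a retrieval failed. If the status code matters, LWP::Simple's getstore and head functions, or the fuller LWP::UserAgent interface, are better starting points.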

Once the document's contents are stored in $content, we can treat it as a normal Perl scalar, albeit one containing a fair amount of text. At this point, we could search for interesting text, perform search-and-replace operations on $content, remove any parts we find offensive, or even translate parts into Pig Latin. As an example, the following variation of this simple program turns the contents around, reversing the order of the lines, so that the final line becomes the first and vice versa, and reversing the characters on every line, so that the final character becomes the first and vice versa:

#!/usr/bin/perl -w
use strict;
use diagnostics;
use LWP::Simple;
# Get the contents
my $content = get "http://www.lerner.co.il/";
# Print the contents, reversed; the parentheses keep the trailing
# newline out of the string being reversed
print scalar reverse($content), "\n";

Note how we must put reverse in scalar context in order for it to do its job. Since print takes a list of arguments, we force scalar context with the scalar keyword.
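The difference between the two contexts is easy to see with a short string. In this sketch, the two-line sample text and the subroutine names reverse_chars and reverse_lines are my own; the second subroutine shows one way to reverse only the order of the lines while leaving each line readable.

```perl
#!/usr/bin/perl -w
use strict;

# Scalar context: reverse returns the string with its characters in
# the opposite order.
sub reverse_chars {
    my ($text) = @_;
    return scalar reverse $text;
}

# List context: reverse returns its arguments in the opposite order.
# Splitting on /^/ yields the individual lines, newlines included, so
# this reverses the order of the lines but not their contents.
sub reverse_lines {
    my ($text) = @_;
    return join '', reverse split /^/, $text;
}

my $content = "first line\nsecond line\n";

# In list context a single scalar is a one-element list, so reversing
# it changes nothing.
my @unchanged = reverse $content;
print "List context:   $unchanged[0]";

print "Lines reversed: ", reverse_lines($content);
print "Chars reversed: ", reverse_chars($content), "\n";
```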
