Working with LWP

This month Mr. Lerner takes a look at the library for web programming and its associated modules.

Examining the response

After pressing Return a second time, you should see the contents of http://www.lerner.co.il/ returned to you. Once the document has been transferred to your terminal, the connection is terminated. If you want to contact the same server again, you may do so. However, you will have to open a new connection and issue a new request.

Just as the client can send request headers before the request itself, the server can send response headers before the response. As in the case with request headers, there must be a blank line between the response headers and the body of the response.

Here are the headers I received after issuing the above GET request:

HTTP/1.1 200 OK
Date: Thu, 12 Aug 1999 19:36:44 GMT
Server: Apache/1.3.6 (UNIX) PHP/3.0.11 FrontPage/3.0.4.2 Rewrit/1.0a
Connection: close
Content-Type: text/html

The above lines are typical for a response.

The first line provides general information about the response, including an indication of what is yet to come. First, the server tells us it is capable of handling anything up to HTTP/1.1. If we ever want to send a request using HTTP/1.1, this server will allow it. After the HTTP version number comes a response code. This code can indicate a variety of possibilities, including whether everything went just fine (200), the file has moved permanently (301), the file was not found (404), or there was an error on the server side (500).

The numeric code is typically followed by a text message, which gives an indication of the meaning behind the number. Apache and other servers might allow us to customize the page displayed when an error occurs, but that customization does not extend to the response code itself, which is standard and fixed.
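To make the structure of the first line concrete, here is a minimal sketch in Perl that splits a status line into its version, numeric code and text message, and then classifies the code by its first digit. The function names parse_status_line and status_class are hypothetical, chosen for this illustration; they are not part of LWP.

```perl
#!/usr/bin/perl -w
use strict;

# Split a status line such as "HTTP/1.1 200 OK" into its parts.
# Returns an empty list if the line does not look like a status line.
sub parse_status_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # drop the line ending, if any
    return unless $line =~ m{^HTTP/(\d+\.\d+) (\d{3})(?: (.*))?$};
    return ($1, $2, defined $3 ? $3 : '');
}

# Classify a response code by its first digit.
sub status_class {
    my ($code) = @_;
    my %class = (
        1 => 'informational',
        2 => 'success',
        3 => 'redirection',
        4 => 'client error',
        5 => 'server error',
    );
    return $class{ substr($code, 0, 1) } || 'unknown';
}

my ($version, $code, $message) = parse_status_line("HTTP/1.1 200 OK\r\n");
print "version=$version code=$code ($message): ", status_class($code), "\n";
```

Only the numeric code is used for the classification; the text message is kept for display, since it is meant for human readers.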

Following the status line comes the Date: header, which records when the response was generated. This header is useful for proxies and caches, which can then store the date of a document along with its contents. The next time your browser tries to retrieve a file, it can compare the date stored from the previous response against the server's copy, retrieving the new version only if the server's version is newer.

The server identifies itself in the Server: header. In this particular case, the server tells us not only that it is Apache 1.3.6 running under a form of UNIX (in this case, Linux), but also some modules that have been installed. My web-space provider has chosen to install PHP, FrontPage and Rewrit; as we have seen in previous months, mod_perl is another popular module for server-side programming, and one which advertises itself in this header.

As we have seen, an HTTP connection terminates after the server finishes sending its response. This can be extremely inefficient; consider a page of HTML that contains five IMG tags, indicating where images should be loaded. In order to download this page in its entirety, a web browser has to create six separate HTTP connections: one for the HTML and one for each of the images. To overcome this inefficiency, HTTP/1.1 allows for “persistent connections”, meaning that more than one document can be retrieved over a single connection. This is signalled with the Connection header; in the example above, the server indicated that it would close the connection after a single transaction.

The final header in the above output is Content-type, well-known to CGI programmers. This header uses a MIME-style description to tell the browser what kind of content to expect. Should it expect HTML-formatted text (text/html)? Or a JPEG image (image/jpeg)? Or something that cannot be identified, which should be treated as binary data (application/octet-stream)? Without such a header, your browser will not know how to treat the data it receives, which is why servers often produce error messages when Content-type is missing.
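The headers discussed above can also be picked apart by a program. What follows is a minimal sketch, not a full HTTP parser: it assumes the whole response is already in a single scalar, ignores headers that continue over several lines, and keeps only the last value if a header is repeated. The function name split_response and the sample response are invented for this illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Split a raw HTTP response into its status line, a hash of headers
# and the body. Header names are lowercased, since they are
# case-insensitive.
sub split_response {
    my ($raw) = @_;

    # The first blank line separates the headers from the body.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;

    my ($status_line, @lines) = split /\r?\n/, $head;

    my %header;
    foreach my $line (@lines) {
        # Skip anything that does not look like "Name: value".
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*?)\s*$/ or next;
        $header{ lc $name } = $value;
    }
    return ($status_line, \%header, $body);
}

my $raw = join "\r\n",
    'HTTP/1.1 200 OK',
    'Date: Thu, 12 Aug 1999 19:36:44 GMT',
    'Connection: close',
    'Content-Type: text/html',
    '',
    '<html><body>Hello</body></html>';

my ($status_line, $header, $body) = split_response($raw);
print "Status: $status_line\n";
print "Type:   $header->{'content-type'}\n";
print "Body:   $body\n";
```

A program that receives such a response could then decide how to treat the body by looking at the content-type entry, much as a browser does.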

HTTP/1.0 supports several methods, the main ones being GET, HEAD and POST. GET, as its name implies, allows us to retrieve the contents of a link. This is the most common method, and is behind most of the simple retrievals your web browser performs. HEAD is the same as GET, except that the server stops after sending the response headers. Sending a request of

HEAD / HTTP/1.0

is a good way to test your web server and see if it is running correctly.
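The same test can be run from a program instead of by hand. The following is a minimal sketch that uses the IO::Socket::INET module, which comes with Perl, to open a connection to port 80 and send a HEAD request. The function names build_request and send_request are hypothetical, and the host name is assumed to be supplied on the command line. Because the request is plain HTTP/1.0 with no request headers, servers that host several sites on one address may not answer for the site you expect.

```perl
#!/usr/bin/perl -w
use strict;
use IO::Socket::INET;

# Build the text of a simple HTTP/1.0 request. The blank line at the
# end tells the server that the request headers are complete.
sub build_request {
    my ($method, $path) = @_;
    return "$method $path HTTP/1.0\r\n\r\n";
}

# Send a request to port 80 on $host and return everything the
# server sends back before it closes the connection.
sub send_request {
    my ($host, $request) = @_;
    my $socket = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 10,
    ) or die "Cannot connect to $host: $@\n";

    print $socket $request;
    local $/;                      # read until end of file
    my $response = <$socket>;
    close $socket;
    return $response;
}

# Contact a server only if a host name was given on the command line.
if (@ARGV) {
    print send_request($ARGV[0], build_request('HEAD', '/'));
}
```

Run with a host name as its only argument, the program prints the response headers and exits; replacing 'HEAD' with 'GET' returns the body as well.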

POST not only names a path on the server's computer, but also sends input in name-value pairs. (GET can also submit information in name-value pairs, but it is considered less desirable in most situations.) POST is usually invoked when a user clicks on the “submit” button in an HTML form.
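To show what such a submission looks like on the wire, here is a minimal sketch that encodes pairs of names and values the way an HTML form normally does and wraps them in a POST request. The encoding routine is hand-rolled for illustration and assumes byte strings; the function names, the path /cgi-bin/example.pl and the sample field values are all hypothetical.

```perl
#!/usr/bin/perl -w
use strict;

# Percent-encode one name or value for use in a form submission.
# Letters, digits and a few punctuation marks pass through unchanged,
# spaces become "+", and every other byte becomes %XX.
sub form_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~ -])/sprintf '%%%02X', ord $1/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Join pairs of names and values into "name=value&name=value" form.
sub form_body {
    my @pairs = @_;
    my @fields;
    while (my ($name, $value) = splice @pairs, 0, 2) {
        push @fields, form_encode($name) . '=' . form_encode($value);
    }
    return join '&', @fields;
}

# Build a complete POST request for $path carrying the given pairs.
# Content-Length tells the server how many bytes of body to expect.
sub build_post {
    my ($path, @pairs) = @_;
    my $body = form_body(@pairs);
    return join "\r\n",
        "POST $path HTTP/1.0",
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . length($body),
        '',
        $body;
}

print build_post('/cgi-bin/example.pl',
    name => 'Reuven Lerner', topic => 'LWP & Perl'), "\n";
```

Unlike GET, where the pairs would be appended to the path, the pairs here travel in the body of the request, after the blank line.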

LWP::Simple

Now that we have an understanding of the basics behind HTTP, let's see how we can handle requests and responses using Perl. Luckily, LWP contains objects for nearly everything we might want to do, with code tested by many people.

If we simply want to retrieve a document using HTTP, we can do so with the LWP::Simple module. Here, for instance, is a simple Perl program that retrieves the root document from my web site:

#!/usr/bin/perl -w
use strict;
use diagnostics;
use LWP::Simple;
# Get the contents
my $content = get "http://www.lerner.co.il/";
# Print the contents
print $content, "\n";

In this particular case, the startup and diagnostics code is longer than the program. Importing LWP::Simple into our program automatically brings the get function with it, which takes a URL, retrieves its contents with GET, and returns the body of the response. In this example, we print that output to the screen.
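One detail worth handling in a real program is failure: get returns undef when the document cannot be retrieved. The following is a minimal sketch of that check, assuming the URLs to fetch are given on the command line; the helper name describe is hypothetical and not part of LWP::Simple.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Describe the result of a retrieval. LWP::Simple's get returns
# undef when the document cannot be retrieved.
sub describe {
    my ($url, $content) = @_;
    return "$url: could not be retrieved" unless defined $content;
    return "$url: " . length($content) . " bytes";
}

# Retrieve each URL named on the command line; with no arguments,
# nothing is fetched.
foreach my $url (@ARGV) {
    print describe($url, get($url)), "\n";
}
```

Note that get gives no indication of why a retrieval failed; a program that needs the response code has to look beyond this one function.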

Once the document's contents are stored in $content, we can treat it as a normal Perl scalar, albeit one containing a fair amount of text. At this point, we could search for interesting text, perform search-and-replace operations on $content, remove any parts we find offensive, or even translate parts into Pig Latin. As an example, the following variation of this simple program turns the contents around: the order of the lines is reversed, so that the final line becomes the first and vice versa, and so is the order of the characters on every line, so that the final character becomes the first and vice versa:

#!/usr/bin/perl -w
use strict;
use diagnostics;
use LWP::Simple;
# Get the contents
my $content = get "http://www.lerner.co.il/";
# Print the contents
print scalar reverse($content), "\n";

Note how we must put reverse in scalar context in order for it to do its job. Since print takes a list of arguments, we force scalar context with the scalar keyword.
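The difference between the two contexts is easy to see with a short string. The following sketch uses a two-line sample in place of a retrieved document, so it runs without a network connection; the variable names are chosen for this illustration only.

```perl
#!/usr/bin/perl -w
use strict;

my $content = "first line\nsecond line\n";

# In list context, reverse returns its arguments in the opposite
# order. A single scalar is a one-element list, so it comes back
# unchanged.
my @list_result = reverse $content;

# In scalar context, reverse joins its arguments and returns the
# characters in the opposite order.
my $scalar_result = scalar reverse $content;

# To reverse only the order of the lines, split into a list first.
my $lines_reversed = join '', reverse split /^/, $content;

print "list context:   $list_result[0]";
print "scalar context: $scalar_result\n";
print "lines reversed: $lines_reversed";
```

The third variant keeps the characters on each line in their original order, which is often closer to what is wanted when only the order of the lines should change.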
