Working with LWP

This month Mr. Lerner takes a look at the library for web programming and its associated modules.
HTTP::Request and HTTP::Response

There are times, however, when we will want to create more sophisticated applications, which in turn require more sophisticated means of contacting the server. Doing this will require a number of different objects, each with its own task.

First and foremost, we will have to create an HTTP::Request object. This object, as you can guess from its name, handles everything having to do with an HTTP request. We can create it most easily by saying:

use HTTP::Request;
my $request = new HTTP::Request("GET",
   "http://www.lerner.co.il");

where the first argument indicates the request method we wish to use, and the second argument indicates the target URL.

Once we have created an HTTP request, we need to send it to the server. We do this with a “useragent” object, which acts as a go-between in this exchange. We have already looked at the LWP::Simple useragent in our example programs above.

Normally, a useragent takes an HTTP::Request object as an argument, and returns an HTTP::Response object. In other words, given $request as defined above, our next two steps would be the following:

my $ua = new LWP::UserAgent;
my $response = $ua->request($request);

After we have created an HTTP::Response and assigned it to $response, we can perform all sorts of tests and operations.

For starters, we probably want to know the response code we received as part of the response, to ensure that our request was successful. We can get the response code and the accompanying text message with the code and message methods:

my $response_code = $response->code;
my $response_message = $response->message;

If we then say:

print qq{Code: "$code"\n};
print qq{Message: "$message"\n};
we will get the output:
Code: "200"
Message: "OK"
This is well and good, but it presents a bit of a problem: How do we know how to react to different response codes? We know that 200 means everything was fine, but we must build up a table of values in order to know which response codes mean we can continue, and which mean the program should exit and signal an error.

The is_success method for HTTP::Response handles this for us. With it, we can easily check to see if our request went through and if we received a response:

if (!$response->is_success)
   {print "Success. \n";}
else
   {print "Error: " . $response->status_line . "\n";
}

The status_line method combines the output from code and message to produce a numeric response code and its printed description.

We can examine the response headers with the headers method. This returns an instance of HTTP::Headers, which offers many convenient methods that allow us to retrieve individual header values:

my $headers = $response->headers;
print "Content-type:", $headers->content_type,
   "\n";
print "Content—length:",
   $headers->content_length, "\n";
print "Date:", $headers->date, "\n
print "Server:", $headers->server, "\n";
Working with the Content

Of course, the Web is not very useful without the contents of the documents we retrieve. HTTP::Response has only one method for retrieving the content of the response, unsurprisingly named content. We can thus say:

my $content = $response->content;

At this point, we are back to where we were with our LWP::Simple example earlier: We have the content of the document inside of $content, which stores the text as a normal string.

Listing 1

If we were interested in using HTTP::Request and HTTP::Response to reverse a document, we could do it as shown in Listing 1. If you were really interested in producing a program like this one, you would probably stick with LWP::Simple and use the get method described there. There is no point in weighing down your program, as well as adding all sorts of method calls, if the object is simply to retrieve the contents from a URL.

The advantage of using a more sophisticated user agent is the additional flexibility it offers. Whether that flexibility is worth the tradeoff in complexity will depend on your needs.

For example, many sites have a robots.txt in their root directory. Such a file tells “robots”, or software-controlled web clients, which sections of the site should be considered out of bounds. These files are a matter of convention, but are a long-standing tradition which should be followed. Luckily, LWP includes the object LWP::RobotUA, which is a user agent that automatically checks the robots.txt file before retrieving a file from a web site. If a file is excluded by robots.txt, LWP::RobotUA will not retrieve it.

LWP::RobotUA also differs from LWP::UserAgent in its attempt to be gentle to servers, by sending only one request per minute. We can change this with the delay method, although doing so is advisable only when you are familiar with the site and its ability to handle a large number of automatically generated requests.

______________________

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix