A Web-Based Clipping Service
In November, we saw how Perl's Library for Web Programming (LWP) can be used to create a simple HTTP client, retrieving one or more pages from the Web. This month, we will extend those efforts to create a program that can not only retrieve pages from the Web, but categorize them according to our preferences. In this way, we can create our own web-based clipping service, finding those articles that are of particular interest to us.
LWP consists of several modules which allow us to work with HTTP, the “hypertext transfer protocol”. HTTP works on a stateless request-response basis: a client connects to a server and submits a request. The server then generates a response, and closes the connection. (If you missed last month's column, it is available here: Working with LWP. You should read that article before continuing.)
We need a program that will go to a particular URL and save the contents of that URL on disk. Furthermore, we want to follow any hyperlinks in that document to collect other news stories. However, we do not want to follow links to other sites; this keeps the program focused on the site we care about and prevents it from wandering off across the rest of the Web.
In other words, I would like to be able to point a program at a site and retrieve all of its files onto disk. A first stab at such a program, download-recursively.pl, is similar to the simple robot program we explored last month. It uses two hashes, %already_retrieved and %to_be_retrieved, to store URLs. Rather than storing the URLs as values in the hashes, we use them as keys. Because hash keys are unique, each URL appears only once, which avoids infinite loops and miscounting. URLs are placed in %to_be_retrieved when they are first encountered, then moved to %already_retrieved after their contents are retrieved. $origin, a scalar variable that contains the initial URL, takes a default value if no argument is provided on the command line.
Retrievals are performed with a while loop. Each iteration of the while loop retrieves another URL from %to_be_retrieved, and uses it to create a new instance of HTTP::Request.
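The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the listing of download-recursively.pl: the %links_on_page table and the URLs in it are hypothetical stand-ins for real HTTP retrievals, so that the loop can run without a network connection.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Web: each URL maps to the links on that page.
my %links_on_page = (
    'http://foo.com/'             => ['http://foo.com/a/index.html',
                                      'http://foo.com/a/b/c/'],
    'http://foo.com/a/index.html' => ['http://foo.com/', 'http://foo.com/a/b/c/'],
    'http://foo.com/a/b/c/'       => ['http://foo.com/'],
);

my $origin            = 'http://foo.com/';
my %to_be_retrieved   = ($origin => 1);
my %already_retrieved = ();

while (%to_be_retrieved) {
    # Take any one pending URL; the order in which a hash returns keys is arbitrary.
    my ($url) = keys %to_be_retrieved;

    # The real program would build HTTP::Request->new(GET => $url) here and
    # send it; this sketch simply looks the links up in the table.
    foreach my $link (@{ $links_on_page{$url} || [] }) {
        $to_be_retrieved{$link} = 1 unless $already_retrieved{$link};
    }

    # Move the URL from the pending hash to the finished hash.
    delete $to_be_retrieved{$url};
    $already_retrieved{$url} = 1;
}

print "Retrieved: $_\n" foreach sort keys %already_retrieved;
```

Because the URLs are hash keys, a page that is linked from several places is still queued only once, and the loop ends as soon as %to_be_retrieved is empty.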
The method $response->last_modified returns the date and time on which a document was last modified. Subtracting $response->last_modified from the current time, and then comparing this result with the maximum age of documents we wish to see ($maximum_age) allows us to filter out relatively old documents:
    my $document_age = time - $response->last_modified;
    if ($document_age > $maximum_age)
    {
        print STDOUT " Age of document: $document_age\n";
        next;
    }
If the document is too old, we use next to skip to the next iteration of the while loop, and thus to the next URL to be retrieved.
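The age test can also be isolated in a small subroutine, which makes it easy to try out without a live server. The following is only a sketch: is_too_old is a hypothetical helper, the one-week value of $maximum_age is an arbitrary choice, and keeping documents whose modification time is unknown is an assumption of this sketch. That last case matters because $response->last_modified can be undefined when a server sends no Last-Modified header.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when a document should be skipped, else 0.
# $last_modified is in seconds since the epoch, or undef if it is unknown.
sub is_too_old {
    my ($last_modified, $maximum_age, $now) = @_;
    $now = time unless defined $now;

    # Assumption: a document with no modification time is kept, not skipped.
    return 0 unless defined $last_modified;

    return ($now - $last_modified) > $maximum_age ? 1 : 0;
}

my $maximum_age = 7 * 24 * 60 * 60;    # one week in seconds (arbitrary choice)
my $now         = time;

# An hour-old document, a month-old document, and one with no timestamp.
foreach my $last_modified ($now - 3600, $now - 30 * 86400, undef) {
    print is_too_old($last_modified, $maximum_age, $now) ? "skip\n" : "keep\n";
}
```

Inside the while loop, the call would look like next if is_too_old($response->last_modified, $maximum_age).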
Next, we parse the contents of the HTTP response, using the HTML::LinkExtor module. When we create an instance of HTML::LinkExtor, we are actually creating a simple parser that can look through a page of HTML and return one or more pieces of information. The analysis is performed by a subroutine, often named callback. A reference to callback is passed, along with the URL of the page being parsed, when we create a new instance of HTML::LinkExtor; the URL serves as the base against which relative links are resolved.
    my $parser = HTML::LinkExtor->new(\&callback, $url);
The resulting object can then parse our URL's contents by invoking:
    $parser->parse($response->content);
When $parser->parse is invoked, &callback is invoked once for each tag in the file that contains a link. Our version of &callback takes each URL from the href attribute of each <a> tag, saving it in %to_be_retrieved unless it already exists in %already_retrieved.
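A callback along these lines might look like the following. It is a sketch rather than the original listing: the sample HTML stands in for $response->content, and the URLs are hypothetical. It assumes the HTML::LinkExtor module (and the URI module it relies on for base URLs) is installed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my %already_retrieved = ('http://foo.com/a/index.html' => 1);
my %to_be_retrieved;

# Called with the tag name, followed by name/value pairs for its link attributes.
sub callback {
    my ($tag, %attributes) = @_;

    # Only hyperlinks are of interest here, not images or other linked files.
    return unless $tag eq 'a' && defined $attributes{href};

    # With a base URL supplied, the link arrives as an absolute URL object;
    # interpolating it in a string yields the plain URL text.
    my $link = "$attributes{href}";

    $to_be_retrieved{$link} = 1 unless exists $already_retrieved{$link};
}

# Hypothetical sample page standing in for $response->content.
my $html = <<'END_HTML';
<html><body>
<a href="index.html">Home</a>
<a href="b/c/">Stories</a>
<img src="logo.gif">
</body></html>
END_HTML

my $url    = 'http://foo.com/a/';
my $parser = HTML::LinkExtor->new(\&callback, $url);
$parser->parse($html);
$parser->eof;    # signal that no more input is coming

print "$_\n" foreach sort keys %to_be_retrieved;
```

Here the relative link b/c/ is resolved against the base URL and queued, the link to index.html is dropped because it is already in %already_retrieved, and the <img> tag is ignored.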
Finally, we save the retrieved document on the local file system. The tricky part of saving the file to disk comes from the way in which we retrieve the URLs: we are not traversing a tree of URLs, but pulling them out of the hash in whatever order it returns them. This means the URL http://foo.com/a/b/c/ might be retrieved before http://foo.com/a/index.html. Thus, we may need the directory /a/b/c on our local system before /a and /a/b have been created. How can we do this?
My solution was to use Perl's built-in split function, which turns a scalar into a list. By assigning this list of path components to an array (@output_directory), we can sequentially build up the directory from the root (/) down to the final name. Along the way, we check to see if each directory exists. If it does not, we create it with mkdir. If mkdir fails, we exit with a fatal error, indicating what error occurred.
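The approach can be sketched like this. The make_directories helper and the scratch directory are hypothetical, introduced only for illustration; the split, the existence check and the mkdir call follow the description above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: create each directory named in $path beneath $root.
sub make_directories {
    my ($root, $path) = @_;

    # Splitting "/a/b/c" on "/" yields an empty first field; grep drops it.
    my @output_directory = grep { length } split m{/}, $path;

    my $directory = $root;
    foreach my $part (@output_directory) {
        $directory .= "/$part";

        # Nothing to do if an earlier URL already caused this one to be made.
        next if -d $directory;

        mkdir($directory, 0755)
            or die "Cannot create directory '$directory': $!\n";
    }

    return $directory;
}

# Try it in a scratch directory that is removed when the program exits.
my $root    = tempdir(CLEANUP => 1);
my $created = make_directories($root, '/a/b/c');
print "Created $created\n";
```

Because every level is checked before it is created, it does not matter whether /a/b/c/ or /a/index.html happens to be retrieved first. The standard File::Path module offers make_path, which does the same job in a single call.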
URLs that lack a file name are saved as “index.html”. Since this is the default file name served by many web servers, it is unlikely to collide with the name of any other file in the same directory.
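A minimal sketch of this naming rule follows. The local_file_name helper is hypothetical, and it assumes that a URL path ending in a slash is one that lacks a file name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the path portion of a URL into a local file path.
sub local_file_name {
    my ($path) = @_;

    # An empty or missing path is treated as the top of the site.
    $path = '/' unless defined $path && length $path;

    # A path ending in "/" names a directory, so add the default file name.
    $path .= 'index.html' if $path =~ m{/\z};

    return $path;
}

print local_file_name('/a/b/c/'), "\n";
print local_file_name('/a/index.html'), "\n";
```

Paths that already name a file pass through unchanged, so only directory-style URLs pick up the default name.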
The end result of running this program is a mirror of the downloaded site, stored in $output_directory.