Working with LWP
Once we have retrieved the content from a web site, what can we do with it? As demonstrated above, we can print it out or play with the text. But many times, we want to analyze the tags in the document, picking out the images, the hyperlinks or even the headlines.
In order to do this, we could use regular expressions and m//, Perl's matching operator. But an easier way is to use HTML::LinkExtor, another object that is designed for this purpose. Once we create an instance of HTML::Extor, we can then use the parse method to retrieve each of the tags in it.
HTML::LinkExtor works differently from many modules you might have used before, in that it uses a “callback”. In this case, a callback is a subroutine defined to take two arguments—a scalar containing the name of the tag and a hash containing the name,value pairs associated with that tag. The subroutine is invoked each time HTML::LinkExtor finds a tag.
For example, given the HTML
<input type="text" value="Reuven" name="first_name" size="5">
our callback would have to be prepared to handle a scalar of value input, with a hash that looks like
(type => "text", value => "Reuven", name => "first_name", size => "5")Listing 2
If we are interested in printing the various HTML tags to the screen, we could write a simple callback that looks like Listing 2. How do we tell HTML::LinkExtor to invoke our callback subroutine each time it finds a match? The easiest way would be for us to hand callback to the parse method as an argument.
Perl allows us to pass subroutines and other blocks of code as if they were data by creating references to them. A reference looks and acts like a scalar, except that it can be turned into something else. Perl has scalar, array and hash references; subroutine references fit naturally into this picture as well. HTML::LinkExtor will dereference and use our subroutine as we have defined it.
We turn a subroutine into a subroutine reference by prefacing its name with \&. Perl 5 no longer requires that you put & before subroutine names, but it is required when you are passing a subroutine reference. The backslash tells Perl we want to turn the object in question into a reference. If &callback is defined as above, then we can print out all of the links in a document with the following:
my $parser = HTML::LinkExtor->new(\&callback); $parser->parse($response->content);
Note that $content might have all HTML links that were returned with the HTTP response. However, that response undoubtedly contained some relative URLs, which will not be interpreted correctly out of context. How can we accurately view the link?
HTML::LinkExtor takes that into account, and allows us to pass two arguments to its constructor (new), rather than just one. The second argument, which is optional, is the URL from which we received this content. Passing this URL ensures all URLs we extract will be complete. We must include the line
in our application if we want to use this feature. We can then say
my $parser = HTML::LinkExtor->new(\&callback, "http://www.lerner.co.il/"); $parser->parse($response->content);and our callback will be invoked for each tag, with a full, absolute URL even if the document contains a relative one.
Our version of &callback above prints out all links, not just hyperlinks. We can ignore all but “anchor” tags, which allow us to create hyperlinks by modifying &callback slightly, as shown in Listing 3.
With all of this under our belts, we will write an application (Listing 4) that follows links recursively until the program is stopped. This sort of program can be useful for checking links on your site or harvesting information from documents.
Our program, download-recursively.pl, starts at the URL called $origin and collects the URLs contained within it, placing them in the hash %to_be_retrieved. It then goes through each of those URLs one by one, collecting any hyperlinks that might be contained within them. Each time it retrieves a URL, download-recursively.pl places it in %already_retrieved. This ensures we will not download the same URL twice.
We create $ua, our instance of LWP::RobotUA, outside of the “while” loop. After all, our HTTP requests and responses will be changing with each loop iteration, but our user agent can remain the same throughout the life of the program.
We go through each URL in %to_be_retrieved in a seemingly random order, taking the first item returned by keys. It is obviously possible to sort the keys before taking the first element from the resulting list or to do a depth-first or breadth-first search through the list of URLs.
Inside the loop, the code is as we might expect: we create a new instance of HTTP:Request and pass it to $ua, receiving a new instance of HTTP:Response in return. Then we parse the response content with HTML::LinkExtor, putting each new URL in %to_be_retrieved, but only on the condition that it is not already a key in %already_retrieved.
You may find it interesting to let this program go for a while, following links from one of your favorite sites. The Web is all about linking; see who is linking to whom. You might be surprised by what you find.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- <Watch> HD! Watch Walking On Sunshine Online Full Movie Streaming
- SUSE LLC's SUSE Manager
- New Products
- New Products
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- Tighter SSH Security with Two-Factor Authentication
- Tails above the Rest: the Installation
- A Web-Based Linux Training Course
- Python Programming for Beginners
- Advanced Memory Allocation
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide