A Web-Based Clipping Service

Let LWP turn your web client into a midnight marauder.
Sorting Through the Output

It is handy to be able to download all or part of a web site. However, our initial goal was to be able to sort through the contents of a web site for one or more phrases of interest to us.

Such a program is not very different from download-recursively.pl. Our new version, download-matching.pl, differs in that it stores only messages that contain one or more of the phrases stored in an external file, phrase-file.txt. The code for both of these programs can be found in the file ftp.linuxjournal.com/pub/lj/listings/issue68/3714.tgz.

There are several ways to perform such checking and matching. I chose to do it in a relatively simple but straightforward way, iterating through each phrase in the file and using Perl's built-in string-matching mechanism.

Since the phrases will remain constant during the entire program, we load them from phrase-file.txt before the while loop begins:

my $phrase_file = "phrase-file.txt";
    my @phrases;
    open PHRASES, $phrase_file or die
    "Cannot read $phrase_file: $! ";
    while (<PHRASES>)
        push @phrases, $_;
    close PHRASES;

The above code iterates through each line of the phrase file, removing the trailing newline (with chomp) and then storing the line in @phrases. Each phrase must be on its own line in the phrase file; one possible file could look like this:

Once @phrases contains all of the phrases for which we want to search, download-matching.pl proceeds much like its less discriminating predecessor. The difference comes into play after the callback has already been invoked, scanning through the file for any new links. A site's table of contents might not contain any of the strings in @phrases, but the documents to which it points might.

After collecting new links, but before writing the file to disk, download-matching then iterates through the phrases in @phrases, comparing each one with the document. If it finds a match, it sets $did_match to 1 and exits from the loop:

foreach my $phrase (@phrases)
        if ($content =~ m/>.*[^<]*\b$phrase\b/is)
            # Did we match?
            $did_match = 1;
            print "        Matched $phrase\n";
            # Exit from the foreach if we found a
            # match

Notice how we surround $phrase with \b. This is Perl's way of denoting a separation between words, and ensures that our phrases do not appear in the middle of a word. For instance, if we were to search for “vest”, the \b metacharacters ensure that download-matching.pl will not match the word “investments”.

If $did_match is set to a non-zero value, at least one of the phrases was found in the document. (We use the /i option to Perl's m// matching operator to indicate that the search should be case-insensitive. If you prefer to make capital letters distinct from lowercase letters, remove the /i.) If $did_match is set to 0, we use next to go to the next iteration of the while loop, and thus to the next URL in %to_be_retrieved.

This all presumes a Boolean “or” match, in which only one of the phrases needs to match. If we want to insist that all of our phrases appear in a file to get a positive result (an “and” match), we must alter our strategy somewhat. Instead of setting $did_match to 1, we increment it each time a match is found. We then compare the value of $did_match with the number of elements in @phrases; if they are equal, we can be sure all of the phrases were contained in the document.

If possible, we want to avoid matching text contained within HTML tags. While writing this article for instance, I was surprised to discover just how many articles on Wired News (a technical news source) matched the word “mortgage”. In the end, I found the program was matching a phrase within HTML tags, rather than the text itself. We can avoid this problem by stripping the HTML tags from the file—but that would mean losing the ability to navigate through links when reading the downloaded files.

Instead, download-matching.pl copies the contents of the currently examined file into a variable ($content) and removes the HTML tags from it:

my $content = $response->content;
    $content =~ s|<.+?>||gs;

Notice how we use the g and s options to the substitution operator (s///), removing all pairs of HTML tags, even if they are separated by a newline character. (The s option includes the newline character in the definition of ., which is normally not the case.)

We avoid the ramifications of a greedy regular expression, in which Perl tries to match as much as possible, by putting a ? after the +. If we were to replace <.+>, rather than <.+?>, we would remove everything between the first < and the final > in the file—which would probably include the contents, as well as the HTML tags.

One final improvement of download-matching.pl over download-recursively.pl is that it can handle multiple command-line arguments. If @ARGV contains one or more arguments, these are used to initially populate %to_be_searched. If @ARGV is empty, we assign a default URL to $ARGV[0]. In both cases, the elements of @ARGV are turned into keys of %to_be_retrieved:

foreach my $url (@ARGV)
        print "    Adding $url to the list...\n"
        if $DEBUGGING;
        $to_be_retrieved{$url} = 1;

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState