A Web-Based Clipping Service

Let LWP turn your web client into a midnight marauder.

Sorting Through the Output

It is handy to be able to download all or part of a web site. However, our initial goal was to be able to sort through the contents of a web site for one or more phrases of interest to us.

Such a program is not very different from download-recursively.pl. Our new version, download-matching.pl, differs in that it stores only documents that contain one or more of the phrases listed in an external file, phrase-file.txt. The code for both of these programs can be found in the file ftp.linuxjournal.com/pub/lj/listings/issue68/3714.tgz.

There are several ways to perform such checking and matching. I chose a relatively simple and straightforward one: iterating through each phrase in the file and using Perl's built-in string-matching mechanism.

Since the phrases will remain constant during the entire program, we load them from phrase-file.txt before the while loop begins:

my $phrase_file = "phrase-file.txt";
my @phrases;

# Read the phrase file, storing one phrase per line
open PHRASES, $phrase_file
    or die "Cannot read $phrase_file: $!";
while (<PHRASES>)
{
    chomp;
    push @phrases, $_;
}
close PHRASES;

The above code iterates through each line of the phrase file, removing the trailing newline (with chomp) and then storing the line in @phrases. Each phrase must be on its own line in the phrase file; one possible file could look like this:

Linux
Reuven
mortgage

Once @phrases contains all of the phrases for which we want to search, download-matching.pl proceeds much like its less discriminating predecessor. The difference comes into play after the callback has been invoked and has scanned the file for new links. We collect links first because a site's table of contents might not contain any of the strings in @phrases, but the documents to which it points might.

After collecting new links, but before writing the file to disk, download-matching.pl iterates through the phrases in @phrases, comparing each one with the document. If it finds a match, it sets $did_match to 1 and exits from the loop:

foreach my $phrase (@phrases)
{
    # \Q...\E treats the phrase as literal text rather than as a pattern
    if ($content =~ m/\b\Q$phrase\E\b/i)
    {
        # Did we match?
        $did_match = 1;
        print "        Matched $phrase\n";

        # Exit from the foreach if we found a match
        last;
    }
}

Notice how we surround $phrase with \b. This is Perl's way of denoting a separation between words, and ensures that our phrases do not appear in the middle of a word. For instance, if we were to search for “vest”, the \b metacharacters ensure that download-matching.pl will not match the word “investments”.
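The effect of \b is easy to check in isolation. The following is a minimal sketch, not code from download-matching.pl: the helper name contains_phrase and the sample sentences are made up for illustration, and \Q...\E is added so that any punctuation in a phrase is treated literally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $phrase occurs in $text as a whole word.
sub contains_phrase
{
    my ($text, $phrase) = @_;

    # \Q...\E quotes any regex metacharacters inside the phrase
    return $text =~ m/\b\Q$phrase\E\b/i ? 1 : 0;
}

# Sample sentences, invented for this illustration
foreach my $text ("He wore a vest today", "Our investments did well")
{
    my $result = contains_phrase($text, "vest") ? "match" : "no match";
    print "$result: $text\n";
}
```

Keep in mind that \b marks the boundary between a word character and a non-word character, so a phrase that begins or ends with punctuation (such as "C++") may not match where you expect.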

If $did_match is set to a non-zero value, at least one of the phrases was found in the document. (We use the /i option to Perl's m// matching operator to indicate that the search should be case-insensitive. If you prefer to make capital letters distinct from lowercase letters, remove the /i.) If $did_match is set to 0, we use next to go to the next iteration of the while loop, and thus to the next URL in %to_be_retrieved.

This all presumes a Boolean “or” match, in which only one of the phrases needs to match. If we want to insist that all of our phrases appear in a file to get a positive result (an “and” match), we must alter our strategy somewhat. Instead of setting $did_match to 1, we increment it each time a match is found. We then compare the value of $did_match with the number of elements in @phrases; if they are equal, we can be sure all of the phrases were contained in the document.
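One way to sketch the "and" variant is shown here. This is an illustration under assumptions rather than code from the article's listing: the subroutine name matches_all and the sample text are hypothetical, while $did_match and @phrases play the same roles as in the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if every phrase appears in $content.
sub matches_all
{
    my ($content, @phrases) = @_;
    my $did_match = 0;

    foreach my $phrase (@phrases)
    {
        # Count each phrase that appears, instead of stopping at the first
        $did_match++ if $content =~ m/\b\Q$phrase\E\b/i;
    }

    # In numeric context, @phrases yields the number of phrases
    return $did_match == @phrases ? 1 : 0;
}

my @phrases = ("Linux", "mortgage");
my $text    = "Linux users rarely discuss their mortgage";
print "All phrases matched\n" if matches_all($text, @phrases);
```

Note that there is no early exit with last in this version, since every phrase must be checked. An empty phrase list counts as a match here, because zero matches equals zero phrases; a real program may want to treat that case separately.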

If possible, we want to avoid matching text contained within HTML tags. While writing this article, for instance, I was surprised to discover just how many articles on Wired News (a technical news source) matched the word “mortgage”. In the end, I found the program was matching a phrase within HTML tags, rather than the text itself. We could avoid this problem by stripping the HTML tags from the file, but that would mean losing the ability to navigate through links when reading the downloaded files.

Instead, download-matching.pl copies the contents of the currently examined file into a variable ($content) and removes the HTML tags from it:

my $content = $response->content;
$content =~ s|<.+?>||gs;

Notice how we use the g and s options to the substitution operator (s///), removing every HTML tag in the document, even one that is split across lines. (The s option includes the newline character in the definition of ., which is normally not the case.)

We avoid the ramifications of a greedy regular expression, in which Perl tries to match as much as possible, by putting a ? after the +. If we were to replace <.+>, rather than <.+?>, we would remove everything between the first < and the final > in the file—which would probably include the contents, as well as the HTML tags.
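The difference is easy to see on a small sample. The sketch below is illustrative only: the two subroutine names and the HTML fragment are made up, and each subroutine applies one of the two substitutions discussed above to a copy of its argument.

```perl
use strict;
use warnings;

# Non-greedy version: each tag is removed individually.
sub strip_tags
{
    my ($html) = @_;
    $html =~ s|<.+?>||gs;
    return $html;
}

# Greedy version, shown only for comparison.
sub strip_tags_greedy
{
    my ($html) = @_;
    $html =~ s|<.+>||gs;
    return $html;
}

# Hypothetical fragment, invented for this illustration
my $html = "<p>Linux <b>rocks</b></p>";

print "non-greedy: '", strip_tags($html), "'\n";
print "greedy:     '", strip_tags_greedy($html), "'\n";
```

With this fragment, the non-greedy version leaves the text "Linux rocks", while the greedy version leaves nothing at all, because the fragment begins with < and ends with >.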

One final improvement of download-matching.pl over download-recursively.pl is that it can handle multiple command-line arguments. If @ARGV contains one or more arguments, these are used to initially populate %to_be_retrieved. If @ARGV is empty, we assign a default URL to $ARGV[0]. In both cases, the elements of @ARGV are turned into keys of %to_be_retrieved:

foreach my $url (@ARGV)
{
    print "    Adding $url to the list...\n"
        if $DEBUGGING;
    $to_be_retrieved{$url} = 1;
}
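
The fallback to a default URL could be sketched as follows. This is a minimal illustration under assumptions: the subroutine name starting_urls is hypothetical, it works on a copy of the argument list rather than assigning to $ARGV[0], and the default URL is a placeholder, since the text above does not say which URL the program uses.

```perl
use strict;
use warnings;

# Placeholder default; the actual URL used by the program is not given here.
my $default_url = "http://www.example.com/";

# Hypothetical helper: returns a hash whose keys are the starting URLs.
sub starting_urls
{
    my (@urls) = @_;

    # Fall back to the default if no URLs were supplied
    @urls = ($default_url) unless @urls;

    my %to_be_retrieved;
    $to_be_retrieved{$_} = 1 foreach @urls;
    return %to_be_retrieved;
}

my %to_be_retrieved = starting_urls(@ARGV);
print "Will start from: ", join(", ", sort keys %to_be_retrieved), "\n";
```

Because the URLs are stored as hash keys, naming the same URL twice on the command line still results in a single entry.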