A Web-Based Clipping Service

Let LWP turn your web client into a midnight marauder.
Sorting Through the Output

It is handy to be able to download all or part of a web site. However, our initial goal was to be able to sort through the contents of a web site for one or more phrases of interest to us.

Such a program is not very different from download-recursively.pl. Our new version, download-matching.pl, differs in that it stores only messages that contain one or more of the phrases stored in an external file, phrase-file.txt. The code for both of these programs can be found in the file ftp.linuxjournal.com/pub/lj/listings/issue68/3714.tgz.

There are several ways to perform such checking and matching. I chose to do it in a relatively simple but straightforward way, iterating through each phrase in the file and using Perl's built-in string-matching mechanism.

Since the phrases will remain constant during the entire program, we load them from phrase-file.txt before the while loop begins:

my $phrase_file = "phrase-file.txt";
    my @phrases;
    open PHRASES, $phrase_file or die
    "Cannot read $phrase_file: $! ";
    while (<PHRASES>)
        push @phrases, $_;
    close PHRASES;

The above code iterates through each line of the phrase file, removing the trailing newline (with chomp) and then storing the line in @phrases. Each phrase must be on its own line in the phrase file; one possible file could look like this:

Once @phrases contains all of the phrases for which we want to search, download-matching.pl proceeds much like its less discriminating predecessor. The difference comes into play after the callback has already been invoked, scanning through the file for any new links. A site's table of contents might not contain any of the strings in @phrases, but the documents to which it points might.

After collecting new links, but before writing the file to disk, download-matching then iterates through the phrases in @phrases, comparing each one with the document. If it finds a match, it sets $did_match to 1 and exits from the loop:

foreach my $phrase (@phrases)
        if ($content =~ m/>.*[^<]*\b$phrase\b/is)
            # Did we match?
            $did_match = 1;
            print "        Matched $phrase\n";
            # Exit from the foreach if we found a
            # match

Notice how we surround $phrase with \b. This is Perl's way of denoting a separation between words, and ensures that our phrases do not appear in the middle of a word. For instance, if we were to search for “vest”, the \b metacharacters ensure that download-matching.pl will not match the word “investments”.

If $did_match is set to a non-zero value, at least one of the phrases was found in the document. (We use the /i option to Perl's m// matching operator to indicate that the search should be case-insensitive. If you prefer to make capital letters distinct from lowercase letters, remove the /i.) If $did_match is set to 0, we use next to go to the next iteration of the while loop, and thus to the next URL in %to_be_retrieved.

This all presumes a Boolean “or” match, in which only one of the phrases needs to match. If we want to insist that all of our phrases appear in a file to get a positive result (an “and” match), we must alter our strategy somewhat. Instead of setting $did_match to 1, we increment it each time a match is found. We then compare the value of $did_match with the number of elements in @phrases; if they are equal, we can be sure all of the phrases were contained in the document.

If possible, we want to avoid matching text contained within HTML tags. While writing this article for instance, I was surprised to discover just how many articles on Wired News (a technical news source) matched the word “mortgage”. In the end, I found the program was matching a phrase within HTML tags, rather than the text itself. We can avoid this problem by stripping the HTML tags from the file—but that would mean losing the ability to navigate through links when reading the downloaded files.

Instead, download-matching.pl copies the contents of the currently examined file into a variable ($content) and removes the HTML tags from it:

my $content = $response->content;
    $content =~ s|<.+?>||gs;

Notice how we use the g and s options to the substitution operator (s///), removing all pairs of HTML tags, even if they are separated by a newline character. (The s option includes the newline character in the definition of ., which is normally not the case.)

We avoid the ramifications of a greedy regular expression, in which Perl tries to match as much as possible, by putting a ? after the +. If we were to replace <.+>, rather than <.+?>, we would remove everything between the first < and the final > in the file—which would probably include the contents, as well as the HTML tags.

One final improvement of download-matching.pl over download-recursively.pl is that it can handle multiple command-line arguments. If @ARGV contains one or more arguments, these are used to initially populate %to_be_searched. If @ARGV is empty, we assign a default URL to $ARGV[0]. In both cases, the elements of @ARGV are turned into keys of %to_be_retrieved:

foreach my $url (@ARGV)
        print "    Adding $url to the list...\n"
        if $DEBUGGING;
        $to_be_retrieved{$url} = 1;