Client-Side Web Scripting

Marco shows you how to read or download only the parts that interest you from a web page.
Download Web Pages from the Command Line

Having collected all this material, we can start to use it. If you simply want to save the content of a web page on your disk for later reading, you have to add a print instruction to the original script:

print $HTML_FILE;

And then run it from your shell prompt:

./webscript.pl http://www.fsf.org > fsf.html

This will allow you to save the whole page in the local file fsf.html. Keep in mind, however, that if this is all you want, wget is a better choice (see Resources, “Downloading without a Browser”).
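For reference, here is a minimal sketch of the kind of script assumed in the rest of this section: it fetches the page given on the command line with LWP::Simple, collects all the absolute URLs into @ALL_URLS with HTML::LinkExtor, and prints the raw HTML. The details may differ from the script built earlier in the article:

#!/usr/bin/perl -w
# webscript.pl -- sketch of the assumed context, not the original script
use strict;
use LWP::Simple;
use HTML::LinkExtor;
use URI;

my $URL = $ARGV[0] or die "Usage: $0 <URL>\n";
my $HTML_FILE = get($URL) or die "Failed fetching $URL\n";

# collect every link and image reference, made absolute
my @ALL_URLS;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @ALL_URLS, map { URI->new_abs($_, $URL)->as_string } values %attr;
});
$parser->parse($HTML_FILE);
$parser->eof;

print $HTML_FILE;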

Save the Images Contained in a Web Page to Disk

If all the absolute URLs are already inside the @ALL_URLS array, we can download all the images with the following foreach() loop:

foreach my $GRAPHIC_URL (grep /\.(gif|jpg|png)$/,
  @ALL_URLS) {
    $GRAPHIC_URL =~ m/([^\/]+)$/;
    my $BASENAME = $1;
    print STDERR "SAVING $GRAPHIC_URL in $BASENAME....\n";
    my $IMG = get ($GRAPHIC_URL);
    open (IMG_FILE, "> $BASENAME") ||
        die "Failed opening $BASENAME\n";
    binmode IMG_FILE;    # image files are binary data
    print IMG_FILE $IMG;
    close IMG_FILE;
}

The loop operates on all the URLs contained in the document ending with the .gif, .jpg or .png extension (extracted from the original array with the grep instruction). First, the regular expression finds the actual filename, defined as everything in the URL from the rightmost slash sign to the end; this should be generalized to deal with URLs hosted on those systems so twisted that even the directory separator is backward.

The result of the match is loaded in the $BASENAME variable, and the image itself is saved with the already known get() method inside $IMG. After that, we open a file with the proper name and print the whole thing inside it.
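Should you meet URLs written with backward separators, as just mentioned, the basename match can be generalized to accept both characters (a sketch):

# accept both "/" and "\" as directory separators
$GRAPHIC_URL =~ m/([^\/\\]+)$/;
my $BASENAME = $1;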

Of course, many times you will not be interested in all the images (especially because many of them usually will be advertising banners, the site logo or other uninteresting stuff). In situations like this, a simple look at the HTML source will help you figure out what sets the image you need apart from the rest. For example, you may find out that the interesting picture has a random name but is always the third one in the list. If this is the case, modify the previous loop as follows:

my $IMG_COUNT  = 0;
my $WANTED_IMG = 3;
foreach my $GRAPHIC_URL (grep /\.(gif|jpg|png)$/,
  @ALL_URLS) {
    $IMG_COUNT++;
    next unless ($IMG_COUNT == $WANTED_IMG);
    # rest of the loop as before.....
    last if ($IMG_COUNT == $WANTED_IMG);
}
print "FILE NOT FOUND TODAY\n"
    if ($IMG_COUNT != $WANTED_IMG);

The first instruction in the loop increments the image counter; the second jumps to the next iteration until we reach the third picture. The “last” instruction avoids unnecessary iterations, and the print after the loop reports that the script could not perform the copy because it found fewer than $WANTED_IMG pictures in the source code.

If the image name is not completely random, it's even easier because you can filter directly on it in the grep instruction at the beginning:

foreach my $GRAPHIC_URL
  (grep /\/daily\d+\.jpg$/, @ALL_URLS) {

This will loop only on files whose names start with the “daily” string, followed by any number of digits (\d+) and a .jpg extension.

The two techniques can be combined at will, and much more sophisticated things are possible. If you know that the picture name is equal to the page title plus the current date expressed in the YYYYMMDD format, first extract the title:

$HTML_FILE =~ m/<TITLE>([^<]+)<\/TITLE>/i;
my $TITLE = $1;

Then calculate the date:

my ($sec, $min, $hour, $day, $month, $year, @dummy) = localtime(time);
$month++;      # months start at 0
$year += 1900; # Y2K-compliant, of course ;-)))
my $TODAY = sprintf("%04d%02d%02d", $year, $month, $day);
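If the core POSIX module is at hand, the same string can be built in a single step; strftime takes care of both the offsets and the zero padding (a sketch):

use POSIX qw(strftime);
my $TODAY = strftime("%Y%m%d", localtime);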
And finally, filter on this:

foreach my $GRAPHIC_URL
  (grep /\/$TITLE$TODAY\.jpg$/, @ALL_URLS) {
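Bear in mind that a page title may contain characters, such as dots or parentheses, that have a special meaning inside a regular expression; Perl's \Q...\E escape (or the quotemeta function) neutralizes them (a sketch):

foreach my $GRAPHIC_URL
  (grep /\/\Q$TITLE\E$TODAY\.jpg$/, @ALL_URLS) {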

Extract and Display Only One Specific Section of Text

Now it starts to get really interesting. Customizing your script to fetch only a certain section of the web page's text usually requires more time and effort than any other operation described here because it must be done almost from scratch on each page and repeated if the page structure changes. If you have a slow internet connection, or even a fast one but cannot slow down your MP3 downloads or net games, you rapidly will recover the time spent to prepare the script. You also will save quite a bit of money, if you (like me) still pay per minute.

You have to open and study the HTML source of the original web page to figure out which Perl regular expression filters out all and only the text you need. The Perl LWP library already provides methods to extract all the text out of the HTML code. If you only want a plain ASCII version of the whole content, go for them.
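One way to do this (a sketch, not necessarily the method the author has in mind) combines the HTML::TreeBuilder and HTML::FormatText modules, from the HTML-Tree and HTML-Format distributions:

use HTML::TreeBuilder;
use HTML::FormatText;

# turn the fetched HTML into plain ASCII text
my $tree = HTML::TreeBuilder->new_from_content($HTML_FILE);
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
print $formatter->format($tree);
$tree->delete;    # free the parse tree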

You may be tempted to let the LWP library extract the whole text from the source, and then work on it, even when you only want to extract some lines from the web page. I have found this method to be much more difficult to manage in real cases, however. Of course, the ASCII formatting makes the text immediately readable to a human, but it also throws out all the HTML markup that is so useful to tell the script which parts you want to save. The easiest example of this false start is if you want to save or display all and only the news titles, and they are marked in the source with the <H1></H1> tags. Those markers are trivial to use in a Perl regular expression, but once they are gone, it becomes much harder to make the script recognize headlines.
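Working directly on the HTML, by contrast, those hypothetical <H1> headlines would be one regular expression away (a sketch):

# print every <H1> headline, one per line
while ($HTML_FILE =~ m/<H1[^>]*>([^<]*)<\/H1>/gi) {
    print "$1\n";
}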

To demonstrate the method on a real web page, let's try to print inside our terminal all the press-release titles from the FSF page at www.fsf.org/press/press.html. Pointing our script at this URL will save all its content inside the $HTML_FILE variable. Now, let's apply to it the following sequence of regular expressions (I suggest that you also look at that page and at its source code with your browser to understand everything going on):

$HTML_FILE =~ s/.*>Press Releases<//gsmi;
$HTML_FILE =~ s/.*<DL>//gsmi;
$HTML_FILE =~ s/<\/DL>.*$//gsmi;
$HTML_FILE =~ s/<dt>([^<]*)<\/dt>/-> $1: /gi;
$HTML_FILE =~ s/<dd><a href=[^>]*>([^<]*)<\/a>/$1 /gsmi;
$HTML_FILE =~ s/\.\s+\([^\)]*\.\)<\/dd>/<DD>/gsmi;
$HTML_FILE =~ s/\s+/ /gsmi;
$HTML_FILE =~ s/<DD>/\n/gsmi;

The first three lines cut off everything before and after the actual press-release list. The fourth one finds the date and strips the HTML tags out of it. Regexes number five and six do the same thing to the press-release subject. The last two eliminate redundant white spaces and put new lines where needed. As of December 14, 2001, the output at the shell prompt looks like this (titles have been manually cut by me for better formatting):

-> 3 December 2001: Stallman Receives Prestigious...
-> 22 October 2001: FSF Announces Version 21 of the...
-> 12 October 2001: Free Software Foundation
   Announces...
-> 24 September 2001: Richard Stallman and
   Eben Moglen...
-> 18 September 2001: FSF and FSMLabs come
   to agreement...

The set of regular expressions above is not complete; for one thing, it doesn't manage news with update sections. One also should make it as independent as possible from extra spaces inside HTML tags or changes in the color or size of some fonts. This regular expression strips out one specific piece of font markup:

$HTML_FILE =~ s/<font face="Verdana" size="3">([^<]+)<\/font>/$1/g;

This performs the same task but works on any font type and (positive) font size:

$HTML_FILE =~ s/<font face="[^"]+"\s+size="\d+">([^<]+)<\/font>/$1/g;
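An even blunter variant simply throws away every <font> tag, whatever its attributes or internal spacing (a sketch):

$HTML_FILE =~ s/<font[^>]*>//gi;   # any opening <font ...> tag
$HTML_FILE =~ s/<\/font>//gi;      # any closing tag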
The example shown here, however, still is detailed enough to show the principle, and again the one-time effort to write a custom set for any given page really can save a lot of time.

______________________

Articles about Digital Rights and more at http://stop.zona-m.net. CV, talks and bio at http://mfioretti.com.
