Client-Side Web Scripting

Marco shows you how to read or download only the parts of a web page that interest you.

There are many web browsers and FTP clients for Linux, all rich in features and able to satisfy every kind of user, from command-line fanatics to 3-D multiscreen desktop addicts. They all share one common defect, however: you have to be at the keyboard to drive them. Of course, fine tools like wget can mirror a whole site while you sleep, but you still have to find the right URL first, and once the download has finished, you still have to wade through everything it fetched.

With small, static sites this is no big deal, but what if the page you want to download every day is assigned a random URL each time? Or what if you don't want to wade through 100K of stuff just to scroll a few headlines?

Enter client-side web scripting, i.e., all the techniques that let you spend time only on the web pages (or the parts of them) that interest you, and only after your computer has found them for you. With such scripts you could read only the traffic or weather information for your area, download only certain pictures from a web page or automatically find the single link you need.

Mandatory Warning about Copyright and Bandwidth Issues

Besides saving time, client-side web scripting makes you aware of some important issues and teaches you some self-discipline. For one thing, doing indiscriminately what is explained here may in some cases be considered copyright infringement, or it may consume so much bandwidth that your internet account gets shut down, or worse. On the other hand, this freedom to surf is possible only as long as web pages are written in nonproprietary languages (HTML/XML) and delivered as plain ASCII text.

Finally, many fine sites can survive and remain free of charge only if they serve enough banner ads, so all of this really should be applied in moderation.

What Is Available

As usual, before doing something from scratch, one should check what has already been done and reuse it, right? A quick search on Freshmeat.net for “news ticker” returns 18 projects, from Kticker to K.R.S.S to GKrellM Newsticker.

These are all perfectly valid tools, but they only fetch news, so they won't handle other tasks without modification. Furthermore, they are almost all graphical tools, not something you can run as a cron entry, perhaps piping the output to some other program.

In this field, if you want to scratch your own particular itch, it is almost mandatory to write something yourself. This is also the reason why we don't present any complete solution here, but rather discuss the general methodology.

What Is Needed

The only prerequisites for taking advantage of this article are knowing enough Perl to put together some regular expressions and having the following Perl modules installed: LWP::UserAgent, LWP::Simple, HTML::Parse, HTML::Element, URI::URL and Image::Grab. You can fetch these from CPAN (www.cpan.org). Remember that even if you do not have the root password of your system (typically on your office computer), you can still install them in a directory of your choice, as explained in the Perl documentation and the relevant README files.
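For example, here is a minimal sketch of how a script can then reach modules installed under your home directory instead of system-wide; the path is only a placeholder and depends on where you told the installer to put the files:

# Tell Perl where the privately installed modules live (placeholder path)
use lib '/home/marco/perl-modules/lib/perl5/site_perl';
use LWP::Simple;    # now also searched for in the private directory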

Everything in this article has been tested under Red Hat Linux 7.2, but, after changing the absolute paths present in the code, it should work on any UNIX system that supports Perl and the several external applications used.

Collecting the Basic Information

All the tasks described below, and web-client scripting in general, require that you download the whole content of some initial web page, its last-modification date, a list of all the URLs it contains, or any combination of these, and store it internally for further analysis. All this information can be collected with a few lines of code at the beginning of each web-client script, as shown in Listing 1.

Listing 1. Collecting the Basic Information

The code starts with the almost mandatory "use strict" directive and then loads all the required Perl modules. Once that is done, we save the whole content of the web page in the $HTML_FILE variable via the get() method. With the instruction that follows, we save each line of the HTTP header in one element of the @HEADER array. Finally, we define an array (@ALL_URLS) and, with a for() cycle, extract all the links contained in the original web page and save them in it, making them absolute where necessary (with the abs() method). At the end of the cycle, the @ALL_URLS array contains every URL found in the initial document.
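Listing 1 itself is not reproduced here, but based on the description above and on the modules listed earlier, a minimal sketch of it could look like the following; the starting URL is only a placeholder, and the exact calls may differ from the original listing:

#!/usr/bin/perl
use strict;
use LWP::Simple;        # get()
use LWP::UserAgent;     # full-featured client, used here for the HEAD request
use HTTP::Request;
use HTML::Parse;        # parse_html()
use HTML::Element;      # extract_links()
use URI::URL;           # url() and abs()

my $URL = 'http://www.example.com/index.html';   # placeholder starting page

# Save the whole content of the page in one scalar
my $HTML_FILE = get($URL) or die "Could not fetch $URL\n";

# Save each line of the HTTP header in one element of @HEADER
my $browser  = LWP::UserAgent->new;
my $response = $browser->request(HTTP::Request->new(HEAD => $URL));
my @HEADER   = split /\n/, $response->headers_as_string;

# Extract every link in the page, making relative URLs absolute with abs()
my @ALL_URLS;
my $parsed_page = parse_html($HTML_FILE);
for my $link (@{ $parsed_page->extract_links() }) {
    my ($relative_url) = @$link;
    push @ALL_URLS, url($relative_url, $URL)->abs->as_string;
}
$parsed_page->delete;   # free the memory used by the parse tree

From here, a script can grep @HEADER for the Last-Modified: line or filter @ALL_URLS with regular expressions to single out the links it actually needs.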

A complete description of the Perl methods used in this code, and much more, can be found in the book Web Client Programming (see Resources).
