Client-Side Web Scripting
There are many web browsers and FTP clients for Linux, all rich in features and able to satisfy all users, from command-line fanatics to 3-D multiscreen desktop addicts. They all share one common defect, however: you have to be at the keyboard to drive them. Of course, fine tools like wget can mirror a whole site while you sleep, but you still have to find the right URL first, and when it's finished you must read through every bit that was downloaded anyway.
With small, static sites, it's no big deal, but what if every day you want to download a page that is given a random URL? Or what if you don't want to read 100K of stuff just to scroll a few headlines?
Enter client-side web scripting, i.e., all the techniques that allow you to spend time only looking at web pages (or parts of them) that interest you, and only after your computer found them for you. With such scripts you could read only the traffic or weather information related to your area, download only certain pictures from a web page or automatically find the single link you need.
Besides saving time, client-side web scripting lets you learn about some important issues and teaches you some self-discipline. For one thing, doing indiscriminately what is explained here may be considered copyright infringement in some cases or may consume so much bandwidth as to cause the shutdown of your internet account or worse. On the other hand, this freedom to surf is possible only as long as web pages remain in nonproprietary languages (HTML/XML), written in nonproprietary ASCII.
Finally, many fine sites can survive and remain available at no cost only if they send out enough banners, so all this really should be applied with moderation.
As usual, before doing something from scratch, one should check what has already been done and reuse it, right? A quick search on Freshmeat.net for “news ticker” returns 18 projects, from Kticker to K.R.S.S to GKrellM Newsticker.
These are all very valid tools, but they only fetch news, so they won't work without changes in different cases. Furthermore, they are almost all graphical tools, not something you can run as a cron entry, maybe piping the output to some other program.
In this field, in order to scratch only your very own itch, it is almost mandatory to write something for yourself. This is also the reason why we don't present any complete solution here, but rather discuss the general methodology.
The only prerequisites to take advantage of this article are to know enough Perl to put together some regular expressions and the following Perl modules: LWP::UserAgent, LWP::Simple, HTML::Parse, HTML::Element, URI::URL and Image::Grab. You can fetch these from CPAN (www.cpan.org). Remember that, even if you do not have the root password of your system (typically on your office computer), you still can install them in the directory of your choice, as explained in the Perl documentation and the relevant README files.
Everything in this article has been tested under Red Hat Linux 7.2, but after changing all absolute paths present in the code, should work on every UNIX system supporting Perl and the several external applications used.
All the tasks described below, and web-client scripting in general, require that you can download and store internally for further analysis the whole content of some initial web page, its last modification date, a list of all the URLs it contains or any combination of the above. All this information can be collected with a few lines of code at the beginning of each web-client script, as shown in Listing 1.
The code starts with the almost mandatory “use strict” directive and then loads all the required Perl modules. Once that is done, we proceed to save the whole content of the web page in the $HTML_FILE variable via the get() method. With the instruction that follows, we save each line of the HTTP header in one element of the @HEADER array. Finally, we define an array (@ALL_URLS), and with a for() cycle, we extract and save inside it all the links contained in the original web page, making them absolute if necessary (with the abs() method). At the end of the cycle, the @ALL_URLS array will contain all the URLs found in the initial document.
A complete description of the Perl methods used in this code, and much more, can be found in the book Web Client Programming (see Resources).
Articles about Digital Rights and more at http://stop.zona-m.net CV, talks and bio at http://mfioretti.com
- New Products
- Android Candy: Google Keep
- Readers' Choice Awards 2014
- A Little GUI for Your CLI
- Handling the workloads of the Future
- How Can We Get Business to Care about Freedom, Openness and Interoperability?
- diff -u: What's New in Kernel Development
- Days Between Dates?
- December 2014 Issue of Linux Journal: Readers' Choice
Editorial Advisory Panel
Thank you to our 2014 Editorial Advisors!
- Jeff Parent
- Brad Baillio
- Nick Baronian
- Steve Case
- Chadalavada Kalyana
- Caleb Cullen
- Keir Davis
- Michael Eager
- Nick Faltys
- Dennis Frey
- Philip Jacob
- Jay Kruizenga
- Steve Marquez
- Dave McAllister
- Craig Oda
- Mike Roberts
- Chris Stark
- Patrick Swartz
- David Lynch
- Alicia Gibb
- Thomas Quinlan
- Carson McDonald
- Kristen Shoemaker
- Charnell Luchich
- James Walker
- Victor Gregorio
- Hari Boukis
- Brian Conner
- David Lane