Client-Side Web Scripting
There are many web browsers and FTP clients for Linux, all rich in features and able to satisfy all users, from command-line fanatics to 3-D multiscreen desktop addicts. They all share one common defect, however: you have to be at the keyboard to drive them. Of course, fine tools like wget can mirror a whole site while you sleep, but you still have to find the right URL first, and when it's finished you must read through every bit that was downloaded anyway.
With small, static sites, it's no big deal, but what if every day you want to download a page that is given a random URL? Or what if you don't want to read 100K of stuff just to scroll a few headlines?
Enter client-side web scripting, i.e., all the techniques that allow you to spend time only looking at web pages (or parts of them) that interest you, and only after your computer has found them for you. With such scripts you could read only the traffic or weather information related to your area, download only certain pictures from a web page or automatically find the single link you need.
Besides saving time, client-side web scripting lets you learn about some important issues and teaches you some self-discipline. For one thing, applying the techniques explained here indiscriminately may be considered copyright infringement in some cases, or may consume so much bandwidth that your internet account is shut down, or worse. On the other hand, this freedom to surf is possible only as long as web pages remain in nonproprietary languages (HTML/XML), written in nonproprietary ASCII.
Finally, many fine sites can survive and remain available at no cost only if they send out enough banners, so all this really should be applied with moderation.
As usual, before doing something from scratch, one should check what has already been done and reuse it, right? A quick search on Freshmeat.net for “news ticker” returns 18 projects, from Kticker to K.R.S.S to GKrellM Newsticker.
These are all very valid tools, but they only fetch news, so they cannot be applied to other tasks without changes. Furthermore, they are almost all graphical tools, not something you can run as a cron entry, maybe piping the output to some other program.
In this field, in order to scratch only your very own itch, it is almost mandatory to write something for yourself. This is also the reason why we don't present any complete solution here, but rather discuss the general methodology.
The only prerequisites to take advantage of this article are enough knowledge of Perl to put together some regular expressions, and the following Perl modules: LWP::UserAgent, LWP::Simple, HTML::Parse, HTML::Element, URI::URL and Image::Grab. You can fetch these from CPAN (www.cpan.org). Remember that, even if you do not have the root password of your system (typically on your office computer), you still can install them in the directory of your choice, as explained in the Perl documentation and the relevant README files.
Everything in this article has been tested under Red Hat Linux 7.2 but, after changing all absolute paths present in the code, it should work on every UNIX system supporting Perl and the several external applications used.
All the tasks described below, and web-client scripting in general, require that you be able to download some initial web page and store, for further analysis, its whole content, its last modification date, a list of all the URLs it contains or any combination of the above. All this information can be collected with a few lines of code at the beginning of each web-client script, as shown in Listing 1.
The code starts with the almost mandatory “use strict” directive and then loads all the required Perl modules. Once that is done, we proceed to save the whole content of the web page in the $HTML_FILE variable via the get() method. With the instruction that follows, we save each line of the HTTP header in one element of the @HEADER array. Finally, we define an array (@ALL_URLS) and, with a for() loop, we extract and save inside it all the links contained in the original web page, making them absolute if necessary (with the abs() method). At the end of the loop, the @ALL_URLS array will contain all the URLs found in the initial document.
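For readers who want something to experiment with, the steps just described can be sketched roughly as follows. This is not the original listing, only a minimal reconstruction under a few assumptions: the helper name all_urls and the use of a command-line argument for the start URL are invented for illustration; HTML::TreeBuilder (the module that HTML::Parse wraps) is used to build the parse tree; and @HEADER is filled with the head() function of LWP::Simple, which returns five summary fields (content type, length, modification time, expiry time, server) rather than the raw header lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get head);
use HTML::TreeBuilder;
use URI::URL;

# Hypothetical helper: return every link found in $html as an
# absolute URL string, resolved against $base.
sub all_urls {
    my ($html, $base) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @urls;
    # extract_links() returns a reference to an array of
    # [link, element, attribute, tag] entries, in document order.
    for my $link (@{ $tree->extract_links() }) {
        my ($raw) = @$link;
        # url() remembers the base; abs() makes the link absolute.
        push @urls, url($raw, $base)->abs->as_string;
    }
    $tree->delete;    # free the parse tree
    return @urls;
}

# Fetch only when a start URL is given on the command line
# (an assumption of this sketch, so the helper can be reused alone).
if (my $START_URL = shift @ARGV) {
    my $HTML_FILE = get($START_URL);
    die "Could not fetch $START_URL\n" unless defined $HTML_FILE;

    # (content type, length, modification time, expiry, server)
    my @HEADER = head($START_URL);

    my @ALL_URLS = all_urls($HTML_FILE, $START_URL);

    print "Last modified: ", scalar localtime($HEADER[2]), "\n"
        if $HEADER[2];
    print "$_\n" for @ALL_URLS;
}
```

Saved under any name you like and run with a URL as its only argument, the script prints the modification date (when the server reports one) followed by one absolute URL per line, which makes it easy to pipe into grep or another script.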
A complete description of the Perl methods used in this code, and much more, can be found in the book Web Client Programming (see Resources).
Articles about Digital Rights and more at http://stop.zona-m.net CV, talks and bio at http://mfioretti.com