Smart (Script-Aided) Browsing

How to mirror a web site from within your browser.

Basically, there are two ways to surf the Net: interactively, with any text or graphical browser, or in batch mode, with a program that copies single pages or whole web sites to your hard drive for later use. Script-aided browsing is the part of client-side web scripting that makes your use of the Web more efficient and powerful by merging these two techniques, in one of the two following ways.

In the first case, you run, either directly or as a dæmon, a script that downloads a web page, extracts an interesting URL from its source code and then opens your favourite web browser on the corresponding page before terminating. Several examples of this first method, applied to Konqueror, Galeon and Netscape (Mozilla uses the same commands as its cousin), have already been described in my article "Client Side Web Scripting", published in the March 2002 issue of Linux Journal.
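To fix ideas, here is a minimal sketch of a script of this kind; the page URL and the pattern used to pick out the link are purely illustrative placeholders, and the last line assumes Konqueror is the browser you want to open:

    #!/bin/bash
    # Sketch of the first method: fetch a page, extract the first absolute
    # link found in its source code, then open the browser on it.
    PAGE='http://www.example.com/news.html'    # placeholder URL
    URL=$(wget -q -O - "$PAGE" | grep -o 'http://[^"]*' | head -n 1)
    konqueror "$URL" &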

The second case, also mentioned in that article, is the opposite of the first. That is, during normal interactive web browsing, you notice a hyperlink pointing to an interesting page and, from within your browser, you launch a web script that automatically downloads that page and performs some more or less complex action on it. This action can be anything you can imagine: download all the images contained in that page, list in a pop-up window all the pages it points to and so on. You are limited only by your scripting skills.
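As a tiny illustration of this second case, the following sketch simply fetches the selected page together with the images it needs; the browser is expected to pass the URL as the first argument, exactly as in the real example below:

    #!/bin/bash
    # Sketch of the second method: the browser hands over the selected URL
    # as $1; here we just grab that page plus its inline images
    # (-p = page requisites, -k = convert links for local viewing).
    /usr/bin/wget -p -k "$1"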

Here's an example: mirror a web page and all the pages it points to. Let's assume that you just discovered some new, interesting program. On its home page, a link points directly to the voluminous subsection of the web site containing the complete user manual, and you want to mirror all of the information on your hard disk. The standard tool for these cases is wget, so we don't need to write a new one. However, how do we launch it directly from the web browser, without opening a terminal window and typing the URL by hand? The rest of this article explains how to automate this operation in Konqueror; the example has been tested with the standard KDE, Konqueror and wget tools that come with Red Hat 7.2.

Step 1: Prepare the wget Script

Write a simple shell script that invokes wget with the -m (mirror) option on the first argument and call it wgetscript.sh (or whatever you want, of course). The content of my script is:

    #!/bin/bash
    # -m mirror the site, -L follow relative links only,
    # -t 5 retry up to five times, -w 5 wait five seconds between retrievals
    /usr/bin/wget -m -L -t 5 -w 5 "$1"
    exit

Put the script in a suitable directory (I chose $HOME/bin) and make it executable with chmod 755 <filename>.
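If you prefer to double-check everything from a terminal before touching the KDE menu, the whole setup amounts to something like the following (the paths simply mirror my own choices above, and the test URL is a placeholder):

    mkdir -p $HOME/bin
    cp wgetscript.sh $HOME/bin/
    chmod 755 $HOME/bin/wgetscript.sh
    # quick manual test before wiring the script into Konqueror
    $HOME/bin/wgetscript.sh http://www.example.com/manual/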

Step 2: Add the Script to the KDE Application Menu

Following the guidelines in this paragraph of the KDE user guide, www.kde.org/documentation/userguide/adding-programs.html, add the script to the KDE menu. Figure 1 shows what I had to write to accomplish this. The string "mymirror" is the one that actually appears in the menu, and the comment is self-explanatory. The really interesting thing in this picture, i.e., the bit of black magic absolutely essential for the correct working of the whole procedure, is the content of the "Command" box:

/home/marco/bin/wgetscript.sh %u

Apart from using the complete path to the script, the important part is %u: this is what tells Konqueror to pass the selected URL to the script as its first argument. Notice also that I checked the Run in terminal option; this way a Konsole window will open to run the script, so you can see what happens.

Figure 1. Adding wget to the KDE Menu
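For the curious: under the hood, KDE stores such a menu entry as a .desktop file. Purely as an illustration (the exact file location and key spellings vary between KDE releases, so take this as an assumption rather than gospel), the entry created above corresponds to something like:

    [Desktop Entry]
    Type=Application
    Name=mymirror
    Comment=Mirror the selected link with wget
    Exec=/home/marco/bin/wgetscript.sh %u
    Terminal=true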

Step 3: Launch the Script

Now, to use this script from Konqueror, right-click on the link you want to mirror (I chose the "Manuals online" link on the Free Software Foundation page for this example) and select the Open with... option. Konqueror will open the window shown in Figure 2, which will allow you to choose "mymirror".

Figure 2. The Open with Option

______________________

Articles about Digital Rights and more at http://stop.zona-m.net. CV, talks and bio at http://mfioretti.com.

Comments


Re: Smart (Script-Aided) Browsing


Interestingly, some news websites like www.ananova.com analyse the HTTP headers received from the client to detect whether the client is a command-line tool like wget or curl or a graphical web browser like Netscape or Explorer. The reply you get from the web server is different depending on the type of client you are using. Presumably this is done deliberately to stop users mirroring content. wget will actually fail to download web pages from www.ananova.com. Perhaps somebody knows of a ready-made command-line client which can simulate the HTTP headers of Netscape et al.

Wills (/. 242929) -- unable to login for some reason despite having both cookies and referrer enabled.

reply from the author


The technical solution to your question is here:

http://www.linuxgazette.com/issue70/chung.html

Even Konqueror and other browsers allow you to set the User-Agent string at will. As Mr. Chung points out, however, the real problem is:

Warning: the above tip may be considered circumventing a content licensing mechanism and there exist anti-social legal systems that have deemed these actions to be illegal. Check your local legislature. Your mileage may vary.

To me (Marco Fioretti), this is the same as saying that someone will sell me a VHS tape only if I commit to watching it only with the most expensive VHS player around, from a particular vendor of his choice. The real solution in these cases is to politely tell the webmaster that he got the whole Internet thing backward, that you will never visit that site again and that you will encourage everybody you know to do the same.

Best Regards,

Marco Fioretti

www.freesoftware.fsf.org/rule/

complications


I think the situation is more complicated than you suggest. The website www.ananova.com detects which type of client you are browsing with (i.e., command-line tools such as wget and curl, or graphical browsers such as Netscape and Explorer) by looking at the HTTP headers instead of at the user-agent alone. This can be proved by using the --user-agent option in wget to set the user-agent to any of the typical values for a graphical browser, e.g., Mozilla/4.0 (compatible; MSIE 4.01; Windows 98). If you try this, wget will still fail to read any pages from www.ananova.com.

To see the differences in the HTTP headers, try watching the communications between your computer and www.ananova.com (while you are browsing www.ananova.com) with something like either tcpdump -w - host www.ananova.com and port 80 | strings, or strace -f on the process id of a junkbuster proxy. Graphical browsers like Netscape and Explorer supply HTTP headers such as Proxy-Connection, Accept, Accept-Encoding, Accept-Charset, etc. Command-line tools like wget and curl don't supply these headers. The differences are enough for www.ananova.com to be able to recognise the browser type and react differently in each case.

On your second point regarding the legality of wget, I do not know how the law applies in your country, but in the UK intention is the decisive factor. If your intention is to read a public website, and you have not previously entered a contract to use only one specified method of reading that website, then you are free to use any reading method of your choice on your own computer as long as it causes no loss or damage to anyone else and their property. There is no UK case law on methods of reading a website, so UK statutes provide the default legal framework. If, however, your intention was to use wget to copy a website and sell the copied content to others, then you are committing several offences under UK law, including copyright infringement.

Wills (still unable to login despite having enabled both cookies and referrer)

Re: complications


Wills,

thank you for your explanation. I know that some webmasters prefer to spend their time applying such measures, rather than providing portable content.

In such cases, I can't help but repeat my suggestion to boycott such sites.

This assumes, of course, that wget and friends are used only for personal purposes, i.e., reading a public website at your convenience. Copyright infringement is a crime and must be avoided or prosecuted.

Regards,

Marco Fioretti

(now searching for a wget-like application providing all the missing headers, will report...)

Re: complications


man wget

[...]

--header=additional-header
    Define an additional-header to be passed to the HTTP servers.
    Headers must contain a ':' preceded by one or more non-blank
    characters, and must not contain newlines.

    You may define more than one additional header by specifying
    --header more than once.

        wget --header='Accept-Charset: iso-8859-2' \
             --header='Accept-Language: hr' \
             http://fly.srk.fer.hr/

    Specification of an empty string as the header value will clear
    all previous user-defined headers.

[...]
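Combining --header with --user-agent, a command along these lines might be worth trying; the header values are only a rough guess at what a graphical browser sends (not an exact copy of Netscape's output), and the URL is a placeholder:

    # try to look like a graphical browser: spoof the user-agent and add
    # the extra headers command-line tools normally omit
    wget --user-agent='Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)' \
         --header='Accept: text/html, image/gif, image/jpeg, */*' \
         --header='Accept-Language: en' \
         --header='Accept-Charset: iso-8859-1' \
         -m -L http://www.example.com/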
