Smart (Script-Aided) Browsing

How to mirror a web site from within your browser.

Basically, there are two ways to surf the Net: interactively, with any text or graphical browser, or in batch mode, with a program that copies single pages or whole web sites to your hard drive for later use. Script-aided browsing is the part of client-side web scripting that makes your use of the Web more efficient and powerful by merging these two techniques in one of the two following ways.

In the first case, you run, either directly or as a dæmon, a script that downloads a web page, extracts an interesting URL from its source code and terminates, thereupon opening your favourite web browser to the corresponding page. Several examples of this first method, applied to Konqueror, Galeon and Netscape (Mozilla uses the same commands as its cousin), have already been described in my article "Client Side Web Scripting", published in the March 2002 issue of Linux Journal.
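The flow of this first method can be sketched in a few lines of shell. Everything here is an assumption to adapt: the grep pattern only catches absolute links, and the browser command assumes Konqueror.

```shell
#!/bin/sh
# Sketch of method one: fetch a page, pull the first absolute link
# out of its source, then point the browser at it.

extract_first_url() {
    # keep the first http:// link found in the HTML read from stdin
    grep -o 'http://[^"<> ]*' | head -n 1
}

PAGE="$1"                                  # page to scan, given on the command line
TARGET=$(wget -q -O - "$PAGE" | extract_first_url)
[ -n "$TARGET" ] && konqueror "$TARGET" &
```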

The second case, also mentioned in that article, is the opposite of the first. That is, during normal interactive web browsing, you notice an hyperlink pointing to an interesting page, and, from within your browser, you launch a web script that will automatically download that page and perform some more or less complex action on it. This action can be anything you can imagine: download all the images contained in that page, list in a pop-up window all the pages it points to and so on. You are limited only by your scripting skills.
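The "list all the pages it points to" action, for instance, could be sketched like this. kdialog ships with KDE; the grep pattern and the temporary file name are rough assumptions of mine.

```shell
#!/bin/sh
# Sketch: download the page behind the selected link, collect every
# absolute URL it references and show them in a KDE pop-up window.

TMP=/tmp/links.$$                  # temporary file for the link list
wget -q -O - "$1" | grep -o 'http://[^"<> ]*' | sort -u > "$TMP"
kdialog --textbox "$TMP" 500 400   # kdialog is part of KDE
rm -f "$TMP"
```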

Here's an example: mirror a web page and all the pages it points to. Let's assume that you just discovered some new, interesting program. On its home page, a link points directly to the voluminous subsection of the web site containing the complete user manual, and you want to mirror all of the information on your hard disk. The standard tool for these cases is wget, so we don't need to write a new one. However, how do we launch it directly from the web browser, without opening a terminal window and typing the URL by hand? The rest of this article explains how to automate this operation in Konqueror; the example has been tested with the standard KDE, Konqueror and wget tools that come with Red Hat 7.2.

Step 1: Prepare the wget Script

Write a simple shell script that invokes wget with the -m (mirror) option on the first argument, and call it whatever you want. The content of my script is:

    /usr/bin/wget -m -L -t 5 -w 5 "$1"

Put the script in a proper directory (I chose $HOME/bin) and make it executable: chmod 755 <filename>.

Step 2: Add the Script to the KDE Application Menu

Following the guidelines in this paragraph of the KDE user guide, add the script to the KDE menu. Figure 1 shows what I had to write to accomplish this. The string "mymirror" is the one that actually appears in the menu, and the comment is self-explanatory. The really interesting thing in this picture, i.e., the bit of black magic absolutely essential for the correct working of the whole procedure, is the content of the "Command" box:

/home/marco/bin/ %u

Apart from using the complete path to the script, what is important is the %u part; this is what will tell Konqueror to launch the script with the complete URL that we selected as the first argument. Notice also that I checked the Run in terminal option. In this way, a Konsole window will open and run your script, and it will be possible to see what happens.
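Behind the scenes, KDE stores a menu entry of this kind as a .desktop file, which should look roughly like the sketch below. The mymirror script name and its location are my assumptions, matching the example above.

```ini
[Desktop Entry]
Type=Application
Name=mymirror
Comment=Mirror the selected link and its subpages with wget
# %u is replaced by the URL that Konqueror passes to the script
Exec=/home/marco/bin/mymirror %u
# run inside a Konsole window, so the wget output stays visible
Terminal=true
```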

Figure 1. Adding wget to the KDE Menu

Step 3: Launch the Script

Now, to use this script from Konqueror, right-click on the link that you want to mirror (I chose the "Manuals online" link on the Free Software Foundation page for this example) and select the Open with... option. Konqueror will open the window shown in Figure 2, which will allow you to choose "mymirror".

Figure 2. The Open with Option





Re: Smart (Script-Aided) Browsing


Interestingly, some news websites analyse the HTTP headers received from the client to detect whether the client is a command-line tool like wget or curl, or a graphical web browser like Netscape or Explorer. The reply you get from the web server is different depending on the type of client you are using. Presumably this is done deliberately to stop users mirroring content. wget will actually fail to download web pages from such sites. Perhaps somebody knows of a ready-made command-line client which can simulate the HTTP headers of Netscape et al.

Wills (/. 242929) -- unable to login for some reason despite having both cookies and referrer enabled.

reply from the author


The technical solution to your question is here:

Even Konqueror and other browsers allow you to set the User-Agent string at will. As Mr. Chung points out, however, the real problem is:

Warning: the above tip may be considered circumventing a content licensing mechanism, and there exist anti-social legal systems that have deemed such actions illegal. Check your local legislation. Your mileage may vary.

To me (Marco Fioretti), this is the same as saying that someone will sell me a VHS tape only if I commit to watching it on the most expensive VHS player around, from a particular vendor of his choice. The real solution in these cases is to politely tell the webmaster that he got the whole internet thing backward, that you will never visit that site again, and that you will encourage everybody you know to do the same.

Best Regards,

Marco Fioretti



I think the situation is more complicated than you suggest. The website detects which type of client you are browsing with (i.e., command-line tools such as wget and curl, or graphical browsers such as Netscape and Explorer) by looking at the HTTP headers rather than at the User-Agent alone. This can be proved by using the --user-agent option in wget to set the user-agent to any of the typical values for a graphical browser, e.g., Mozilla/4.0 (compatible; MSIE 4.01; Windows 98). If you try this, wget will still fail to read any pages.

To see the differences in the HTTP headers, try watching the communications between your computer and the site while you are browsing, using something like either tcpdump host port 80 -w - | strings, or strace -f on the process id of a junkbuster proxy. Graphical browsers like Netscape and Explorer supply HTTP headers such as Proxy-Connection, Accept, Accept-Encoding, Accept-Charset, etc. Command-line tools like wget and curl don't supply these headers. The differences are enough for the server to be able to recognise the browser type and react differently in each case.

On your second point regarding the legality of wget, I do not know how the law applies in your country, but in the UK intention is the decisive factor. If your intention is to read a public website, and you have not previously entered a contract to use only one specified method of reading that website, then you are free to use any reading method of your choice on your own computer, as long as it causes no loss or damage to anyone else and their property. There is no UK case law on methods of reading a website, so UK statutes provide the default legal framework. If, however, your intention was to use wget to copy a website and sell the copied content to others, then you would be committing several offences under UK law, including copyright infringement.

Wills (still unable to login despite having enabled both cookies and referrer)
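A simpler way to inspect the request headers a command-line client sends, without reaching for tcpdump or strace, is curl's verbose mode, which prefixes each outgoing request line with '>'. The host below is only a placeholder:

```shell
# Show the request headers curl sends; comparing them with what a
# graphical browser produces reveals what the server is keying on.
curl -s -v -o /dev/null http://www.example.com/ 2>&1 | grep '^> '
```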

Re: complications



Thank you for your explanation. I know that some webmasters prefer to spend their time applying such measures rather than providing portable content.

In such cases, I can't help but repeat my suggestion to boycott such sites.

This assumes, of course, that wget and friends are used only for personal purposes, i.e., reading a public website at your convenience. Copyright infringement is a crime and must be avoided or prosecuted.


Marco Fioretti

(now searching for a wget-like application providing all the missing headers; will report...)

Re: complications


man wget

       --header=additional-header
           Define an additional-header to be passed to the HTTP
           servers. Headers must contain a ':' preceded by one or
           more non-blank characters, and must not contain newlines.

           You may define more than one additional header by
           specifying --header more than once.

               wget --header='Accept-Charset: iso-8859-2' \
                    --header='Accept-Language: hr'

           Specification of an empty string as the header value
           will clear all previous user-defined headers.
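Putting that man page entry to work, a browser-impersonating invocation might look like the sketch below. The header values mimic what a graphical browser of the period would send and are my assumptions, as is the placeholder URL.

```shell
# Make wget's request resemble a graphical browser by supplying both
# the User-Agent string and the extra headers that command-line
# clients normally omit.
wget --user-agent='Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)' \
     --header='Accept: text/html, image/gif, image/jpeg, */*' \
     --header='Accept-Language: en' \
     --header='Accept-Charset: iso-8859-1' \
     http://www.example.com/
```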