Smart (Script-Aided) Browsing

How to mirror a web site from within your browser.
Step 4: Go Have a Nap

That's it! Now Konqueror will open a Konsole and start the script with the complete URL ("wgetscript.sh http://www.fsf.org/manual/manual.html" in my example). You can browse some other page or do whatever you want, and when you're done, the pages you wanted to read will be available on your hard disk.
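In case you want to check it against your own version, here is a minimal sketch of what such a wrapper might look like; the actual wgetscript.sh built in the earlier steps may differ, and the download directory used here is only an assumption:

    #!/bin/sh
    # wgetscript.sh -- hypothetical sketch of the wrapper Konqueror runs.
    # Konqueror appends the URL of the current page as the first argument.
    mkdir -p "$HOME/mirror"        # assumed download directory
    cd "$HOME/mirror" || exit 1
    wget -m "$1"                   # mirror the page recursively (see below)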

As shown in Figure 3, thanks to the -m (mirroring) option, wget will first download and save on disk the URL it was given, then parse it, download all the pages it points to, and so on, recursively. Be very cautious with this (or any other automatic web navigation tool, for that matter), and consult the wget manual to tune it to your needs, your proxy settings and your available bandwidth.

Figure 3. The wget Mirroring Option
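For example, a more considerate invocation than a bare wget -m could look like this (all standard wget options; the values are only illustrative):

    # Stay below the starting directory, go at most two levels deep and
    # pause one second between requests, to be gentle on the server:
    wget -m --no-parent --level=2 --wait=1 \
        http://www.fsf.org/manual/manual.html

wget also honours the usual http_proxy environment variable, so setting it (to, say, the hypothetical http://proxy.example.com:8080/) before running the command above takes care of the proxy settings just mentioned.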

Step 5: Enjoy the Result

When mirroring, wget creates a directory with the same name as the web server (www.fsf.org in this case) and puts everything in there. The last picture, Figure 4, is a listing of that directory made while wget was still working. As you can see, all the subdirectories present on the web site are preserved, and all the relative links are corrected automatically, to allow proper navigation among the mirrored pages.

Figure 4. The Directory wget Mirrored
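One caveat: links that are strictly relative work in the local copy as saved, but absolute links still point back to the live site. If you find that is the case with your mirrored pages, the -k (--convert-links) option tells wget to rewrite all links for offline browsing:

    # Mirror and convert links, so the local copy is navigable offline:
    wget -m -k http://www.fsf.org/manual/manual.html
    ls www.fsf.org/manual/         # the mirrored tree ends up here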

Conclusion

I have shown in detail how to launch shell scripts directly from Konqueror. This is not one of Konqueror's best-documented features; at least, it's not the easiest one to find. I learned how to do it a couple of years ago, but I have since lost my notes and spent half a day on the KDE and Konqueror sites without success. I am really grateful to David Faure for giving me all the information I needed.

I am still trying to add this capability to other popular browsers, especially Mozilla and Galeon. I haven't had any success so far, because (at least in the versions shipped with Red Hat 7.2) these browsers are missing the "Open with" menu option that did the trick in Konqueror. Any suggestions or pointers to relevant documentation are highly appreciated.

______________________

Articles about Digital Rights and more at http://stop.zona-m.net
CV, talks and bio at http://mfioretti.com

Comments

Re: Smart (Script-Aided) Browsing

Interestingly, some news websites like www.ananova.com analyse the http headers received from the client to detect whether the client is a command-line tool like wget or curl, or a graphical web browser like Netscape or Explorer. The reply you get from the web server is different depending on the type of client you are using. Presumably this is done deliberately to stop users mirroring content. wget will actually fail to download web pages from www.ananova.com. Perhaps somebody knows of a ready-made command-line client which can simulate the http headers of Netscape et al.

Wills (/. 242929) -- unable to log in for some reason despite having both cookies and referrer enabled.

reply from the author

The technical solution to your question is here:

http://www.linuxgazette.com/issue70/chung.html

Even Konqueror and other browsers allow you to set the User-Agent string at will (an example with wget follows below). As Mr. Chung points out, however, the real problem is:

Warning: the above tip may be considered circumventing a content licensing mechanism and there exist anti-social legal systems that have deemed these actions to be illegal. Check your local legislature. Your mileage may vary.

To me (Marco Fioretti), this is the same as saying that someone will sell me a VHS tape only if I commit to watching it only on the most expensive VHS player around, from a particular vendor of his choice. The real solution in these cases is to politely tell the webmaster that he has got the whole Internet thing backward, that you will never visit that site again, and that you will encourage everybody you know to do the same.
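Coming back to the practical side, with wget itself the string is set like this (the browser identification is just one typical value, and www.example.com is a stand-in for whatever site you are fetching):

    # Present wget to the server as an old Internet Explorer:
    wget --user-agent='Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)' \
         http://www.example.com/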

Best Regards,

Marco Fioretti

www.freesoftware.fsf.org/rule/

complications

I think the situation is more complicated than you suggest. The website www.ananova.com detects which type of client you are browsing with (i.e., command-line tools such as wget and curl, or graphical browsers such as Netscape and Explorer) by looking at the http headers rather than at the user-agent alone. This can be proved by using the --user-agent option in wget to set the user-agent to any of the typical values for a graphical browser, e.g., Mozilla/4.0 (compatible; MSIE 4.01; Windows 98). If you try this, wget will still fail to read any pages from www.ananova.com.

To see the differences in the http headers, watch the communications between your computer and www.ananova.com while you are browsing the site, using something like either tcpdump -w - host www.ananova.com and port 80 | strings, or strace -f on the process id of a junkbuster proxy. Graphical browsers like Netscape and Explorer supply http headers such as Proxy-Connection, Accept, Accept-Encoding, Accept-Charset, etc. Command-line tools like wget and curl don't supply these headers. The differences are enough for www.ananova.com to recognise the browser type and react differently in each case.

On your second point regarding the legality of wget, I do not know how the law applies in your country, but in the UK intention is the decisive factor. If your intention is to read a public website, and you have not previously entered a contract to use only one specified method of reading that website, then you are free to use any reading method of your choice on your own computer, as long as it causes no loss or damage to anyone else or their property. There is no UK case law on methods of reading a website, so UK statutes provide the default legal framework. If, however, your intention was to use wget to copy a website and sell the copied content to others, then you would be committing several offences under UK law, including copyright infringement.

Wills (still unable to log in despite having enabled both cookies and referrer)
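P.S. Spelling the capture pipeline out in full, for anyone who wants to reproduce this (root privileges assumed; -s 0 makes older tcpdump versions record whole packets rather than only the first 68 bytes):

    # Dump raw traffic to and from the site and keep the readable text,
    # which includes the plain-text http request and response headers:
    tcpdump -s 0 -w - host www.ananova.com and port 80 | strings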

Re: complications

Wills,

thank you for your explanation. I know that some webmasters prefer to spend their time applying such measures, rather than providing portable content.

In such cases, I can't help but repeat my suggestion to boycott such sites.

This assumes, of course, that wget and friends are used only for personal purposes, i.e., reading a public website at your convenience. Copyright infringement is a crime and must be avoided or prosecuted.

Regards,

Marco Fioretti

(now searching for a wget-like application that provides all the missing headers; will report...)

Re: complications

man wget

[...]

       --header=additional-header
           Define an additional-header to be passed to the HTTP servers.
           Headers must contain a : preceded by one or more non-blank
           characters, and must not contain newlines.

           You may define more than one additional header by specifying
           --header more than once.

               wget --header='Accept-Charset: iso-8859-2' \
                    --header='Accept-Language: hr' \
                    http://fly.srk.fer.hr/

           Specification of an empty string as the header value will clear
           all previous user-defined headers.

[...]
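Building on --header, here is a sketch of the kind of invocation being sought above. The header values are only plausible guesses at what a graphical browser sends, and whether they actually satisfy www.ananova.com's checks is untested:

    # Make wget identify and behave more like a graphical browser:
    wget --user-agent='Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)' \
         --header='Accept: text/html, image/gif, image/jpeg, */*' \
         --header='Accept-Language: en' \
         --header='Accept-Charset: iso-8859-1' \
         http://www.ananova.com/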
