Smart (Script-Aided) Browsing
That's it! Now Konqueror will open a Konsole and start the script with the complete URL ("wgetscript.sh http://www.fsf.org/manual/manual.html" in my example). You can browse some other page or do whatever you want, and when you're done, the pages you wanted to read will be available on your hard disk.
As shown in Figure 3, thanks to the -m (mirroring) option, wget will first download and save on disk the URL it was given, then parse it, download all the pages it points to and so on, recursively. Be very cautious with this (or any other automatic web navigation tool, for that matter), and consult the wget manual to tune it to your needs, proxy settings and bandwidth.
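The script itself is not reproduced in this section, so here is only a minimal sketch of what a wrapper like wgetscript.sh might look like. The -m option comes from the text above; the function name mirror_url, the WGET override variable, and the extra -np (stay below the starting directory) and --wait=1 (pause one second between requests) options are assumptions added for illustration, in the spirit of the caution recommended above. Tune them to your own needs.

```shell
#!/bin/sh
# Hypothetical sketch of a wrapper such as wgetscript.sh (not the author's script).
# WGET can be overridden for dry runs, e.g. WGET=echo (an assumption of this sketch).

mirror_url() {
    url=$1
    # Accept only URLs with a scheme wget can mirror.
    case "$url" in
        http://*|https://*|ftp://*) ;;
        *)
            echo "usage: wgetscript.sh URL" >&2
            return 2
            ;;
    esac
    # -m        mirror recursively (the option discussed in the article)
    # -np       never ascend to the parent directory (assumed addition)
    # --wait=1  wait one second between retrievals (assumed addition)
    "${WGET:-wget}" -m -np --wait=1 "$url"
}

# When started with an argument (as Konqueror would do), mirror that URL.
[ "$#" -eq 0 ] || mirror_url "$1"
```

Running the script with WGET=echo set in the environment prints the wget arguments instead of starting a download, which is a cheap way to check what would be executed.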
When mirroring, wget creates a directory with the same name as the web server (www.fsf.org in this case) and puts everything in there. The last picture, Figure 4, is a listing of that directory made while wget was still working. As you can see, all the subdirectories present on the web site are preserved, and all the relative links are corrected automatically, to allow proper navigation among the mirrored pages.
I have shown in detail how to launch shell scripts directly from Konqueror. This is not one of Konqueror's best-documented features; at least, it's not the easiest one to find. I learned how to do it a couple of years ago, but I have since lost my notes and spent half a day on the KDE and Konqueror sites without success. I am really grateful to David Faure for giving me all the information I needed.
I am still trying to add this capability to other popular browsers, especially Mozilla and Galeon. I haven't had success so far, because (at least in the versions shipped with Red Hat 7.2) these browsers are missing the "Open with" menu option that did the trick in Konqueror. Any suggestions or pointers to relevant documentation are highly appreciated.
email: linuxdesk@inwind.it
Articles about Digital Rights and more at http://stop.zona-m.net
CV, talks and bio at http://mfioretti.com
Comments
Re: Smart (Script-Aided) Browsing
Interestingly, some news websites like www.ananova.com analyse the HTTP headers received from the client to detect whether the client is a command-line tool like wget or curl or a graphical web browser like Netscape or Explorer. The reply you get from the web server is different depending on the type of client you are using. Presumably this is done deliberately to stop users mirroring content. wget will actually fail to download web pages from www.ananova.com. Perhaps somebody knows of a ready-made command-line client which can simulate the HTTP headers of Netscape et al.
Wills (/. 242929) -- unable to log in for some reason despite having both cookies and referrer enabled.
reply from the author
The technical solution to your question is here:
http://www.linuxgazette.com/issue70/chung.html
Even Konqueror and other browsers allow you to set the User-Agent string at will. As Mr. Chung points out, however, the real problem is:
(Warning: the above tip may be considered circumventing a content licensing mechanism and there exist anti-social legal systems that have deemed these actions to be illegal. Check your local legislature. Your mileage may vary.)
To me (Marco Fioretti), this is the same as saying that someone will sell me a VHS tape only if I commit to watching it only with the most expensive VHS player around, from a particular vendor of his choice. The real solution in these cases is to politely tell the webmaster that he got the whole internet thing backward, that you will never visit that site again, and that you will encourage everybody you know to do the same.
Best Regards,
Marco Fioretti
www.freesoftware.fsf.org/rule/
complications
I think the situation is more complicated than you suggest. The website www.ananova.com detects which type of client you are browsing with (i.e., command-line tools such as wget and curl, or graphical browsers such as Netscape and Explorer) by looking at the full set of HTTP headers rather than at the User-Agent string alone. You can verify this by using the --user-agent option in wget to set the User-Agent to any of the typical values for a graphical browser, e.g., Mozilla/4.0 (compatible; MSIE 4.01; Windows 98). Even then, wget will fail to read any pages from www.ananova.com. To see the differences in the HTTP headers, watch the traffic between your computer and www.ananova.com while you browse the site, using something like tcpdump host www.ananova.com port 80 -w - | strings, or strace -f on the process ID of a junkbuster proxy. Graphical browsers like Netscape and Explorer supply HTTP headers such as Proxy-Connection, Accept, Accept-Encoding, Accept-Charset, etc. Command-line tools like wget and curl don't supply these headers. The differences are enough for www.ananova.com to recognise the browser type and react differently in each case.
On your second point regarding the legality of wget, I do not know how the law applies in your country, but in the UK intention is the decisive factor. If your intention is to read a public website, and you have not previously entered a contract to use only one specified method of reading that website, then you are free to use any reading method of your choice on your own computer as long as it causes no loss or damage to anyone else or their property. There is no UK case law on methods of reading a website, so UK statutes provide the default legal framework.
If, however, your intention is to use wget to copy a website and sell the copied content to others, then you are committing several offences under UK law, including copyright infringement.
Wills (still unable to log in despite having enabled both cookies and referrer)
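As an alternative to tcpdump or strace, wget can print the request it sends when run with its -d (debug) option, provided your copy was built with debug support. The sketch below is an illustration only: the function name show_request_headers is hypothetical, and it assumes the debug log brackets the request with the "---request begin---" and "---request end---" marker lines.

```shell
#!/bin/sh
# Hypothetical helper: read a wget debug log on stdin and print only the
# HTTP request wget sent, i.e. the lines between the two marker lines
# (the markers themselves are printed as well).
show_request_headers() {
    sed -n '/^---request begin---$/,/^---request end---$/p'
}

# Example use (needs network access and a wget built with debug support):
#   wget -d -O /dev/null http://www.fsf.org/ 2>&1 | show_request_headers
```

Comparing this output with what a graphical browser sends shows which headers a command-line client leaves out.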
Re: complications
Wills,
thank you for your explanation. I know that some webmasters prefer to spend their time applying such measures, rather than providing portable content.
In such cases, I can't help but repeat my suggestion to boycott such sites.
This assumes that wget and friends are used only for personal purposes, i.e., reading a public website at your convenience. Copyright infringement is a crime, and must be avoided or prosecuted.
Regards,
Marco Fioretti
(now searching for a wget-like application providing all the missing headers, will report...)
Re: complications
man wget
[...]
--header=additional-header
    Define an additional-header to be passed to the HTTP
    servers. Headers must contain a : preceded by one or
    more non-blank characters, and must not contain
    newlines.
    You may define more than one additional header by
    specifying --header more than once.
        wget --header='Accept-Charset: iso-8859-2' \
             --header='Accept-Language: hr' \
             http://fly.srk.fer.hr/
    Specification of an empty string as the header value
    will clear all previous user-defined headers.
[...]
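Putting the pieces of this thread together, the --user-agent and --header options can be combined so that wget sends some of the headers a graphical browser would. This is a hedged sketch, not a tested recipe for any particular site: the function name fetch_like_browser and the WGET override variable are hypothetical, the User-Agent string is the one quoted earlier in the thread, and the Accept, Accept-Charset and Accept-Language values are illustrative guesses. Accept-Encoding is left out on purpose, since a server might then reply with compressed content. The legal caveat quoted earlier applies here too.

```shell
#!/bin/sh
# Hypothetical sketch: fetch one URL while sending browser-like headers.
# WGET can be overridden (e.g. WGET=echo) to print the arguments instead.
fetch_like_browser() {
    url=$1
    "${WGET:-wget}" \
        --user-agent='Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)' \
        --header='Accept: text/html, */*' \
        --header='Accept-Charset: iso-8859-1' \
        --header='Accept-Language: en' \
        "$url"
}

# Example use (needs network access):
#   fetch_like_browser http://fly.srk.fer.hr/
```

Whether a given server accepts such a request depends on which headers it checks, so the list may need adjusting after inspecting what a real browser sends.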