Downloading an Entire Web Site with wget
If you ever need to download an entire Web site, perhaps for off-line viewing, wget can do the
job. For example:
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains website.org \
--no-parent \
www.website.org/tutorials/html/
This command downloads the Web site www.website.org/tutorials/html/.
The options are:
- --recursive: download the entire Web site.
- --domains website.org: don't follow links outside website.org.
- --no-parent: don't follow links outside the directory tutorials/html/.
- --page-requisites: get all the elements that compose the page (images, CSS and so on).
- --html-extension: save files with the .html extension.
- --convert-links: convert links so that they work locally, off-line.
- --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
- --no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed).
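The command above can be kept in a small shell function so that the option list lives in one place. What follows is only a minimal sketch and not part of the original tip: the function name mirror_site and the DRY_RUN variable are made up for illustration, and example.org stands in for the real site. With DRY_RUN=1 the function prints the command instead of running it.

```shell
# Hypothetical helper: mirror_site DOMAIN START_URL
# Runs wget with the options described above.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_site() {
    domain=$1
    start_url=$2
    # Build the argument list in the positional parameters.
    set -- wget \
        --recursive \
        --no-clobber \
        --page-requisites \
        --html-extension \
        --convert-links \
        --restrict-file-names=windows \
        --domains "$domain" \
        --no-parent \
        "$start_url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

To preview the command without touching the network, call it as: DRY_RUN=1 mirror_site example.org www.example.org/tutorials/html/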

Comments
wget -m http://website.com
Easier:
wget -m http://website.com
Please use example.com.
Thanks a lot! I needed to mirror a site on our LAN, and this kept me from having to refamiliarize myself with the man page. But PLEASE use example.com as a placeholder domain name. It is reserved for exactly that purpose.
options that you should add to the main article
It would be a VERY good idea to add:
--wait=9 --limit-rate=10K
to your command so you don't kill the server you are trying to download from.
The --wait option sets the number of seconds to wait between retrievals; the --limit-rate option caps the download speed, and so how much of the server's bandwidth you are sucking up. Both are good ideas if you don't want to be blacklisted by the server's admin.
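Politeness has a cost in time, which can be estimated with a little arithmetic. The sketch below is a rough lower bound only: the function name and the sample figures (200 files, 5000 kilobytes in total) are invented for illustration, and it assumes wget pauses once between each pair of retrievals and downloads at exactly the capped rate.

```shell
# Rough lower bound, in seconds, for a rate-limited mirror.
# All inputs are values you would have to guess or measure:
#   $1 = number of files
#   $2 = --wait value in seconds
#   $3 = total download size in kilobytes
#   $4 = --limit-rate value in kilobytes per second
estimate_mirror_seconds() {
    files=$1
    wait_s=$2
    total_kb=$3
    rate_kb=$4
    # One pause between each pair of consecutive retrievals.
    pause=$(( (files - 1) * wait_s ))
    # Transfer time at the capped rate, rounded up.
    transfer=$(( (total_kb + rate_kb - 1) / rate_kb ))
    echo $(( pause + transfer ))
}

estimate_mirror_seconds 200 9 5000 10
```

With these sample figures the estimate is 2291 seconds (1791 seconds of waiting plus 500 seconds of transfer), or about 38 minutes.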
Thanks
Which wget options should I use to retrieve all the pages linked from a main search page? I've been trying for days to achieve it, using Linux.
Faleminderit per postimin (thanks for posting).
Good work
I was always looking for suggestions on the appropriate switches to use when downloading a complete website. This document was very helpful. Thanks, and keep up the good work.
Great
Very good instructions
My use
Just to start, this post is most helpful. Dashamir Hoxha, thanks a lot!
The reason for writing this is that downloading multiple sites in sequence takes a long time, so I set this up to download multiple sites easily. And yes, it would be more efficient to put the pipe command in the script file.
What I'm using it for: downloading multiple websites (manga specifically).
Step 1: put the wget command in a script file (for ease of use):
#!/bin/bash
# $1 = target website, $2 = scan depth; quoted so unusual URLs survive intact
wget -r --page-requisites --convert-links --no-parent -l "$2" -U Mozilla "$1"
That is what I put in mine, which I'll call "meget". Make it executable with: chmod +x meget
How to use: [script-name] [target website] [scan depth]
Step 2: make a file with all the websites you want to download, one per line. I'll call mine "zone".
Step 3: run the command:
xargs -P 3 -I {} ./meget {} 1000 < zone
To increase the number of parallel downloads, change the 3 to whatever number you need. Keep in mind not to take a list of 300 sites and download them all at once, as this may cause problems.
Be sure to also set the 1000 to the depth you need. In my case, to download a 1500-page manga I need to set it to 1500 or more.
While it is running, it shows only one download at a time. As long as it is still running, it will keep showing output.
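Steps 2 and 3 can be tried out safely before pointing them at real sites. The sketch below is a variant of the pipeline above, not the commenter's exact setup: the function name run_list and the idea of passing the command as an argument are assumptions made for illustration. It reads one URL per line, skips blank lines, and runs up to the given number of jobs in parallel with xargs -P (available in GNU and BSD xargs, but not required by POSIX).

```shell
# Hypothetical helper: run_list LIST_FILE JOBS DEPTH COMMAND
# Runs "COMMAND URL DEPTH" for every non-blank line of LIST_FILE,
# with at most JOBS commands running at the same time.
run_list() {
    list_file=$1
    jobs=$2
    depth=$3
    cmd=$4
    # Drop blank lines, then hand each remaining line to xargs as one URL.
    grep -v '^[[:space:]]*$' "$list_file" |
        xargs -P "$jobs" -I {} "$cmd" {} "$depth"
}
```

Passing echo as the command gives a preview: run_list zone 3 1000 echo prints each URL followed by the depth. Replacing echo with ./meget starts the real downloads.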
Just what I've been looking for
Just this week I needed to make a site available offline so I can refer to it while working at home. And yay! I have wget and love using it already. However, I advise taking note of how wget saves the files: if it's a site with lots of PHP pages, you'll have to change every .php reference to .php.html. Not to fear, though; your computer can already do the hard work for you. Just type:
grep -rl --include='*.html' '\.php' . | xargs perl -pi~ -e 's/\.php(?!\.html)/.php.html/g'
Et voilà: your pages will open and link without a hitch. Really interesting and marvelous, this Linux thing.
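Two details of such a substitution are easy to get wrong: an unescaped dot matches any character, and running the rewrite a second time can append .html again. The sketch below shows one way to avoid both, using sed rather than perl. It is a stand-in chosen for illustration, not the commenter's command, and the function name and sample HTML line are invented.

```shell
# Rewrite ".php" links to ".php.html". Links that already end in
# ".php.html" are matched whole and left as they are, so the filter
# can be applied more than once without piling up extensions.
fix_php_links() {
    sed -E 's/\.php(\.html)?/.php.html/g'
}

sample='<a href="news.php">why php</a> <a href="old.php.html">old</a>'
printf '%s\n' "$sample" | fix_php_links
```

Only the first link changes, becoming news.php.html. The words "why php" are untouched because the pattern requires a literal dot, and old.php.html is already in its final form.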