Archiving and Compression

by Scott Granneman
Chapter 8: Archiving and Compression

Although the differences are sometimes made opaque in casual conversation, there is in fact a complete difference between archiving files and compressing them. Archiving means that you take 10 files and combine them into one file, with no difference in size. If you start with 10 100KB files and archive them, the resulting single file is 1000KB. On the other hand, if you compress those 10 files, you might find that the resulting files range from only a few kilobytes to close to the original size of 100KB, depending upon the original file type.

Note - In fact, you might end up with a bigger file during compression! If the file is already compressed, compressing it again adds extra overhead, resulting in a slightly bigger file.
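
The effect described in this note is easy to reproduce. The following is only an illustrative sketch, not something from the original chapter: it assumes that 100KB of random bytes is a fair stand-in for an already-compressed file (random data is effectively incompressible), and the file names are made up.

```shell
set -eu
work=$(mktemp -d)   # scratch directory so nothing else is touched
cd "$work"

# 100KB of random bytes: a stand-in for a file that is already compressed
head -c 102400 /dev/urandom > random.bin

# -c writes the compressed data to standard output, leaving the original alone
gzip -c random.bin > random.bin.gz

orig=$(wc -c < random.bin | tr -d ' ')
comp=$(wc -c < random.bin.gz | tr -d ' ')
echo "original: $orig bytes"
echo "gzipped:  $comp bytes"
```

The gzipped copy should come out a little larger than the original, because gzip still has to add its own header and bookkeeping around data it cannot shrink.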

All of the archive and compression formats in this chapter — zip, gzip, bzip2, and tar — are popular, but zip is probably the world's most widely used format. That's because of its almost universal use on Windows, but zip and unzip are well supported among all major (and most minor) operating systems, so things compressed using zip also work on Linux and Mac OS. If you're sending archives out to users and you don't know which operating systems they're using, zip is a safe choice to make.

gzip was designed as an open-source replacement for an older Unix program, compress. It's found on virtually every Unix-based system in the world, including Linux and Mac OS X, but it is much less common on Windows. If you're sending files back and forth to users of Unix-based machines, gzip is a safe choice.

The bzip2 command is the new kid on the block. Designed to supersede gzip, bzip2 creates smaller files, but at the cost of speed. That said, computers are so fast nowadays that most users won't notice much of a difference between the times it takes gzip or bzip2 to compress a group of files.

Note - Linux Magazine published a good article comparing several different compression formats, which you can find at www.linux-mag.com/content/view/1678/43/.

zip, gzip, and bzip2 are focused on compression (although zip also archives). The tar command does one thing — archive — and it has been doing it for a long time. It's found almost solely on Unix-based machines. You'll definitely run into tar files (also called tarballs) if you download source code, but almost every Linux user can expect to encounter a tarball some time in his career.

Archive and Compress Files Using zip

     zip

zip both archives and compresses files, thus making it great for sending multiple files as email attachments, backing up items, or for saving disk space. Using it is simple. Let's say you want to send a TIFF to someone via email. A TIFF image is uncompressed, so it tends to be pretty large. Zipping it up should help make the email attachment a bit smaller.

Note - When using ls -l, I'm only showing the information needed for each example.

   $ ls -lh
   -rw-r--r-- scott scott 1006K young_edgar_scott.tif
   $ zip grandpa.zip young_edgar_scott.tif
   adding: young_edgar_scott.tif (deflated 19%)
   $ ls -lh
   -rw-r--r-- scott scott 1006K young_edgar_scott.tif
   -rw-r--r-- scott scott 819K grandpa.zip

In this case, you shaved off about 200KB on the resulting zip file, or 19%, as zip helpfully informs you. Not bad. You can do the same thing for several images.

   $ ls -l
   -rw-r--r-- scott scott 251980 edgar_intl_shoe.tif
   -rw-r--r-- scott scott 1130922 edgar_baby.tif
   -rw-r--r-- scott scott 1029224 young_edgar_scott.tif
   $ zip grandpa.zip edgar_intl_shoe.tif edgar_baby.tif young_edgar_scott.tif
   adding: edgar_intl_shoe.tif (deflated 4%)
   adding: edgar_baby.tif (deflated 12%)
   adding: young_edgar_scott.tif (deflated 19%)
   $ ls -l
   -rw-r--r-- scott scott 251980 edgar_intl_shoe.tif
   -rw-r--r-- scott scott 1130922 edgar_baby.tif
   -rw-r--r-- scott scott 2074296 grandpa.zip
   -rw-r--r-- scott scott 1029224 young_edgar_scott.tif

It's not too polite, however, to zip up individual files this way. For three files, it's not so bad. The recipient will unzip grandpa.zip and end up with three individual files. If the payload was 50 files, however, the user would end up with files strewn everywhere. Better to zip up a directory containing those 50 files so when the user unzips it, he's left with a tidy directory instead.

   $ ls -lF 
   drwxr-xr-x scott scott edgar_scott/
   $ zip -r grandpa.zip edgar_scott
   adding: edgar_scott/ (stored 0%)
   adding: edgar_scott/edgar_baby.tif (deflated 12%)
   adding: edgar_scott/young_edgar_scott.tif (deflated 19%)
   adding: edgar_scott/edgar_intl_shoe.tif (deflated 4%)
   $ ls -lF
   drwxr-xr-x scott scott   160 edgar_scott/
   -rw-r--r-- scott scott 2074502 grandpa.zip

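Whether you're zipping up a file, several files, or a directory, the pattern is the same: the zip command, followed by the name of the Zip file you're creating, and finished with the item(s) you're adding to the Zip file. For a directory, add the -r option so zip descends into it; without -r, only the empty directory entry is stored.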
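
The directory case can be tried safely in a scratch directory. This is a minimal sketch under a few assumptions: the file names are placeholders rather than the TIFFs from the example, and both zip and unzip must be installed (the script skips itself if they are not).

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical directory and files standing in for the images
mkdir edgar_scott
echo "first placeholder"  > edgar_scott/one.txt
echo "second placeholder" > edgar_scott/two.txt

if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    # -r descends into the directory; -q suppresses the "adding:" lines
    zip -r -q grandpa.zip edgar_scott
    # List the archive: every entry should start with edgar_scott/
    unzip -l grandpa.zip
else
    echo "zip/unzip not installed; skipping" >&2
fi
```

Because every entry in the listing sits under edgar_scott/, the recipient ends up with one tidy directory.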

Get the Best Compression Possible with zip

     -[0-9]

It's possible to adjust the level of compression that zip uses when it does its job. The zip command uses a scale from 0 to 9, in which 0 means "no compression at all" (which is like tar, as you'll see later), 1 means "do the job quickly, but don't bother compressing very much," and 9 means "compress the heck out of the files, and I don't mind waiting a bit longer to get the job done." The default is 6, but modern computers are fast enough that it's probably just fine to use 9 all the time.

Say you're interested in researching Herman Melville's Moby-Dick, so you want to collect key texts to help you understand the book: Moby-Dick itself, Milton's Paradise Lost, and the Bible's book of Job. Let's compare the results of different compression rates.

   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   $ zip -0 moby.zip *.txt
   adding: job.txt (stored 0%)
   adding: moby-dick.txt (stored 0%)
   adding: paradise_lost.txt (stored 0%)
   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 1848444 moby.zip
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   $ zip -1 moby.zip *.txt
   updating: job.txt (deflated 58%)
   updating: moby-dick.txt (deflated 54%)
   updating: paradise_lost.txt (deflated 50%)
   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 869946 moby.zip
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   $ zip -9 moby.zip *.txt
   updating: job.txt (deflated 65%)
   updating: moby-dick.txt (deflated 61%)
   updating: paradise_lost.txt (deflated 56%)
   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 747730 moby.zip
   -rw-r--r-- scott scott 508925 paradise_lost.txt

In tabular format, the results look like this:

   Book               zip -0    zip -1    zip -9
   ----------------   -------   -------   -------
   Moby-Dick          0%        54%       61%
   Paradise Lost      0%        50%       56%
   Job                0%        58%       65%
   Total (in bytes)   1848444   869946    747730

The results you see here would vary depending on the file types (text files typically compress well) and the sizes of the original files, but this gives you a good idea of what you can expect. Unless you have a really slow machine or you're just naturally impatient, you should just use -9 all the time to get the maximum compression.

Note - If you want to be clever, define an alias in your .bashrc file that looks like this:

alias zip='zip -9'

That way you'll always use -9 and won't have to think about it.

Password-Protect Compressed Zip Archives

     -P

     -e

The Zip program allows you to password-protect your Zip archives using the -P option. You shouldn't use this option. It's completely insecure, as you can see in the following example (the actual password is 12345678):

   $ zip -P 12345678 moby.zip *.txt

Because you had to specify the password on the command line, anyone viewing your shell's history (and you might be surprised how easy it is for other users to do so) can see your password in all its glory. Don't use the -P option!

Instead, just use the -e option, which encrypts the contents of your Zip file and also uses a password. The difference, however, is that you're prompted to type the password in, so it won't be saved in the history of your shell events.

   $ zip -e moby.zip *.txt
   Enter password:
   Verify password:
   adding: job.txt (deflated 65%)
   adding: moby-dick.txt (deflated 61%)
   adding: paradise_lost.txt (deflated 56%)

The only part of this that's saved in the shell is zip -e moby.zip *.txt. The actual password you type disappears into the ether, unavailable to anyone viewing your shell history.

Caution - The security offered by the Zip program's password protection isn't that great. In fact, it's pretty easy to find a multitude of tools floating around the Internet that can quickly crack a password-protected Zip archive. Think of password-protecting a Zip file as the difference between writing a message on a postcard and sealing it in an envelope: It's good enough for ordinary folks, but it won't stop a determined attacker.

Also, the version of zip included with some Linux distros may not support encryption, in which case you'll see a zip error: "encryption not supported." The only solution: recompile zip from source. Ugh.

Unzip Files

     unzip

Expanding a Zip archive isn't hard at all. To create a zipped archive, use the zip command; to expand that archive, use the unzip command.

   $ unzip moby.zip
   Archive: moby.zip
   inflating: job.txt
   inflating: moby-dick.txt
   inflating: paradise_lost.txt

The unzip command helpfully tells you what it's doing as it works. To get even more information, add the -v option (which stands, of course, for verbose).

   $ unzip -v moby.zip
   Archive: moby.zip
   Length   Method  Size   Ratio  CRC-32   Name
   -------  ------  ------ -----  ------   ----
   102519   Defl:X   35747  65%  fabf86c9  job.txt
   1236574  Defl:X  487553  61%  34a8cc3a  moby-dick.txt
   508925   Defl:X  224004  56%  6abe1d0f  paradise_lost.t
   -------          ------  ---            -------
   1848018          747304  60%            3 files

There's quite a bit of useful data here, including the method used to compress the files, the percentage by which each file shrank, and the cyclic redundancy check (CRC) value used to detect errors.

List Files That Will Be Unzipped

     -l

Sometimes you might find yourself looking at a Zip file and not remembering what's in that file. Or perhaps you want to make sure that a file you need is contained within that Zip file. To list the contents of a zip file without unzipping it, use the -l option (which stands for "list").

   $ unzip -l moby.zip
   Archive: moby.zip
   Length     Date    Time   Name
   --------   ----    ----   ----
         0  01-26-06  18:40  bible/
    207254  01-26-06  18:40  bible/genesis.txt
    102519  01-26-06  18:19  bible/job.txt
   1236574  01-26-06  18:19  moby-dick.txt
    508925  01-26-06  18:19  paradise_lost.txt
   --------                  -------
   2055272                   5 files

From these results, you can see that moby.zip contains two files (moby-dick.txt and paradise_lost.txt) and a directory (bible), which itself contains two files, genesis.txt and job.txt. Now you know exactly what will happen when you expand moby.zip. Using the -l option helps prevent inadvertently unzipping a file that spews out 100 files instead of unzipping a directory that contains 100 files. The first leaves you with files strewn pell-mell, while the second is far easier to handle.

Test Files That Will Be Unzipped

     -t

Sometimes zipped archives become corrupted. The worst time to discover this is after you've unzipped the archive and deleted it, only to discover that some or even all of the unzipped contents are damaged and won't open. Better to test the archive first before you actually unzip it by using the -t (for test) option.

   $ unzip -t moby.zip
   Archive: moby.zip
   testing: bible/               OK
   testing: bible/genesis.txt    OK
   testing: bible/job.txt        OK
   testing: moby-dick.txt        OK
   testing: paradise_lost.txt    OK
   No errors detected in compressed data of moby.zip.

You really should use -t every time you work with a zipped file. It's the smart thing to do, and although it might take some extra time, it's worth it in the end.
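
One way to make the test a habit is to let the shell refuse to extract unless the test passes. The sketch below is an illustration only: it builds a tiny practice archive with placeholder contents, assumes zip and unzip are installed, and extracts into a hypothetical directory named extracted.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    # Build a small archive to practice on (contents are placeholders)
    mkdir bible
    echo "In the beginning" > bible/genesis.txt
    echo "Call me Ishmael." > moby-dick.txt
    zip -r -q moby.zip bible moby-dick.txt

    # Extract only if the integrity test passes; -d names the target directory
    if unzip -tq moby.zip; then
        unzip -q moby.zip -d extracted
    else
        echo "moby.zip failed its test; not extracting" >&2
    fi
else
    echo "zip/unzip not installed; skipping" >&2
fi
```

The if statement relies on unzip returning a success status only when the test finds no errors.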

Archive and Compress Files Using gzip

     gzip

Using gzip is a bit easier than zip in some ways. With zip, you need to specify the name of the newly created Zip file or zip won't work; with gzip, though, you can just type the command and the name of the file you want to compress.

   $ ls -l
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   $ gzip paradise_lost.txt
   $ ls -l
   -rw-r--r-- scott scott 224425 paradise_lost.txt.gz

You should be aware of a very big difference between zip and gzip: When you zip a file, zip leaves the original behind so you have both the original and the newly zipped file, but when you gzip a file, you're left with only the new gzipped file. The original is gone.

If you want gzip to leave behind the original file, you need to use the -c (or --stdout or --to-stdout) option, which outputs the results of gzip to the shell, but you need to redirect that output to another file. If you use -c and forget to redirect your output, you get nonsense like this:

[Screenshot: a terminal filled with unreadable binary characters]

Not good. Instead, output to a file.

   $ ls -l
   -rw-r--r-- 1 scott scott 508925 paradise_lost.txt
   $ gzip -c paradise_lost.txt > paradise_lost.txt.gz
   $ ls -lh
   -rw-r--r-- 1 scott scott 497K paradise_lost.txt
   -rw-r--r-- 1 scott scott 220K paradise_lost.txt.gz

Much better! Now you have both your original file and the gzipped version.

Tip - If you accidentally use the -c option without redirecting the output to a file, press Ctrl+C until gzip stops.
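
To convince yourself that the redirected copy really is a faithful compressed version, you can decompress it to standard output and compare it with the original. This is a small sketch with a one-line placeholder file standing in for paradise_lost.txt.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# A one-line placeholder standing in for the real text
echo "Of Man's first disobedience, and the fruit" > paradise_lost.txt

# Compress to standard output and redirect, so the original survives
gzip -c paradise_lost.txt > paradise_lost.txt.gz
ls -l

# Round trip: decompress to standard output and compare with the original
gunzip -c paradise_lost.txt.gz | cmp - paradise_lost.txt && echo "round trip OK"
```

Newer versions of gzip also accept a -k (or --keep) option that keeps the original without any redirection. That option is not covered in this chapter, so check your man page before relying on it.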

Archive and Compress Files Recursively Using gzip

     -r

If you want to use gzip on several files in a directory, just use a wildcard. You might not end up gzipping everything you think you will, however, as this example shows.

   $ ls -F
   bible/ moby-dick.txt paradise_lost.txt
   $ ls -l *
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 508925 paradise_lost.txt

   bible:
   -rw-r--r-- scott scott 207254 genesis.txt
   -rw-r--r-- scott scott 102519 job.txt
   $ gzip *
   gzip: bible is a directory -- ignored
   $ ls -l *
   -rw-r--r-- scott scott 489609 moby-dick.txt.gz
   -rw-r--r-- scott scott 224425 paradise_lost.txt.gz

   bible:
   -rw-r--r-- scott scott 207254 genesis.txt
   -rw-r--r-- scott scott 102519 job.txt

Notice that the wildcard didn't do anything for the files inside the bible directory because gzip by default doesn't walk down into subdirectories. To get that behavior, you need to use the -r (or --recursive) option along with your wildcard.

   $ ls -F
   bible/ moby-dick.txt paradise_lost.txt
   $ ls -l *
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 508925 paradise_lost.txt

   bible:
   -rw-r--r-- scott scott 207254 genesis.txt
   -rw-r--r-- scott scott 102519 job.txt
   $ gzip -r *
   $ ls -l *
   -rw-r--r-- scott scott 489609 moby-dick.txt.gz
   -rw-r--r-- scott scott 224425 paradise_lost.txt.gz

   bible:
   -rw-r--r-- scott scott 62114 genesis.txt.gz
   -rw-r--r-- scott scott 35984 job.txt.gz

This time, every file — even those in subdirectories — was gzipped. However, note that each file is individually gzipped. The gzip command cannot combine all the files into one big file, like you can with the zip command. To do that, you need to incorporate tar, as you'll see in "Archive and Compress Files with tar and gzip."
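
The difference between the two runs can be reproduced in a scratch directory. This sketch uses placeholder files. It also assumes that gzip exits with a warning status when it skips a directory, which is why the first call is followed by "|| true".

```shell
set -eu
work=$(mktemp -d)
cd "$work"

mkdir bible
echo "genesis placeholder" > bible/genesis.txt
echo "moby placeholder"    > moby-dick.txt

# Without -r, gzip skips the directory. The "|| true" keeps "set -e" from
# stopping the script if gzip reports that skip through its exit status.
gzip * || true
find . -type f | sort

# With -r, gzip walks into bible/ and compresses each file it finds
gzip -r bible
find . -type f | sort
```

The second listing should show only .gz files, each one still sitting in its original location.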

Get the Best Compression Possible with gzip

     -[1-9]

Just as with zip, it's possible to adjust the level of compression that gzip uses when it does its job. The gzip command uses a scale from 1 to 9, in which 1 means "do the job quickly, but don't bother compressing very much," and 9 means "compress the heck out of the files, and I don't mind waiting a bit longer to get the job done." Unlike zip, gzip does not document a 0 setting for "no compression at all." The default is 6, but modern computers are fast enough that it's probably just fine to use 9 all the time.

   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   $ gzip -c -1 moby-dick.txt > moby-dick.txt.gz
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 571005 moby-dick.txt.gz
   $ gzip -c -9 moby-dick.txt > moby-dick.txt.gz
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 487585 moby-dick.txt.gz

Remember to use the -c option and redirect the output into the actual .gz file due to the way gzip works, as discussed in "Archive and Compress Files Using gzip."
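
If you want to see what the levels do to your own data, a short loop saves retyping. The sketch below is an illustration only: it uses a generated file of numbers as a stand-in for moby-dick.txt, so the sizes it prints will not match the figures above, and real prose will compress differently.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Generated text as a stand-in for a real book
seq 1 200000 > sample.txt
echo "original: $(wc -c < sample.txt | tr -d ' ') bytes"

for level in 1 6 9; do
    # -c plus redirection keeps sample.txt around for the next pass
    gzip -c "-$level" sample.txt > "sample.$level.gz"
    printf 'level %s: %s bytes\n' "$level" "$(wc -c < "sample.$level.gz" | tr -d ' ')"
done
```

Writing each result to its own file, rather than overwriting one .gz file, makes it easy to compare the sizes afterward.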

Note - If you want to be clever, define an alias in your .bashrc file that looks like this:

alias gzip='gzip -9'

That way, you'll always use -9 and won't have to think about it.

Uncompress Files Compressed with gzip

     gunzip

Getting files out of a gzipped archive is easy with the gunzip command.

   $ ls -l
   -rw-r--r-- scott scott 224425 paradise_lost.txt.gz
   $ gunzip paradise_lost.txt.gz
   $ ls -l
   -rw-r--r-- scott scott 508925 paradise_lost.txt

In the same way that gzip removes the original file, leaving you solely with the gzipped result, gunzip removes the .gz file, leaving you with the final gunzipped result. If you want to ensure that you have both, you need to use the -c option (or --stdout or --to-stdout) and redirect the results to the file you want to create.

   $ ls -l
   -rw-r--r-- scott scott 224425 paradise_lost.txt.gz
   $ gunzip -c paradise_lost.txt.gz > paradise_lost.txt
   $ ls -l
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   -rw-r--r-- scott scott 224425 paradise_lost.txt.gz

It's probably a good idea to use -c, especially if you plan to keep the .gz file around or pass it along to someone else. Sure, you could run gzip again and re-create the archive, but why go to the extra work?

Note - If you don't like the gunzip command, you can also use gzip -d (or --decompress or --uncompress).

Test Files That Will Be Unzipped with gunzip

     -t

Before gunzipping a file (or files) with gunzip, you might want to verify that they're going to gunzip correctly without any file corruption. To do this, use the -t (or --test) option.

   $ gzip -t paradise_lost.txt.gz
   $

That's right: If nothing is wrong with the archive, gzip reports nothing back to you. If there's a problem, you'll know, but if there's not a problem, gzip is silent. That can be a bit disconcerting, but that's how Unix-based systems work. They're generally only noisy if there's an issue you should know about, not if everything is working as it should.
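
Because a successful test is silent, scripts usually rely on the command's exit status instead of its output. The following sketch is an illustration with made-up file names: it builds one good archive, then simulates damage by keeping only the first 100 bytes of a copy.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

seq 1 5000 > numbers.txt
gzip -c numbers.txt > good.gz

# Simulate corruption: keep only the first 100 bytes of the archive
head -c 100 good.gz > damaged.gz

for archive in good.gz damaged.gz; do
    # gzip -t prints nothing on success, so check the exit status instead
    if gzip -t "$archive" 2>/dev/null; then
        echo "$archive: OK"
    else
        echo "$archive: FAILED the integrity test"
    fi
done
```

The error message is sent to /dev/null here only to keep the output short; drop that redirection when you want to read what gzip has to say.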

Archive and Compress Files Using bzip2

     bzip2

Working with bzip2 is pretty easy if you're comfortable with gzip, as the creators of bzip2 deliberately made the options and behavior of the new command as similar to its progenitor as possible.

   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   $ bzip2 moby-dick.txt
   $ ls -l
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

Just like gzip, bzip2 leaves you with just the .bz2 file. The original moby-dick.txt is gone. To keep the original file, use the -c (or --stdout) option and redirect the output to a filename that ends with .bz2.

   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   $ bzip2 -c moby-dick.txt > moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

If you look back at "Archive and Compress Files Using gzip," you'll see that gzip and bzip2 are incredibly similar, which is by design.

Get the Best Compression Possible with bzip2

     -[1-9]

Just as with zip and gzip, it's possible to adjust the level of compression that bzip2 uses when it does its job. The bzip2 command uses a scale from 1 to 9, but the number works a little differently: it sets the size of the blocks that bzip2 compresses, from 100KB at 1 to 900KB at 9. In practice, 1 means "use less memory, but don't compress quite as much," and 9 means "compress the heck out of the files." There is no 0 setting, and unlike zip and gzip, bzip2 already defaults to 9, so typing -9 simply makes your intent explicit.

   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   $ bzip2 -c -1 moby-dick.txt > moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 424084 moby-dick.txt.bz2
   $ bzip2 -c -9 moby-dick.txt > moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

From 424KB with -1 to 367KB with -9: that's quite a difference! Also notice the difference in ultimate file size between gzip and bzip2. At -9, gzip compressed moby-dick.txt down to 488KB, while bzip2 mashed it even further to 367KB. The bzip2 command is noticeably slower than the gzip command, but on a fast machine that means that bzip2 takes two or three seconds longer than gzip, which frankly isn't much to worry about.

Note - If you want to be clever, define an alias in your .bashrc file that looks like this:

alias bzip2='bzip2 -9'

That way, you'll always use -9 and won't have to think about it.
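
To compare gzip and bzip2 on your own files, run both at -9 and look at the sizes. This sketch is only an illustration: it uses a generated file of numbers rather than moby-dick.txt, so which tool wins and by how much may differ on your data, and it skips the bzip2 half if bzip2 is not installed.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Generated text as a stand-in for a real book
seq 1 200000 > sample.txt
echo "original: $(wc -c < sample.txt | tr -d ' ') bytes"

gzip -c -9 sample.txt > sample.txt.gz
echo "gzip -9:  $(wc -c < sample.txt.gz | tr -d ' ') bytes"

if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -c -9 sample.txt > sample.txt.bz2
    echo "bzip2 -9: $(wc -c < sample.txt.bz2 | tr -d ' ') bytes"
    # The compressed copy must decompress back to the identical original
    bunzip2 -c sample.txt.bz2 | cmp - sample.txt
else
    echo "bzip2 not installed; skipping" >&2
fi

gunzip -c sample.txt.gz | cmp - sample.txt
```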

Uncompress Files Compressed with bzip2

     bunzip2

In the same way that bzip2 was purposely designed to emulate gzip as closely as possible, the way bunzip2 works is very close to that of gunzip.

   $ ls -l
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2
   $ bunzip2 moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt

You'll notice that bunzip2 is similar to gunzip in another way: Both commands remove the original compressed file, leaving you with the final uncompressed result. If you want to ensure that you have both the compressed and uncompressed files, you need to use the -c option (or --stdout or --to-stdout) and redirect the results to the file you want to create.

   $ ls -l
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2
   $ bunzip2 -c moby-dick.txt.bz2 > moby-dick.txt
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

It's a good thing when commands copy each other's options and behavior, as it makes them easier to learn. In this, the creators of bzip2 and bunzip2 showed remarkable foresight.

Note - If you're not feeling favorable toward bunzip2, you can also use bzip2 -d (or --decompress or --uncompress).

Test Files That Will Be Unzipped with bunzip2

     -t

Before bunzipping a file (or files) with bunzip2, you might want to verify that they're going to bunzip correctly without any file corruption. To do this, use the -t (or --test) option.

   $ bunzip2 -t paradise_lost.txt.bz2
   $

Just as with gunzip, if there's nothing wrong with the archive, bunzip2 doesn't report anything back to you. If there's a problem, you'll know, but if there's not a problem, bunzip2 is silent.

Archive Files with tar

     -cf

Remember, tar doesn't compress; it merely archives (the resulting archives are known as tarballs, by the way). Instead, tar uses other programs, such as gzip or bzip2, to compress the archives that tar creates. Even if you're not going to compress the tarball, you still create it the same way with the same basic options: -c (or --create), which tells tar that you're making a tarball, and -f (or --file), which specifies the filename for the tarball.

   $ ls -l
   scott scott 102519 job.txt
   scott scott 1236574 moby-dick.txt
   scott scott 508925 paradise_lost.txt
   $ tar -cf moby.tar *.txt
   $ ls -l
   scott scott 102519 job.txt
   scott scott 1236574 moby-dick.txt
   scott scott 1853440 moby.tar
   scott scott 508925 paradise_lost.txt

Pay attention to two things here. First, add up the file sizes of job.txt, moby-dick.txt, and paradise_lost.txt, and you get 1848018 bytes. Compare that to the size of moby.tar, and you see that the tarball is only 5422 bytes bigger. Remember that tar is an archive tool, not a compression tool, so the result is at least the same size as the individual files put together, plus a little bit for overhead to keep track of what's in the tarball. Second, notice that tar, unlike gzip and bzip2, leaves the original files behind. This isn't a surprise, considering the tar command's background as a backup tool.
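
The same arithmetic can be left to the shell. This is a minimal sketch that uses small generated files with the chapter's file names, so the byte counts it prints will not match the ones above; it only shows how the overhead can be measured.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Small generated files borrowing the chapter's names
for name in job moby-dick paradise_lost; do
    seq 1 3000 > "$name.txt"
done

tar -cf moby.tar *.txt

# Add up the sizes of the individual files
total=0
for f in *.txt; do
    size=$(wc -c < "$f" | tr -d ' ')
    total=$((total + size))
done

tarball=$(wc -c < moby.tar | tr -d ' ')
echo "sum of files: $total bytes"
echo "tarball:      $tarball bytes"
echo "overhead:     $((tarball - total)) bytes"
```

Part of the overhead is padding: tar stores data in 512-byte blocks, so a tarball's size is always a multiple of 512.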

What's really cool about tar is that it's designed to archive entire directory structures, so you can bundle up a large number of files and subdirectories in one fell swoop.

   $ ls -lF
   drwxr-xr-x scott scott 168 moby-dick/
   $ ls -l moby-dick/*
   scott scott 102519 moby-dick/job.txt
   scott scott 1236574 moby-dick/moby-dick.txt
   scott scott 508925 moby-dick/paradise_lost.txt

   moby-dick/bible:
   scott scott 207254 genesis.txt
   scott scott 102519 job.txt
   $ tar -cf moby.tar moby-dick/
   $ ls -lF
   scott scott   168 moby-dick/
   scott scott 2170880 moby.tar

The tar command has been around forever, and it's obvious why: It's so darn useful! But it gets even more useful when you start factoring in compression tools, as you'll see in the next section.

Archive and Compress Files with tar and gzip

     -zcvf

If you look back at "Archive and Compress Files Using gzip" and "Archive and Compress Files Using bzip2" and think about what was discussed there, you'll probably start to figure out a problem. What if you want to compress a directory that contains 100 files, contained in various subdirectories? If you use gzip or bzip2 with the -r (for recursive) option, you'll end up with 100 individually compressed files, each stored neatly in its original subdirectory. This is undoubtedly not what you want. How would you like to attach 100 .gz or .bz2 files to an email? Yikes!

That's where tar comes in. First you'd use tar to archive the directory and its contents (those 100 files inside various subdirectories) and then you'd use gzip or bzip2 to compress the resulting tarball. Because gzip is the most common compression program used in concert with tar, we'll focus on that.

You could do it this way:

   $ ls -l moby-dick/*
   scott scott 102519 moby-dick/job.txt
   scott scott 1236574 moby-dick/moby-dick.txt
   scott scott 508925 moby-dick/paradise_lost.txt

   moby-dick/bible:
   scott scott 207254 genesis.txt
   scott scott 102519 job.txt
   $ tar -cf - moby-dick/ | gzip -c > moby.tar.gz
   $ ls -l
   scott scott 168 moby-dick/
   scott scott 846049 moby.tar.gz

That method works (the - after -f tells tar to write the tarball to standard output, where gzip can pick it up), but it's just too much typing! There's a much easier way that should be your default. It involves two new options for tar: -z (or --gzip), which invokes gzip from within tar so you don't have to do so manually, and -v (or --verbose), which isn't required here but is always useful, as it keeps you notified as to what tar is doing as it runs.

   $ ls -l moby-dick/*
   scott scott 102519 moby-dick/job.txt
   scott scott 1236574 moby-dick/moby-dick.txt
   scott scott 508925 moby-dick/paradise_lost.txt

   moby-dick/bible:
   scott scott 207254 genesis.txt
   scott scott 102519 job.txt
   $ tar -zcvf moby.tar.gz moby-dick/
   moby-dick/
   moby-dick/job.txt
   moby-dick/bible/
   moby-dick/bible/genesis.txt
   moby-dick/bible/job.txt
   moby-dick/moby-dick.txt
   moby-dick/paradise_lost.txt
   $ ls -l
   scott scott  168 moby-dick
   scott scott 846049 moby.tar.gz

The usual extension for a file that has had the tar and then the gzip commands used on it is .tar.gz; however, you could use .tgz and .tar.gzip if you like.
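
The long way and the short way should produce archives with the same contents. The sketch below is one way to check that, using a small generated directory tree. Writing the tarball to standard output with "-f -" is this sketch's way of feeding gzip, and the names piped.tar.gz and direct.tar.gz are made up for the comparison.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

mkdir -p moby-dick/bible
seq 1 5000 > moby-dick/moby-dick.txt
seq 1 2000 > moby-dick/bible/job.txt

# Long way: "-f -" sends the tarball to standard output for gzip to compress
tar -cf - moby-dick/ | gzip -c > piped.tar.gz

# Short way: -z tells tar to run gzip itself
tar -zcf direct.tar.gz moby-dick/

# Both archives should list exactly the same contents
tar -ztf piped.tar.gz  > piped.list
tar -ztf direct.tar.gz > direct.list
cmp piped.list direct.list && echo "same contents"
```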

Note - It's entirely possible to use bzip2 with tar instead of gzip. Your command would look like this (note the -j option, which is where bzip2 comes in):

     $ tar -jcvf moby.tar.bz2 moby-dick/

In that case, the extension should be .tar.bz2, although you may also use .tar.bzip2, .tbz2, or .tbz. Yes, it's very confusing that using gzip or bzip2 might both result in a file ending with .tbz. This is a strong argument for using anything but that particular extension to keep confusion to a minimum.

Test Files That Will Be Untarred and Uncompressed

     -zvtf

Before you take apart a tarball (whether or not it was also compressed using gzip), it's a really good idea to test it. First, you'll know if the tarball is corrupted, saving yourself hair pulling when files don't seem to work. Second, you'll know if the person who created the tarball thoughtfully tarred up a directory containing 100 files, or instead thoughtlessly tarred up 100 individual files, which you're just about to spew all over your desktop.

To test your tarball (once again assuming it was also zipped using gzip), use the -t (or --list) option.

   $ tar -zvtf moby.tar.gz
   scott/scott 0 moby-dick/
   scott/scott 102519 moby-dick/job.txt
   scott/scott 0 moby-dick/bible/
   scott/scott 207254 moby-dick/bible/genesis.txt
   scott/scott 102519 moby-dick/bible/job.txt
   scott/scott 1236574 moby-dick/moby-dick.txt
   scott/scott 508925 moby-dick/paradise_lost.txt

This tells you the permissions, ownership, file size, and time for each file. In addition, because every line begins with moby-dick/, you can see that you're going to end up with a directory that contains within it all the files and subdirectories that accompany the tarball, which is a relief.

Be sure that the -f is the last option because the name of the .tar.gz file has to come right after it. If you put another letter after -f, tar takes that letter as the filename, never sees the option you meant to give it, and complains:

   $ tar -zvft moby.tar.gz
   tar: You must specify one of the '-Acdtrux' options
   Try 'tar --help' or 'tar --usage' for more information.

Now that you've ensured that your .tar.gz file isn't corrupted, it's time to actually open it up, as you'll see in the following section.

Note - If you're testing a tarball that was compressed using bzip2, just use this command instead:

     $ tar -jvtf moby.tar.bz2

Untar and Uncompress Files

     -zxvf

To create a .tar.gz file, you used a set of options: -zcvf. To untar and uncompress the resulting file, you only make one substitution: -x (or --extract) for -c (or --create).

   $ ls -l
   rsgranne rsgranne 846049 moby.tar.gz
   $ tar -zxvf moby.tar.gz
   moby-dick/
   moby-dick/job.txt
   moby-dick/bible/
   moby-dick/bible/genesis.txt
   moby-dick/bible/job.txt
   moby-dick/moby-dick.txt
   moby-dick/paradise_lost.txt
   $ ls -l
   rsgranne rsgranne  168 moby-dick
   rsgranne rsgranne 846049 moby.tar.gz

Make sure you always test the file before you open it, as covered in the previous section, "Test Files That Will Be Untarred and Uncompressed." That means the order of commands you should run will look like this:

   $ tar -zvtf moby.tar.gz
   $ tar -zxvf moby.tar.gz
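
The two-step habit can be chained so the extraction runs only when the listing succeeds. This sketch is an illustration with placeholder files. The src and dest directories and the count of top-level entries are additions for the example, and it uses tar's -C option (change to a directory first), which this chapter does not otherwise cover.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Build a small tarball to practice on (contents are placeholders)
mkdir -p src/moby-dick/bible
echo "placeholder" > src/moby-dick/bible/job.txt
echo "placeholder" > src/moby-dick/moby-dick.txt
tar -zcf moby.tar.gz -C src moby-dick

# How many distinct top-level names are in the tarball?
# One means everything unpacks into a single tidy directory.
tops=$(tar -ztf moby.tar.gz | cut -d/ -f1 | sort -u | wc -l | tr -d ' ')
echo "top-level entries: $tops"

# Extract into dest/ only if tar can read the whole archive
mkdir dest
tar -ztf moby.tar.gz > /dev/null && tar -zxf moby.tar.gz -C dest
```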

Note - If you're opening a tarball that was compressed using bzip2, just use this command instead:

      $ tar -jxvf moby.tar.bz2

Conclusion

Back in the days of slow modems and tiny hard drives, archiving and compression was a necessity. These days, it's more of a convenience, but it's still something you'll find yourself using all the time. For instance, if you ever download source code to compile it, more than likely you'll find yourself face-to-face with a file such as sourcecode.tar.gz. In the future, you'll probably see more and more of those files ending with .tar.bz2. And if you exchange files with Windows users, you're going to run into files that end with .zip. Learn how to use your archival and compression tools because you're going to be using them far more than you think.

About the Author:

Scott Granneman is a monthly columnist for SecurityFocus and Linux Magazine, as well as a professional blogger on The Open Source Weblog. He is an adjunct Professor at Washington University, St. Louis and at Webster University, teaching a variety of courses about technology and the Internet.


         "Linux Phrasebook" by Scott Granneman
         ISBN: 0-672-32838-0
         http://www.samspublishing.com/bookstore/product.asp?isbn=0672328380&rl=1
         © Copyright Pearson Education.  All rights reserved.
         Chapter excerpt provided by Sams Publishing an imprint of Pearson Education

         Reprinted with permission.
      
