Archiving and Compression

Chapter 8 from Scott Granneman's new book "Linux Phrasebook", the pocket guide every Linux user needs. Linux Phrasebook offers a concise reference that, like a language phrasebook, can be used "in the street." The book goes straight to practical Linux uses, providing immediate solutions for day-to-day tasks.
Test Files That Will Be Unzipped with bunzip2

     -t

Before bunzipping a file (or files) with bunzip2, you might want to verify that they're going to decompress correctly, without any file corruption. To do this, use the -t (or --test) option.

   $ bunzip2 -t paradise_lost.txt.bz2
   $

Just as with gunzip, if there's nothing wrong with the archive, bunzip2 doesn't report anything back to you. It speaks up only when it finds a problem.
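
Because a successful test prints nothing, a script has to rely on the exit status instead. The following is a minimal sketch of that idea, not part of the original chapter: the temporary directory, the sample text, and the truncated copy used to simulate corruption are all invented for illustration.

```shell
#!/bin/sh
# Throwaway working directory and sample file (both hypothetical).
work=$(mktemp -d)
echo "Of Man's first disobedience, and the fruit" > "$work/paradise_lost.txt"
bzip2 "$work/paradise_lost.txt"          # leaves paradise_lost.txt.bz2

# A clean archive: bunzip2 -t is silent and exits with status 0.
if bunzip2 -t "$work/paradise_lost.txt.bz2"; then
    good_status=ok
else
    good_status=corrupt
fi

# Simulate damage by keeping only the first 20 bytes of the archive.
head -c 20 "$work/paradise_lost.txt.bz2" > "$work/broken.txt.bz2"
if bunzip2 -t "$work/broken.txt.bz2" 2>/dev/null; then
    bad_status=ok
else
    bad_status=corrupt
fi

echo "intact archive:    $good_status"
echo "truncated archive: $bad_status"
```

The error message from the damaged file is sent to /dev/null here only to keep the output short; in real use you would want to read it.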

Archive Files with tar

     -cf

Remember, tar doesn't compress; it merely archives (the resulting archives are known as tarballs, by the way). Instead, tar uses other programs, such as gzip or bzip2, to compress the archives that tar creates. Even if you're not going to compress the tarball, you still create it the same way with the same basic options: -c (or --create), which tells tar that you're making a tarball, and -f (or --file), which specifies the filename for the tarball.

   $ ls -l
   scott scott 102519 job.txt
   scott scott 1236574 moby-dick.txt
   scott scott 508925 paradise_lost.txt
   $ tar -cf moby.tar *.txt
   $ ls -l
   scott scott 102519 job.txt
   scott scott 1236574 moby-dick.txt
   scott scott 1853440 moby.tar
   scott scott 508925 paradise_lost.txt

Pay attention to two things here. First, add up the file sizes of job.txt, moby-dick.txt, and paradise_lost.txt, and you get 1848018 bytes. Compare that to the size of moby.tar, and you see that the tarball is only 5422 bytes bigger. Remember that tar is an archive tool, not a compression tool, so the result is at least the same size as the individual files put together, plus a little bit for overhead to keep track of what's in the tarball. Second, notice that tar, unlike gzip and bzip2, leaves the original files behind. This isn't a surprise, considering the tar command's background as a backup tool.
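
Where do those 5422 extra bytes come from? The rough accounting below is a sketch, assuming GNU tar's defaults: a 512-byte header per file, file data padded to whole 512-byte blocks, two empty blocks marking the end of the archive, and the total padded out to a 10240-byte record. Under those assumptions the arithmetic reproduces the size shown in the listing.

```shell
#!/bin/sh
# File sizes copied from the ls -l listing above.
sizes="102519 1236574 508925"
block=512       # tar works in 512-byte blocks
record=10240    # assumed GNU tar default record size (20 blocks)

payload=0
archive=0
for size in $sizes; do
    payload=$((payload + size))
    # One header block, plus the data rounded up to whole blocks.
    data_blocks=$(( (size + block - 1) / block ))
    archive=$((archive + block + data_blocks * block))
done

archive=$((archive + 2 * block))                          # end-of-archive marker
archive=$(( (archive + record - 1) / record * record ))   # pad to a full record

echo "payload:  $payload"                 # 1848018
echo "archive:  $archive"                 # 1853440
echo "overhead: $((archive - payload))"   # 5422
```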

What's really cool about tar is that it's designed to archive entire directory structures, so you can bundle a large number of files and subdirectories in one fell swoop.

   $ ls -lF
   drwxr-xr-x scott scott 168 moby-dick/
   $ ls -l moby-dick/*
   scott scott 102519 moby-dick/job.txt
   scott scott 1236574 moby-dick/moby-dick.txt
   scott scott 508925 moby-dick/paradise_lost.txt

   moby-dick/bible:
   scott scott 207254 genesis.txt
   scott scott 102519 job.txt
   $ tar -cf moby.tar moby-dick/
   $ ls -lF
   scott scott   168 moby-dick/
   scott scott 2170880 moby.tar
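
To confirm that a tarball really has captured the whole directory tree, you can list its contents with -t (or --list) in place of -c. This is a minimal sketch using a throwaway directory; the file contents and the temporary path are invented for illustration, and -C simply tells tar which directory to work from.

```shell
#!/bin/sh
# Build a small directory tree that mirrors the layout in the listing.
work=$(mktemp -d)
mkdir -p "$work/moby-dick/bible"
echo "Call me Ishmael."  > "$work/moby-dick/moby-dick.txt"
echo "In the beginning"  > "$work/moby-dick/bible/genesis.txt"

# Archive the directory, then list what went into the tarball.
tar -cf "$work/moby.tar" -C "$work" moby-dick
listing=$(tar -tf "$work/moby.tar" | sort)

echo "$listing"
```

The listing shows the subdirectory and its files with their paths intact, which is what lets tar rebuild the same structure when the archive is extracted.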

The tar command has been around forever, and it's obvious why: It's so darn useful! But it gets even more useful when you start factoring in compression tools, as you'll see in the next section.

Archive and Compress Files with tar and gzip

     -zcvf

If you look back at "Archive and Compress Files Using gzip" and "Archive and Compress Files Using bzip2" and think about what was discussed there, you'll probably start to see a problem. What if you want to compress a directory that contains 100 files, spread across various subdirectories? If you use gzip with the -r (for recursive) option, you'll end up with 100 individually compressed files, each stored neatly in its original subdirectory (bzip2 has no recursive option, but running it on every file, for instance through find, gives the same result). This is undoubtedly not what you want. How would you like to attach 100 .gz or .bz2 files to an email? Yikes!
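
A small experiment makes the problem concrete. This is only a sketch: the directory layout and file names are made up, and two files stand in for the 100 in the example.

```shell
#!/bin/sh
# A throwaway directory with one file at the top and one in a subdirectory.
demo=$(mktemp -d)
mkdir -p "$demo/docs/sub"
echo "one" > "$demo/docs/a.txt"
echo "two" > "$demo/docs/sub/b.txt"

# Recursive gzip compresses each file in place; it does not bundle them.
gzip -r "$demo/docs"

# Count what is left: one .gz file per original file.
gz_count=$(find "$demo/docs" -type f -name '*.gz' | wc -l | tr -d ' ')
echo "compressed files: $gz_count"
find "$demo/docs" -type f | sort
```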

That's where tar comes in. First you'd use tar to archive the directory and its contents (those 100 files inside various subdirectories) and then you'd use gzip or bzip2 to compress the resulting tarball. Because gzip is the most common compression program used in concert with tar, we'll focus on that.

You could do it this way (the - after -f tells tar to write the tarball to standard output, so it can be piped straight into gzip):

   $ ls -l moby-dick/*
   scott scott 102519 moby-dick/job.txt
   scott scott 1236574 moby-dick/moby-dick.txt
   scott scott 508925 moby-dick/paradise_lost.txt

   moby-dick/bible:
   scott scott 207254 genesis.txt
   scott scott 102519 job.txt
   $ tar -cf - moby-dick/ | gzip -c > moby.tar.gz
   $ ls -l
   scott scott 168 moby-dick/
   scott scott 846049 moby.tar.gz

That method works, but it's just too much typing! There's a much easier way that should be your default. It involves two new options for tar: -z (or --gzip), which invokes gzip from within tar so you don't have to do so manually, and -v (or --verbose), which isn't required here but is always useful, as it keeps you notified as to what tar is doing as it runs.

   $ ls -l moby-dick/*
   scott scott 102519 moby-dick/job.txt
   scott scott 1236574 moby-dick/moby-dick.txt
   scott scott 508925 moby-dick/paradise_lost.txt

   moby-dick/bible:
   scott scott 207254 genesis.txt
   scott scott 102519 job.txt
   $ tar -zcvf moby.tar.gz moby-dick/
   moby-dick/
   moby-dick/job.txt
   moby-dick/bible/
   moby-dick/bible/genesis.txt
   moby-dick/bible/job.txt
   moby-dick/moby-dick.txt
   moby-dick/paradise_lost.txt
   $ ls -l
   scott scott  168 moby-dick
   scott scott 846049 moby.tar.gz

The usual extension for a file that has had the tar and then the gzip commands used on it is .tar.gz; however, you could use .tgz and .tar.gzip if you like.

Note - It's entirely possible to use bzip2 with tar instead of gzip. Your command would look like this (note the -j option, which is where bzip2 comes in):

     $ tar -jcvf moby.tar.bz2 moby-dick/

In that case, the extension should be .tar.bz2, although you may also use .tar.bzip2, .tbz2, or .tbz. Yes, it's very confusing that using gzip or bzip2 might both result in a file ending with .tbz. This is a strong argument for using anything but that particular extension to keep confusion to a minimum.
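
When an extension leaves you guessing, the first few bytes of the file settle the matter: gzip output begins with the bytes 1f 8b, and bzip2 output begins with the characters BZh. The helper below is a hypothetical sketch (the function name sniff and the archive names are made up), not something from the original chapter.

```shell
#!/bin/sh
# Build one gzip tarball and one bzip2 tarball with uninformative names.
work=$(mktemp -d)
mkdir "$work/moby-dick"
echo "Call me Ishmael." > "$work/moby-dick/moby-dick.txt"
tar -zcf "$work/first.archive"  -C "$work" moby-dick
tar -jcf "$work/second.archive" -C "$work" moby-dick

# Identify the compression from the first three bytes, shown as hex.
sniff() {
    magic=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        1f8b*)  echo gzip ;;      # gzip magic number
        425a68) echo bzip2 ;;     # "BZh" in hex
        *)      echo unknown ;;
    esac
}

echo "first.archive:  $(sniff "$work/first.archive")"
echo "second.archive: $(sniff "$work/second.archive")"
```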

______________________

Comments

Unzipping Password Protected Zips

Anonymous

You left out how to unzip ZIP files that are password protected in Linux. I'm searching for this elusive bit of information on the internet right now...

Password protectedly adding files by PHP code was not found

Farrukh Shahzad

Password protectedly adding files by PHP code was not found on the internet when i was searching for it... so i come across your article and it gave me the idea to why not issue a system command by php to add files in zip and even protect the files by password ;)

RAR

Amelia

RAR is good and free too. It supports passwords and can make SFX archives.

No mention of lzma?

Brian Cain

How about rzip or lzma? I recall an article in the print edition within the last ten or eleven issues that compared the cpu overhead of each compression method against compression ratios (and possibly other parameters). Anyways, rzip is memory and cpu intensive, IIRC, but has the potential to make enormous savings. I think it's the same as burrows-wheeler over larger data sets, possibly. Worthwhile for stuff that won't be frequently decompressed, IMO.

rzip

Anonymous

actually rzip levels are in search buffer sizes:

-0 = 100MB
-1 = 100MB
-x = x00MB for x>0 and x<=9

cpu intensive? well depends. I hacked bzip2 compression hooks out of the rzip and it's one of the fastest pre archiving filters with best compression ratio for mysql dump of dbmail database.

yup found bug but only in decompression algorithm - not the data itself. yes - made Andrew to fix it.

Correction to wording

DAKH

Scott,

In the section "Archive Files with tar", paragraph 3, you state that tar is "designed to compress entire directory structures". I think this should read "designed to archive...", since this section deals only with tar's standalone use as an archival tool and since this article/chapter is intended to highlight the difference between archiving and compressing. Other than that, this is a very handy primer on archiving and compressing in *nix.

bzip2 -9

Chris Thompson

The article states that the default block size for bzip2 is -6. The man page for my system (Ubuntu 6.06) states that -9 is the default, and I am unaware of any system where -6 is the default.

TROGDOR STRIKES AGAIN!

TROGDOR

Making -9 the default

Craig Buchek

An easier way to default to the best (-9) compression level would be to export GZIP='-9' and ZIPOPTS='-9' into your environment.
