Archiving and Compression

 in
Chapter 8 from Scott Granneman's new book "Linux Phrasebook", the pocket guide every linux user needs. Linux Phrasebook offers a concise reference that, like a language phrasebook, can be used "in the street." The book goes straight to practical Linux uses, providing immediate solutions for day-to-day tasks.
Test Files That Will Be Unzipped with gunzip

     -t

Before gunzipping a file (or files) with gunzip, you might want to verify that they're going to gunzip correctly without any file corruption. To do this, use the -t (or --test) option.

   $ gzip -t paradise_lost.txt.gz
   $

That's right: If nothing is wrong with the archive, gzip reports nothing back to you. If there's a problem, you'll know, but if there's not a problem, gzip is silent. That can be a bit disconcerting, but that's how Unix-based systems work. They're generally only noisy if there's an issue you should know about, not if everything is working as it should.

Archive and Compress Files Using bzip2

     bzip2

Working with bzip2 is pretty easy if you're comfortable with gzip, as the creators of bzip2 deliberately made the options and behavior of the new command as similar to its progenitor as possible.

   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   $ bzip2 moby-dick.txt
   $ ls -l
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

Just like gzip, bzip2 leaves you with just the .bz2 file. The original moby-dick.txt is gone. To keep the original file, use the -c (or --stdout) option and pipe the output to a filename that ends with .bz2.

   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   $ bzip2 -c moby-dick.txt > moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

If you look back at "Archive and Compress Files Using gzip," you'll see that gzip and bzip2 are incredibly similar, which is by design.

Get the Best Compression Possible with bzip2

     -[0-9]

Just as with zip and gzip, it's possible to adjust the level of compression that bzip2 uses when it does its job. The bzip2 command uses a scale from 0 to 9, in which 0 means "no compression at all" (which is like tar, as you'll see later), 1 means "do the job quickly, but don't bother compressing very much," and 9 means "compress the heck out of the files, and I don't mind waiting a bit longer to get the job done." The default is 6, but modern computers are fast enough that it's probably just fine to use 9 all the time.

   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   $ bzip2 -c -1 moby-dick.txt > moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 424084 moby-dick.txt.bz2
   $ bzip2 -c -9 moby-dick.txt > moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

From 424KB with 1 to 367KB with 9 — that's quite a difference! Also notice the difference in ultimate file size between gzip and bzip2. At -9, gzip compressed moby-dick.txt down to 488KB, while bzip2 mashed it even further to 367KB. The bzip2 command is noticeably slower than the gzip command, but on a fast machine that means that bzip2 takes two or three seconds longer than gzip, which frankly isn't much to worry about.

Note - If you want to be clever, define an alias in your .bashrc file that looks like this:

alias bzip2='bzip2 -9'

That way, you'll always use -9 and won't have to think about it.

Uncompress Files Compressed with bzip2

     bunzip2

In the same way that bzip2 was purposely designed to emulate gzip as closely as possible, the way bunzip2 works is very close to that of gunzip.

   $ ls -l
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2
   $ bunzip2 moby-dick.txt.bz2
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt

You'll notice that bunzip2 is similar to gunzip in another way: Both commands remove the original compressed file, leaving you with the final uncompressed result. If you want to ensure that you have both the compressed and uncompressed files, you need to use the -c option (or --stdout or --to-stdout) and pipe the results to the file you want to create.

   $ ls -l
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2
   $ bunzip2 -c moby-dick.txt.bz2 > moby-dick.txt
   $ ls -l
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 367248 moby-dick.txt.bz2

It's a good thing when commands copy each other's options and behavior, as it makes them easier to learn. In this, the creators of bzip2 and bunzip2 showed remarkable foresight.

Note - If you're not feeling favorable toward bunzip2, you can also use bzip2 -d (or --decompress or --uncompress).

______________________

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Unzipping Password Protected Zips

Anonymous's picture

You left out how to unzip ZIP files that are password protected in Linux. I'm searching for this elusive bit of information on the internet right now...

Password protectedly adding files by PHP code was not found

Farrukh Shahzad's picture

Password protectedly adding files by PHP code was not found on the internet when i was searching for it... so i come across your article and it gave me the idea to why not issue a system command by php to add files in zip and even protect the files by password ;)

RAR

Amelia's picture

RAR is good and free too. It supports passwords and can make SFX archives.

No mention of lzma?

Brian Cain's picture

How about rzip or lzma? I recall an article in the print edition within the last ten or eleven issues that compared the cpu overhead of each compression method against compression ratios (and possibly other parameters). Anyways, rzip is memory and cpu intensive, IIRC, but has the potential to make enormous savings. I think it's the same as burrows-wheeler over larger data sets, possibly. Worthwhile for stuff that won't be frequently decompressed, IMO.

rzip

Anonymous's picture

actually rzip levels are in search buffer sizes:

-0 = 100MB
-1 = 100MB
-x = x00MB for x>0 and x<=9

cpu intensive? well depends. I hacked bzip2 compression hooks out of the rzip and it's one of the fastest pre archiving filters with best compression ratio for mysql dump of dbmail database.

yup found bug but only in decompression algorithm - not the data itself. yes - made Andrew to fix it.

Correction to wording

DAKH's picture

Scott,

In the section "Archive Files with tar", paragraph 3, you state that tar is "designed to compress entire directory structures". I think this should read "designed to archive...", since this section deals only with tar's standalone use as an archival tool and since this article/chapter is intended to highlight the difference between archiving and compressing. Other than that, this is a very handy primer on archiving and compressing in *nix.

bzip2 -9

Chris Thompson's picture

The article states that the default block size for bzip2 is -6. The man page for my system (Ubuntu 6.06) states that -9 is the default, and I am unaware of any system where -6 is the default.

TROGDOR STRIKES AGAIN!

TROGDOR's picture

Making -9 the default

Craig Buchek's picture

An easier way to default to the best (-9) compression level would be to export GZIP='-9' and ZIPOPTS='-9' into your environment.

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState