Archiving and Compression

This excerpt comes from Chapter 8 of Scott Granneman's new book "Linux Phrasebook", the pocket guide every Linux user needs. Linux Phrasebook offers a concise reference that, like a language phrasebook, can be used "in the street." The book goes straight to practical Linux uses, providing immediate solutions for day-to-day tasks.
Unzip Files

     unzip

Expanding a Zip archive isn't hard at all. To create a zipped archive, use the zip command; to expand that archive, use the unzip command.
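
For instance, the moby.zip archive used in this section could have been built with a command along these lines (the exact file list here is an assumption based on the output below):

   $ zip moby.zip job.txt moby-dick.txt paradise_lost.txt
     adding: job.txt (deflated 65%)
     adding: moby-dick.txt (deflated 61%)
     adding: paradise_lost.txt (deflated 56%)

Unzipping it again is just as straightforward: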

   $ unzip moby.zip
   Archive: moby.zip
   inflating: job.txt
   inflating: moby-dick.txt
   inflating: paradise_lost.txt

The unzip command helpfully tells you what it's doing as it works. To get even more information, add the -v option (which stands, of course, for verbose).

   $ unzip -v moby.zip
   Archive: moby.zip
   Length   Method  Size   Ratio  CRC-32   Name
   -------  ------  ------ -----  ------   ----
   102519   Defl:X   35747  65%  fabf86c9  job.txt
   1236574  Defl:X  487553  61%  34a8cc3a  moby-dick.txt
   508925   Defl:X  224004  56%  6abe1d0f  paradise_lost.txt
   -------          ------  ---            -------
   1848018          747304  60%            3 files

There's quite a bit of useful data here, including the method used to compress the files, the ratio of original to compressed file size, and the cyclic redundancy check (CRC) used to detect errors in the compressed data.
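
To see where those ratios come from, take job.txt: it shrank from 102,519 bytes to 35,747 bytes, and 1 - (35747 / 102519) is roughly 0.65, which unzip reports as the 65% you see in the table.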

List Files That Will Be Unzipped

     -l

Sometimes you might find yourself looking at a Zip file and not remembering what's in that file. Or perhaps you want to make sure that a file you need is contained within that Zip file. To list the contents of a zip file without unzipping it, use the -l option (which stands for "list").

   $ unzip -l moby.zip
   Archive: moby.zip
   Length     Date    Time   Name
   --------   ----    ----   ----
         0  01-26-06  18:40  bible/
    207254  01-26-06  18:40  bible/genesis.txt
    102519  01-26-06  18:19  bible/job.txt
   1236574  01-26-06  18:19  moby-dick.txt
    508925  01-26-06  18:19  paradise_lost.txt
   --------                  -------
   2055272                   5 files

From these results, you can see that moby.zip contains two files, moby-dick.txt and paradise_lost.txt, and a directory (bible), which itself contains two files, genesis.txt and job.txt. Now you know exactly what will happen when you expand moby.zip. Using the -l option helps prevent inadvertently unzipping an archive that spews out 100 loose files rather than one that unpacks into a single directory containing those 100 files. The first leaves you with files strewn pell-mell around your working directory, while the second is far easier to handle.
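
If you do end up facing an archive that unpacks as a loose pile of files, one way to keep things contained (not covered in this excerpt, though the -d option is a standard part of unzip) is to extract everything into a directory of your choosing:

   $ unzip moby.zip -d moby

That puts the archive's contents under the moby directory instead of scattering them across whatever directory you happen to be in.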

Test Files That Will Be Unzipped

     -t

Sometimes zipped archives become corrupted. The worst time to discover this is after you've unzipped the archive and deleted it, only to find that some or even all of the extracted contents are damaged and won't open. It's better to test the archive before you actually unzip it by using the -t (for test) option.

   $ unzip -t moby.zip
   Archive: moby.zip
   testing: bible/               OK
   testing: bible/genesis.txt    OK
   testing: bible/job.txt        OK
   testing: moby-dick.txt        OK
   testing: paradise_lost.txt    OK
   No errors detected in compressed data of moby.zip.

You really should use -t every time you work with a zipped file. It's the smart thing to do, and although it might take some extra time, it's worth it in the end.
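
If you want to make the test automatic, here's a minimal sketch (my own, not from the book) that chains the two commands so the archive is expanded only when the test reports no errors:

   $ unzip -t moby.zip && unzip moby.zip

The && tells the shell to run the second unzip only if the first one, the test, exits successfully.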

Archive and Compress Files Using gzip

     gzip

Using gzip is a bit easier than zip in some ways. With zip, you need to specify the name of the newly created Zip file or zip won't work; with gzip, though, you can just type the command and the name of the file you want to compress.

   $ ls -l
   -rw-r--r-- 1 scott scott 508925 paradise_lost.txt
   $ gzip paradise_lost.txt
   $ ls -l
   -rw-r--r-- 1 scott scott 224425 paradise_lost.txt.gz

You should be aware of a very big difference between zip and gzip: When you zip a file, zip leaves the original behind so you have both the original and the newly zipped file, but when you gzip a file, you're left with only the new gzipped file. The original is gone.
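
The good news is that the original isn't gone for good: running gunzip (or, equivalently, gzip -d) on the compressed file reverses the process in the same replace-the-file fashion, which would look something like this:

   $ gunzip paradise_lost.txt.gz
   $ ls -l
   -rw-r--r-- 1 scott scott 508925 paradise_lost.txt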

If you want gzip to leave the original file behind, you need to use the -c (or --stdout or --to-stdout) option, which sends the results of gzip to standard output, and then redirect that output to a file. If you use -c and forget to redirect the output, gzip dumps a screen full of binary gibberish straight into your terminal.

Not good. Instead, redirect the output to a file.

   $ ls -l
   -rw-r--r-- 1 scott scott 508925 paradise_lost.txt
   $ gzip -c paradise_lost.txt > paradise_lost.txt.gz
   $ ls -l
   -rw-r--r-- 1 scott scott 508925 paradise_lost.txt
   -rw-r--r-- 1 scott scott 224425 paradise_lost.txt.gz

Much better! Now you have both your original file and the zipped version.
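
Depending on how new your version of gzip is, there may be an even shorter route: recent releases add a -k (or --keep) option that keeps the original file automatically, so something like the following should have the same effect (check your man page first, since older versions don't have it):

   $ gzip -k paradise_lost.txt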

Tip: If you accidentally use the -c option without specifying an output file, just press Ctrl+C, several times if necessary, until gzip stops.

______________________

Comments

Unzipping Password Protected Zips

Anonymous's picture

You left out how to unzip ZIP files that are password protected in Linux. I'm searching for this elusive bit of information on the internet right now...

Adding password-protected files from PHP code

Farrukh Shahzad's picture

I couldn't find anything on the internet about adding password-protected files to a zip from PHP code... then I came across your article, and it gave me the idea: why not have PHP issue a system command to add files to the zip and even password-protect them? ;)

RAR

Amelia's picture

RAR is good and free too. It supports passwords and can make SFX archives.

No mention of lzma?

Brian Cain's picture

How about rzip or lzma? I recall an article in the print edition within the last ten or eleven issues that compared the CPU overhead of each compression method against compression ratios (and possibly other parameters). Anyway, rzip is memory and CPU intensive, IIRC, but has the potential to make enormous savings. I think it's the same as Burrows-Wheeler over larger data sets, possibly. Worthwhile for stuff that won't be frequently decompressed, IMO.

rzip

Anonymous's picture

Actually, the rzip levels are search buffer sizes:

   -0 = 100MB
   -1 = 100MB
   -x = x00MB for x > 0 and x <= 9

CPU intensive? Well, that depends. I hacked the bzip2 compression hooks out of rzip, and it's one of the fastest pre-archiving filters, with the best compression ratio, for a mysql dump of a dbmail database.

Yup, I found a bug, but only in the decompression algorithm, not in the data itself. And yes, I got Andrew to fix it.

Correction to wording

DAKH's picture

Scott,

In the section "Archive Files with tar", paragraph 3, you state that tar is "designed to compress entire directory structures". I think this should read "designed to archive...", since this section deals only with tar's standalone use as an archival tool and since this article/chapter is intended to highlight the difference between archiving and compressing. Other than that, this is a very handy primer on archiving and compressing in *nix.

bzip2 -9

Chris Thompson's picture

The article states that the default block size for bzip2 is -6. The man page for my system (Ubuntu 6.06) states that -9 is the default, and I am unaware of any system where -6 is the default.

Making -9 the default

Craig Buchek's picture

An easier way to default to the best (-9) compression level would be to export GZIP='-9' and ZIPOPTS='-9' into your environment.
