Archiving and Compression

 in
Chapter 8 from Scott Granneman's new book "Linux Phrasebook", the pocket guide every linux user needs. Linux Phrasebook offers a concise reference that, like a language phrasebook, can be used "in the street." The book goes straight to practical Linux uses, providing immediate solutions for day-to-day tasks.
Get the Best Compression Possible with zip

     -[0-9]

It's possible to adjust the level of compression that zip uses when it does its job. The zip command uses a scale from 0 to 9, in which 0 means "no compression at all" (which is like tar, as you'll see later), 1 means "do the job quickly, but don't bother compressing very much," and 9 means "compress the heck out of the files, and I don't mind waiting a bit longer to get the job done." The default is 6, but modern computers are fast enough that it's probably just fine to use 9 all the time.

Say you're interested in researching Herman Melville's Moby-Dick, so you want to collect key texts to help you understand the book: Moby-Dick itself, Milton's Paradise Lost, and the Bible's book of Job. Let's compare the results of different compression rates.

   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   $ zip -0 moby.zip *.txt
   adding: job.txt (stored 0%)
   adding: moby-dick.txt (stored 0%)
   adding: paradise_lost.txt (stored 0%)
   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 1848444 moby.zip
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   $ zip -1 moby.zip *txt
   updating: job.txt (deflated 58%)
   updating: moby-dick.txt (deflated 54%)
   updating: paradise_lost.txt (deflated 50%)
   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 869946 moby.zip
   -rw-r--r-- scott scott 508925 paradise_lost.txt
   $ zip -9 moby.zip *txt
   updating: job.txt (deflated 65%)
   updating: moby-dick.txt (deflated 61%)
   updating: paradise_lost.txt (deflated 56%)
   $ ls -l
   -rw-r--r-- scott scott 102519 job.txt
   -rw-r--r-- scott scott 1236574 moby-dick.txt
   -rw-r--r-- scott scott 747730 moby.zip
   -rw-r--r-- scott scott 508925 paradise_lost.txt

In tabular format, the results look like this:

Bookzip -0zip -1zip -9
Moby-Dick0%54%61%
Paradise Lost0%50%56%
Job0%58%65%
Total (in bytes)1848444869946747730

The results you see here would vary depending on the file types (text files typically compress well) and the sizes of the original files, but this gives you a good idea of what you can expect. Unless you have a really slow machine or you're just naturally impatient, you should just use -9 all the time to get the maximum compression.

Note - If you want to be clever, define an alias in your .bashrc file that looks like this:

alias zip='zip -9'

That way you'll always use -9 and won't have to think about it.

Password-Protect Compressed Zip Archives

     -P

     -e

The Zip program allows you to password-protect your Zip archives using the -P option. You shouldn't use this option. It's completely insecure, as you can see in the following example (the actual password is 12345678):

   $ zip -P 12345678 moby.zip *.txt

Because you had to specify the password on the command line, anyone viewing your shell's history (and you might be surprised how easy it is for other users to do so) can see your password in all its glory. Don't use the -P option!

Instead, just use the -e option, which encrypts the contents of your Zip file and also uses a password. The difference, however, is that you're prompted to type the password in, so it won't be saved in the history of your shell events.

   $ zip -e moby.zip *.txt
   Enter password:
   Verify password:
   adding: job.txt (deflated 65%)
   adding: moby-dick.txt (deflated 61%)
   adding: paradise_lost.txt (deflated 56%)

The only part of this that's saved in the shell is zip -e moby.zip *.txt. The actual password you type disappears into the ether, unavailable to anyone viewing your shell history.

Caution - The security offered by the Zip program's password protection isn't that great. In fact, it's pretty easy to find a multitude of tools floating around the Internet that can quickly crack a password-protected Zip archive. Think of password-protecting a Zip file as the difference between writing a message on a postcard and sealing it in an envelope: It's good enough for ordinary folks, but it won't stop a determined attacker.

Also, the version of zip included with some Linux distros may not support encryption, in which case you'll see a zip error: "encryption not supported." The only solution: recompile zip from source. Ugh.

______________________

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Unzipping Password Protected Zips

Anonymous's picture

You left out how to unzip ZIP files that are password protected in Linux. I'm searching for this elusive bit of information on the internet right now...

Password protectedly adding files by PHP code was not found

Farrukh Shahzad's picture

Password protectedly adding files by PHP code was not found on the internet when i was searching for it... so i come across your article and it gave me the idea to why not issue a system command by php to add files in zip and even protect the files by password ;)

RAR

Amelia's picture

RAR is good and free too. It supports passwords and can make SFX archives.

No mention of lzma?

Brian Cain's picture

How about rzip or lzma? I recall an article in the print edition within the last ten or eleven issues that compared the cpu overhead of each compression method against compression ratios (and possibly other parameters). Anyways, rzip is memory and cpu intensive, IIRC, but has the potential to make enormous savings. I think it's the same as burrows-wheeler over larger data sets, possibly. Worthwhile for stuff that won't be frequently decompressed, IMO.

rzip

Anonymous's picture

actually rzip levels are in search buffer sizes:

-0 = 100MB
-1 = 100MB
-x = x00MB for x>0 and x<=9

cpu intensive? well depends. I hacked bzip2 compression hooks out of the rzip and it's one of the fastest pre archiving filters with best compression ratio for mysql dump of dbmail database.

yup found bug but only in decompression algorithm - not the data itself. yes - made Andrew to fix it.

Correction to wording

DAKH's picture

Scott,

In the section "Archive Files with tar", paragraph 3, you state that tar is "designed to compress entire directory structures". I think this should read "designed to archive...", since this section deals only with tar's standalone use as an archival tool and since this article/chapter is intended to highlight the difference between archiving and compressing. Other than that, this is a very handy primer on archiving and compressing in *nix.

bzip2 -9

Chris Thompson's picture

The article states that the default block size for bzip2 is -6. The man page for my system (Ubuntu 6.06) states that -9 is the default, and I am unaware of any system where -6 is the default.

TROGDOR STRIKES AGAIN!

TROGDOR's picture

Making -9 the default

Craig Buchek's picture

An easier way to default to the best (-9) compression level would be to export GZIP='-9' and ZIPOPTS='-9' into your environment.

White Paper
Fabric-Based Computing Enables Optimized Hyperscale Data Centers

Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.

Learn More

Sponsored by AMD

White Paper
Red Hat White Paper: Using an Open Source Framework to Catch the Bad Guy

Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6

Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.

Learn more about catching the bad guy in this free white paper.

Learn More

Sponsored by DLT Solutions