Primer on zip, unzip, pkzip.

As much as we all love Linux, it is nevertheless true that occasionally we must force ourselves to deal with the DOS/MS-Windows world, however indirectly. For some of us that involves having a dual-boot system (perhaps via LILO—the LInux LOader—or OS/2's Boot Manager), but even those of us who manage to avoid that fate will sooner or later come across files that originated on some flavor of DOS or Windows system. More than likely, a few of those files will end in .zip—and that's where the unzip command comes in.

unzip is a free utility to process zipfiles, as these things are generally called. Zipfiles are actually archives of one or more other files, almost always compressed to save disk space and/or transmission time. In this regard they are similar to compressed tar archives, which are those files usually ending in .tar.Z, .tar.gz or .tgz that one finds on most Linux ftp sites and many CD-ROM distributions. One major difference between zipfiles and tar archives: compressed tar archives bundle all of the files together and then compress the result as a single entity, whereas zipfiles compress each file individually and then store the results in the archive. The zipfile method usually achieves somewhat less overall compression, but it does allow you to list the archive's contents and to extract individual files without decompressing the whole mess.


How does one actually use unzip to list an archive's contents? The simplest way is with the -l option (for “list”):

$ unzip -l quake92p.zip
Archive:  quake92p.zip
 Length    Date    Time    Name
 ------    ----    ----    ----
  36064  06-25-96  13:18   DEICE.EXE
 369135  06-27-96  03:51   QUAKE92P.1
   2618  06-27-96  03:34   README.TXT
    177  06-25-96  20:07   INSTALL.BAT
    206  06-27-96  03:54   QUAKE92P.DAT
 ------                    -------
 408200                    5 files

You have each file's name (on the right), its uncompressed size, and the date and time of its last modification. For many of us, however, especially those long steeped in the terse intricacies of ls, this is a little too short and sweet. For fans of ls, or for anyone wishing to know more about the details of the archive, unzip has an entire mode devoted to listing both useful and obscure zipfile information: zipinfo mode, triggered via the -Z option. (On some systems the zipinfo command exists as a link to unzip and is synonymous with unzip -Z, but this is not true of Slackware distributions as of this writing.) We'll limit ourselves to a description of the default zipinfo listing format:

$ unzip -Z quake92p.zip
Archive:  quake92p.zip   406075 bytes   5 files
-rwxa--     2.0 fat   36064 b- defN 25-Jun-96 13:18 DEICE.EXE
-rw-a--     2.0 fat  369135 b- stor 27-Jun-96 03:51 QUAKE92P.1
-rw-a--     2.0 fat    2618 t- defN 27-Jun-96 03:34 README.TXT
-rwxa--     2.0 fat     177 t- defN 25-Jun-96 20:07 INSTALL.BAT
-rw-a--     2.0 fat     206 t- defN 27-Jun-96 03:54 QUAKE92P.DAT
5 files, 408200 bytes uncompressed, 405569 bytes compressed:  0.6%

You will immediately recognize a certain resemblance to the output of ls -l. The header line gives the archive name, its total size, and the total number of files in it; the trailer gives the number of files listed (in this case all of them), the total uncompressed and compressed data size of the listed files (not counting internal zipfile headers), and the compression ratio. Here the ratio is quite poor, mostly due to the fact that the largest file (QUAKE92P.1) is stored without any compression. In the leftmost column are the file permissions. The next column indicates the version of the archiver, and the one after that is what tells us the files came from the FAT (DOS) file system. Next are the uncompressed file size and a column indicating which files are most likely to be binary and which are probably text. The next three columns note the compression method used on each file; the time stamps; and the full file names.
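The totals in the trailer can be checked by hand. The sketch below assumes that the ratio zipinfo reports is the percentage of bytes saved, that is (uncompressed - compressed) / uncompressed, which is consistent with the figures in the listing above; the variable names are illustrative only.

```shell
# Uncompressed sizes of the five files, in bytes, copied from the listing above.
total=$((36064 + 369135 + 2618 + 177 + 206))

# Compressed total, copied from the trailer line of the listing.
compressed=405569

# Assumed formula: percentage of bytes saved by compression, to one decimal place.
ratio=$(awk -v u="$total" -v c="$compressed" 'BEGIN { printf "%.1f%%", (u - c) * 100 / u }')

echo "$total bytes uncompressed, ratio $ratio"
```

The sum of the individual sizes matches the 408200 bytes in the trailer, and the computed ratio rounds to the 0.6% shown there.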


Now that we know what files we have, how do we actually get the files out? File extraction is as simple as typing unzip and the file name:

$ unzip quake92p
Archive:  quake92p.zip
  inflating: DEICE.EXE
  extracting: QUAKE92P.1
  inflating: README.TXT
  inflating: INSTALL.BAT
  inflating: QUAKE92P.DAT

Here we've omitted the .zip suffix; unzip first looks for the file quake92p and, not finding it, checks for quake92p.zip instead. What if we wanted only the README.TXT file? No problem. Anything (well, almost anything) after the zipfile name is taken to be the name of one of the enclosed files:

$ unzip quake92p README.TXT
Archive:  quake92p.zip
  inflating: README.TXT

Here you may notice a little snag. If you now edit this file in Linux with an editor like vi, you'll see what looks like ^M at the end of each and every line. Or, if you view the file with a pager like more, you'll discover that any line uncovered by the --More-- prompt gets erased immediately. These problems are due to the fact that DOS and its successors store text files with two end-of-line characters, CR and LF (a.k.a. carriage return and linefeed, respectively, or ^M and ^J, or CTRL-M and CTRL-J), rather than the more efficient single character (LF) used on all Unix systems. So when a Unix utility—like an editor or a pager or a compiler—looks at a DOS text file, it may behave a little oddly or die altogether.
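To see the extra character for yourself, you can dump a file byte by byte. This sketch builds its own two-line sample with DOS line endings rather than using README.TXT; the scratch file comes from mktemp and the sample text is made up for illustration.

```shell
# Create a scratch file containing two lines that end in CR LF, as DOS would write them.
sample=$(mktemp)
printf 'one\r\ntwo\r\n' > "$sample"

# od -c prints every byte; each line ends in \r \n instead of a lone \n.
od -c "$sample"

# Count the carriage returns: a DOS text file has one per line.
cr_count=$(( $(tr -cd '\r' < "$sample" | wc -c) ))
echo "carriage returns: $cr_count"
```

A file with Unix line endings would report zero carriage returns.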

Fortunately there's a simple solution: unzip's -a option. Originally a mnemonic for ASCII conversion, the option these days is used for all sorts of text-file conversions. Given once, it does its best to automatically convert files that are marked as text, while leaving alone those that are marked binary. Be careful! zip and PKZIP don't always guess correctly when creating the archive, particularly for certain classes of MS-Windows files, and unzip's “text” conversions are almost always irreversible. In other words, don't extract with auto-conversion and then delete the original zipfile without first making sure everything is okay. unzip does indicate which files it thinks are text when auto-converting, however:

$ unzip -a quake92p
Archive:  quake92p.zip
  inflating: DEICE.EXE               [binary]
  extracting: QUAKE92P.1             [binary]
  inflating: README.TXT              [text]
  inflating: INSTALL.BAT             [text]
  inflating: QUAKE92P.DAT            [text]

In this case everything worked as intended. If, for some reason, zip marked a text file as binary and you want to force text conversion, simply double the option: -aa.
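If a text file has already been extracted without -a, its line endings can still be repaired after the fact. The following is a sketch using tr, not a description of what unzip does internally; it simply deletes every CR, so like -a it should only be applied to files known to be text. The sample file is created on the spot, and both file names are placeholders from mktemp.

```shell
# Stand-in for a text file extracted without -a: two lines with CR LF endings.
dos_file=$(mktemp)
unix_file=$(mktemp)
printf 'line one\r\nline two\r\n' > "$dos_file"

# Delete every carriage return, leaving LF-only (Unix) line endings.
tr -d '\r' < "$dos_file" > "$unix_file"

# The converted copy is smaller by one byte per line.
dos_size=$(( $(wc -c < "$dos_file") ))
unix_size=$(( $(wc -c < "$unix_file") ))
echo "before: $dos_size bytes, after: $unix_size bytes"
```

Writing to a second file keeps the original intact, which matters for the same reason the zipfile should be kept: the conversion cannot reliably be undone.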

But wait, there's more! The discriminating Linux user, happily accustomed to a file system that not only preserves the case of file names but also distinguishes between names differing only in case, is not going to settle for a bunch of all-uppercase DOS file names in his or her directories. Enter the -L option. If (and only if) the file came from a single-case file system like DOS FAT or VMS, unzip -L will convert its name to lowercase upon extraction, thusly:

$ unzip -aL quake92p
Archive:  quake92p.zip
  inflating: deice.exe               [binary]
  extracting: quake92p.1             [binary]
  inflating: readme.txt              [text]
  inflating: install.bat             [text]
  inflating: quake92p.dat            [text]

Isn't that nice?
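
For files that were already extracted with their uppercase names intact, much the same effect can be had with a short loop. This is a sketch of the idea rather than what unzip -L does internally; the scratch directory and the empty stand-in files are created here purely for illustration.

```shell
# Scratch directory holding empty stand-ins for files extracted without -L.
workdir=$(mktemp -d)
touch "$workdir/DEICE.EXE" "$workdir/README.TXT" "$workdir/INSTALL.BAT"

# Rename each file to the lowercase form of its name.
for path in "$workdir"/*; do
    name=${path##*/}                                           # strip the directory part
    lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')  # lowercase the name
    if [ "$name" != "$lower" ]; then
        mv "$path" "$workdir/$lower"
    fi
done

ls "$workdir"
```

Note that mv will overwrite an existing file that already has the lowercase name, so try the loop on a copy before using it in a directory that matters.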