Treating Compressed and Uncompressed Data Sources the Same

December 19th, 2008 by David Sinck

Occasionally, you need to process a number of files, some of which have been compressed and some of which have not (think log files). Rather than running two variations of a command, one for compressed files and one for uncompressed files, wrap the difference in a bash function:

function data_source ()
{
 local F=$1

 # strip the .gz suffix if it's there
 F=$(echo "$F" | perl -pe 's/\.gz$//')

 if [[ -f $F ]] ; then
   cat "$F"
 elif [[ -f $F.gz ]] ; then
   # only the compressed copy exists, so decompress it to stdout
   nice gunzip -c "$F.gz"
 fi
}

This nicely allows:

for file in * ; do
 data_source "$file" | ...
done

Whether you're dealing with gzip'd files or uncompressed ones, you no longer have to treat them differently. With a little more effort, bzip2 files could also be detected and handled.
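
One way the bzip2 case might be added is sketched below. It is only an illustration, not the author's code: it assumes gzip and bzip2 are both installed, uses shell parameter expansion in place of the perl call, and adds an error message for the case where no variant of the file exists.

```shell
# Sketch: data_source extended to bzip2 (assumes gunzip and bunzip2 are installed).
data_source() {
  local f="$1"

  # Reduce the argument to its base name by stripping a known suffix.
  f=${f%.gz}
  f=${f%.bz2}

  if [ -f "$f" ]; then
    cat -- "$f"
  elif [ -f "$f.gz" ]; then
    nice gunzip -c -- "$f.gz"
  elif [ -f "$f.bz2" ]; then
    nice bunzip2 -c -- "$f.bz2"
  else
    # Not in the original function: report a missing file instead of staying silent.
    echo "data_source: no such file: $f" >&2
    return 1
  fi
}
```

As in the original, the uncompressed file wins if several variants exist side by side.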

I don't like using 'for file in *; do ...; done', as it is easy to get wrong on filenames that contain spaces unless every expansion is quoted. The -exec action of find makes avoiding this problem very easy. It might also be a good idea to print the name of each file before outputting its content in the loop.

I'd use the following:

find . -maxdepth 1 -type f -exec sh -c 'echo "$1:" ; bzcat "$1" 2>/dev/null || zcat "$1" 2>/dev/null || cat "$1" 2>/dev/null' sh '{}' \;

Or, if you are inclined to use a shell script, create "data_source.sh" and 'chmod +x' it:

#!/bin/sh
echo "$1:"
bzcat "$1" 2>/dev/null || zcat "$1" 2>/dev/null || cat "$1" 2>/dev/null

and use the following:

find . -maxdepth 1 -type f -exec ./data_source.sh '{}' \;
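
For comparison, a glob loop can also cope with spaces in filenames as long as every expansion is quoted. The sketch below is one way to write it; print_all is a hypothetical name, and the bzcat/zcat/cat fallback chain is the one from the script above.

```shell
# Sketch: quoted glob loop over the regular files in a directory.
# print_all is a hypothetical helper name.
print_all() {
  local dir="$1" file
  for file in "$dir"/*; do
    [ -f "$file" ] || continue        # skip directories and an unmatched glob
    printf '%s:\n' "$file"            # print the file name first
    bzcat "$file" 2>/dev/null || zcat "$file" 2>/dev/null || cat "$file"
  done
}
```

Calling print_all . would cover the same files as the find command, apart from hidden files, which the glob does not match.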

I'd prefer using a case

On December 30th, 2008 john.mckown says:

I'd prefer using a case structure similar to:

function data_source ()
{
local F=$1

case $F in
*.gz) zcat "$F";;
*.bz2) bzcat "$F";;
*) cat "$F";;
esac
}

This only invokes one external command to do the "cat" function. And it is very easy to extend to other possible suffixes that may be special.

Decompression in a pipe

On December 19th, 2008 roaima says:

If you're processing files in a pipe, try this:

bzcat -f "${FILE}" | zcat -f | ...

It doesn't matter whether ${FILE} is compressed with bzip2, gzip, or even not compressed at all. It just works. (Assuming you have bzcat and zcat installed, of course.)

Chris

I don't see an -f switch for

On December 30th, 2008 john.mckown says:

I don't see an -f switch for bzcat on my system. And I don't understand the piping. Do you mean something like:

bzcat $F || zcat $F || cat $F

?

I don't like the above as it is inefficient. If bzcat fails, then zcat is tried and if it fails, then normal cat is done. Of course, the good part is that this is not dependent on a file suffix.
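
A middle ground, sketched here as an assumption rather than anything proposed in the thread, is to look at the file's leading bytes and run only the matching tool. gzip data starts with the bytes 1f 8b and bzip2 data starts with "BZh" (42 5a 68). smart_cat is a hypothetical name.

```shell
# Sketch: pick the decompressor from the file's magic bytes, not its suffix.
# smart_cat is a hypothetical helper name.
smart_cat() {
  local f="$1" magic
  # First three bytes as a hex string, e.g. "1f8b08" for a typical gzip file.
  magic=$(head -c 3 -- "$f" | od -An -tx1 | tr -d ' \n')
  case $magic in
    1f8b*)  gzip -dc -- "$f" ;;    # gzip magic: 1f 8b
    425a68) bzip2 -dc -- "$f" ;;   # bzip2 magic: "BZh"
    *)      cat -- "$f" ;;         # anything else is passed through as-is
  esac
}
```

This costs one short read per file and still works on files with no suffix or the wrong one.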

Another way to decompress in a pipe

On December 25th, 2008 Anonymous (not verified) says:

Another way if bzcat is not installed...

gzip -dc "${FILE}" | ...

Perl?

On December 19th, 2008 Ian (not verified) says:

Reaching for the Advanced Bash Scripting Guide...

Check out Table B-5 String Operations and

${string%%substring} Strip longest match of $substring from back of $string
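
Applied to the tip's suffix stripping, this might look like the sketch below. The path is only an illustrative value. For a fixed string such as .gz, the shortest-match form (%) and the longest-match form (%%) give the same result.

```shell
# Sketch: strip a trailing .gz with parameter expansion, no external command.
F=/var/log/messages.1.gz      # illustrative path, not from the article
base=${F%.gz}                 # removes ".gz" from the end of $F
printf '%s\n' "$base"

G=/var/log/messages.1
printf '%s\n' "${G%.gz}"      # no .gz suffix, so the value is left unchanged
```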

Re: Perl?

On December 20th, 2008 Vance (not verified) says:

${string%%substring} Strip longest match of $substring from back of $string

Great tip. Or if you want to kick it old-school UNIX style:

F=`dirname "$F"`/`basename "$F" .gz`
