Filters: Doing It Your Way

A look at several of the more flexible filters, probrams that read some input, perform some operation on it, and write the altered data as output.

One of the basic philosophies of Linux (as with all flavours of Unix) is that each program does one particular task, and does it well. Often you combine several programs to achieve something, either at the shell prompt or in a script, by piping the output of one program into the next. I'm talking about things like

ls -l | more


ps -auxw | \
  grep netscape >>

But what if the output of one program isn't in the format needed for the next? We need some way of processing the output of one program so that it is ready for the next.

Fortunately, there are many Linux programs that do this job: read some input, perform some operations on it, and write the altered data as the output. These programs are called filters. Some filters do quite limited tasks, such as head, grep and sort, whereas others are more flexible, such as sed and awk. In this article, we're going to look at several of these more flexible filters, and give several examples of what can be done with them.

The name “sed” is a contraction of stream editor; sed applies editing commands to a stream of data. A common use for sed is to replace one text pattern with another, as in

sed 's/Fred/Barney/g' foo

This command takes the file foo, changes every occurrence of Fred to Barney, and writes the modified version to standard output.

Note that in this example we have placed the actual sed commands inside single quotes. Sed doesn't require that commands be quoted this way, but you will need to use quotes if the sed command includes characters that are special to the shell, such as $ or *. This example doesn't have any special characters, so we could just as easily have left out the quotes. Try it and see.

Without the input file foo, sed reads from standard input, so we could achieve the same result with the command

sed 's/Fred/Barney/g' < foo


cat foo | sed 's/Fred/Barney/g'

Note that the first two versions are generally preferred to the third. Using cat just to send input into a pipe creates an extra process which can often be avoided.

We also have to consider the output. By default, the results appear on standard output, but this isn't always what we want. One option is to pipe the output through a pager, for example

sed 's/Fred/Barney/g' foo | more

or to redirect it to a file

sed 's/Fred/Barney/g' foo > bar

While it is often tempting to write

sed 's/Fred/Barney/g' foo > foo

the only thing this achieves is to delete contents of the file foo! Why? Because the first thing the shell does with this command is to open the file foo for output, destroying what was there already. When it tries to read from foo, there is nothing there to read. The result is an empty file. This is an easy mistake to make when redirecting output in this way, so do be careful.

Awk is a bit more flexible than sed; it is a full-fledged programming language in its own right. However, don't let that put you off. Writing simple programs in awk is surprisingly easy, and it often doesn't feel like a programming language [See page 46 of Linux Journal issue 25, May 1996—ED]. For example, the command

awk '{print NR, $0}' foo

prints the file foo, numbering each line as it goes. Awk can also read its input from a pipe or from standard input, exactly like sed, and also writes on standard output, unless you redirect it. The bit between the quotes (which are necessary, since the {} characters are also special characters to the shell) is the awk program. I said they can be simple, didn't I? An awk program is simply a sequence of one or more pattern-action statements, in the form

pattern { action }

Each input line is tested against each pattern in turn. When an input line matches a pattern, the corresponding action is performed. Either the pattern may be empty, in which case every line matches, or the action may be empty, in which case the default action is to print the line.

In the example above, the pattern was empty, so every line matched. The action was to print NR, which is a built-in awk variable containing the number of lines read so far, and then print $0, which is the current line.

Going On

Now that we've seen the basic idea behind sed and awk, we're going to look at some examples. The best way to learn something is to actually do it, and I recommend that you try out some of these examples yourself as you go along, possibly even with one eye on the man pages. We certainly aren't going to cover everything that sed and awk can do, but you will, it is hoped, have more confidence to try things out yourself once you've finished reading this article.

Our first example is to remove all the spaces from a document. This is easily achieved using sed:

sed 's/ *//g' foo

This is like the earlier example with Fred and Barney, only here we have used a regular expression: ' *' (the quotes are included so that you can see the space that is part of the regular expression). sed's s (for substitute) command using regular expressions just like grep. The regexp ' *' matches one or more spaces, which are replaced with nothing—they are deleted. This command doesn't deal with tabs, as it stands, but you could modify it to match one or more occurences of either a tab or a space:

sed 's/[ {tab}][ {tab}]*//g' foo