Filters: Doing It Your Way

A look at several of the more flexible filters, programs that read some input, perform some operation on it, and write the altered data as output.

Double Spacing

Next, we'll think about doublespacing a text file. We can do this using sed's substitute command by replacing $ (the regexp for the end of a line) with a newline character (which we have to quote with a backslash):

sed 's/$/\
/' foo

Note that in this example, there isn't a g before the second quote, unlike all the earlier examples. The g is used to tell sed that the substitution applies to all matches on each line, not just the first match on each line, which is the default behaviour. In this case, since each line only has one end, we don't need the g.
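To see what the g actually changes, here is a small sketch using a made-up input line (the text and the variable names are only for illustration, they aren't part of the earlier examples):

```shell
# A hypothetical line containing several matches of "a".
line='a cat and a hat'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/a/X/')

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/a/X/g')

printf '%s\n' "$first_only"    # X cat and a hat
printf '%s\n' "$all_matches"   # X cXt Xnd X hXt
```

With $ as the pattern there is only ever one match per line, so both forms would give the same output.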

Another way of doing this in sed would be:

sed G foo

If you look at the man page for sed, it says that G “appends a newline character followed by the contents of the hold space to the pattern space”. The pattern space is the sed term for the line currently being read, and we don't need to worry about the hold space for now (trust me, it will be empty), so this command does exactly what we want.
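If you'd rather not take the empty hold space on trust, the following sketch may help. It assumes the h command, which the sed man page describes as copying the pattern space into the hold space; the two-line input and the variable names are made up for the demonstration.

```shell
# Hold space left empty: G appends a newline and nothing else,
# so each line is followed by a blank line.
spaced=$(printf 'one\ntwo\n' | sed G)

# Hold space filled first: h copies the current line into the hold space,
# then G appends a newline plus that copy, so each line appears twice.
doubled=$(printf 'one\ntwo\n' | sed -e h -e G)

printf '%s\n' "$spaced"
printf '%s\n' "$doubled"
```

The second command is only there for contrast; for doublespacing, plain G is all we need.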

It's quite easy to doublespace in awk, using the print statement we saw earlier:

awk '{print $0; print ""}' foo

Here, the pattern is empty again, matching every line, and the action is to print the entire line, $0, then to print an empty string, "". Each print statement ends its output with a newline, so the combined effect of the two statements is to doublespace the file.

Awk actions can (and often do) involve more than one command in this way, but it isn't strictly necessary here. Awk provides a formatted print statement that gives more control over the output than the basic print statement. So we could get the same result with:

awk '{printf("%s\n\n",$0)}' foo

The first argument to the printf statement is the format, a description of how the output should appear. The format can contain characters to be printed literally (none in this example), escape sequences (such as \n for a newline), and specifications. A specification is a sequence of characters beginning with a % that controls how the rest of the arguments are printed. For each of the second and subsequent arguments, there must be a specification. In this example, there is one specification, %s, which prints a character string. The value associated with that specification is $0, the entire line. Unlike print, printf doesn't automatically start a new line, so two \n's are needed: one to end the original line and one to insert a blank line.
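As a rough illustration of a format with more than one specification, this sketch numbers each line. It uses %d (which prints an integer) and awk's built-in NR, the number of lines read so far; the sample input and the variable name are invented for the example.

```shell
# Two specifications, so two arguments follow the format:
#   %d takes NR (the line number), %s takes $0 (the whole line).
# The ": " between them is printed literally.
numbered=$(printf 'alpha\nbeta\n' | awk '{printf("%d: %s\n", NR, $0)}')

printf '%s\n' "$numbered"
# 1: alpha
# 2: beta
```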

For this seemingly simple example (doublespacing a file) we came up with four different solutions. There is usually more than one way of solving a problem, and it normally doesn't matter which one you take. The point is that you usually write an awk or sed program to do a particular task as the need arises, then discard it. You don't necessarily want the “best” solution (whatever that means); you just want something that works, and you want it quickly.
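If you want some reassurance that the four versions really do agree, a quick check along these lines should do. It is only a sketch: the three-line sample and the variable names are made up, and note that the shell's $(...) strips trailing newlines before the comparison.

```shell
# Hypothetical three-line input, fed to each command through a pipe.
input=$(printf 'first\nsecond\nthird')

a=$(printf '%s\n' "$input" | sed 's/$/\
/')
b=$(printf '%s\n' "$input" | sed G)
c=$(printf '%s\n' "$input" | awk '{print $0; print ""}')
d=$(printf '%s\n' "$input" | awk '{printf("%s\n\n",$0)}')

# Compare the captured outputs pairwise.
if [ "$a" = "$b" ] && [ "$b" = "$c" ] && [ "$c" = "$d" ]; then
    verdict=same
else
    verdict=different
fi
echo "$verdict"
```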

Being Selective

Another quite common task is to select just part of the input. Suppose we want the fifth line of the file foo. In awk, this would be

awk 'NR==5' foo

which prints the line when NR, the number of lines read so far, equals 5. The sed equivalent is:

sed -n 5p foo

By default, sed prints every line of input after all commands have been applied. The -n option suppresses this behaviour, so we only get the line we specifically ask for with the p command. In this case, we asked for the fifth line, but we could just as easily have specified a range of lines, say the third to the fifth, with:

sed -n 3,5p foo

or, in awk:

awk 'NR>=3 && NR<=5' foo

In the awk version, the && means “and”, so we want the lines where NR>=3 and NR<=5, that is, the third through the fifth lines.
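When the line numbers aren't known in advance, the same two commands can take them from shell variables. The sketch below assumes awk's -v option for passing values in; the names first and last and the seven-line sample are hypothetical.

```shell
first=3
last=5
input=$(printf '%s\n' one two three four five six seven)

# Double quotes let the shell expand the variables inside the sed address.
by_sed=$(printf '%s\n' "$input" | sed -n "${first},${last}p")

# -v hands the shell values to awk as awk variables.
by_awk=$(printf '%s\n' "$input" |
    awk -v first="$first" -v last="$last" 'NR>=first && NR<=last')

printf '%s\n' "$by_sed"
# three
# four
# five
```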

Yet another approach would be to combine head and tail:

head -n 5 foo | tail -n 3

which uses the head program to get the first 5 lines of the file, and the tail program to only pass the last three lines through.
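The 3 given to tail is just last - first + 1. One thing to watch: if the input has fewer lines than head asks for, tail counts back from a different place, and the pipeline no longer agrees with the sed version. A small sketch with invented inputs and variable names:

```shell
first=3
last=5
count=$((last - first + 1))   # 5 - 3 + 1 = 3

long_input=$(printf '%s\n' one two three four five six seven)
short_input=$(printf '%s\n' one two three four)

# Seven lines: head keeps 1-5, tail keeps the last 3 of those (3-5).
long_pipe=$(printf '%s\n' "$long_input" | head -n "$last" | tail -n "$count")

# Four lines: head keeps all 4, so tail returns lines 2-4, not 3-4.
short_pipe=$(printf '%s\n' "$short_input" | head -n "$last" | tail -n "$count")

# sed addresses lines by number, so it still starts at line 3.
short_sed=$(printf '%s\n' "$short_input" | sed -n "${first},${last}p")

printf '%s\n' "$short_pipe"   # two, three, four (one per line)
printf '%s\n' "$short_sed"    # three, four (one per line)
```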

Yet another common problem is removing only the first line. Remember how the $ character means the end of the line when it is used in a regular expression? Well, when you use it to specify a line number, it means the last line:

sed -n '2,$p' foo

In awk, you can use != or > to get the same result from either of these commands:

awk 'NR>1' foo
awk 'NR!=1' foo
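In the same spirit of more than one way to do it, here are two further possibilities that aren't covered above: sed's d command, which deletes the addressed line, and tail -n +2, which starts output at the second line. The sketch checks them against the commands from the text, using a made-up three-line input.

```shell
input=$(printf '%s\n' header row1 row2)

with_p=$(printf '%s\n' "$input" | sed -n '2,$p')     # print line 2 to the last line
with_d=$(printf '%s\n' "$input" | sed 1d)            # delete line 1, print the rest
with_awk=$(printf '%s\n' "$input" | awk 'NR>1')      # print when NR is greater than 1
with_tail=$(printf '%s\n' "$input" | tail -n +2)     # start output at line 2

printf '%s\n' "$with_d"
# row1
# row2
```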