Filters: Doing It Your Way

A look at several of the more flexible filters, programs that read some input, perform some operation on it, and write the altered data as output.

When Line Numbers Are Not Enough

Selecting part of a file using line numbers is easy enough to do, but often you don't know the line numbers you want. Instead, you want to select lines based on their contents. In awk, we can easily select a line matching a pattern, with

awk '/regexp/' foo

which prints every line that contains a match for regexp. There is a direct sed equivalent of this:

sed -n '/regexp/p' foo

Of course, we can also use grep to do this kind of thing:

grep 'regexp' foo

but sed can also handle ranges easily. For example, to get all lines of a file up to and including the first line matching a regexp, you would type:

sed -n '1,/regexp/p' foo

or to get all lines including and after the first line matching regexp:

sed -n '/regexp/,$p' foo

Remember that $ means the last line in a file. You can also specify a range based on two regexps. Try

sed -n '/regexp1/,/regexp2/p' foo

Note that this prints all blocks starting with lines containing regexp1 through lines containing regexp2, not just the first one. If there isn't a matching regexp2 for a line containing regexp1, then we get all lines through to the end of the file.
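To see the three range forms side by side, here is a small sketch that runs them against a throwaway file. The file contents and the BEGIN/END markers are made up for illustration; substitute your own file and patterns.

```shell
# Build a small sample file (hypothetical contents, for illustration only).
tmp=$(mktemp)
printf '%s\n' 'one' 'BEGIN a' 'two' 'END a' 'three' \
              'BEGIN b' 'four' 'END b' 'five' > "$tmp"

# From the first line of the file through the first line matching END.
head_part=$(sed -n '1,/END/p' "$tmp")

# From the first line matching "BEGIN b" through the last line of the file.
tail_part=$(sed -n '/BEGIN b/,$p' "$tmp")

# Every block that runs from a BEGIN line through the next END line.
blocks=$(sed -n '/BEGIN/,/END/p' "$tmp")

printf '%s\n' "$blocks"
rm -f "$tmp"
```

The last command prints both the "a" block and the "b" block, which is the behaviour described above: sed starts a new range every time the first pattern matches again.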

Now we can select some part of the input, based on a regular expression.

We might want to delete some lines that contain a certain pattern. The d command does just that:

sed '/regexp/d' foo

deletes all lines that match the regexp. Or, we might want to delete a block of text:

sed '/regexp1/,/regexp2/d' foo

deletes everything from a line that contains regexp1, up to and including a line that matches regexp2. Again, sed will select all blocks of text delimited by regexp1 and regexp2, so there is a danger we could delete more than we want to.
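One way to guard against deleting too much is to preview the selection with -n and p before switching to d. The sketch below does that on a made-up sample file; the contents and the BEGIN/END markers are assumptions for the sake of the example.

```shell
# Sample input (hypothetical contents, for illustration only).
tmp=$(mktemp)
printf '%s\n' 'keep 1' '# comment' 'keep 2' 'BEGIN' 'junk' 'END' 'keep 3' > "$tmp"

# Delete every line that matches a single pattern.
no_comments=$(sed '/^#/d' "$tmp")

# Preview what a range would select before deleting anything.
preview=$(sed -n '/BEGIN/,/END/p' "$tmp")

# Same address, but now delete the selected block.
no_block=$(sed '/BEGIN/,/END/d' "$tmp")

printf '%s\n' "$no_block"
rm -f "$tmp"
```

Whatever the preview prints is exactly what the d version removes, since both commands use the same address.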

Inserting text at a given point is possible, too. The command

sed '/regexp/r bar' foo

inserts the contents of the file bar after any line that matches the regexp in the file foo.
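As a quick check of where the inserted text lands, the following sketch builds two small files named foo and bar in a scratch directory and runs the command on them. The file contents and the MARK pattern are invented for the example.

```shell
# Work in a scratch directory so the names foo and bar are safe to use.
dir=$(mktemp -d)
printf '%s\n' 'alpha' 'MARK' 'omega' > "$dir/foo"
printf '%s\n' 'inserted 1' 'inserted 2' > "$dir/bar"

# Copy the contents of bar into the output after each line matching MARK.
result=$(cd "$dir" && sed '/MARK/r bar' foo)

printf '%s\n' "$result"
rm -rf "$dir"
```

The matching line itself is kept; the contents of bar follow it, and the rest of foo carries on afterwards.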

Now, we can combine these last two commands to replace a block of text in a file with the contents of another file. We do it like this:

sed -e '/START/r bar' -e '/START/,/END/d' foo

This finds a line containing START, deletes through to a line containing END, and puts the contents of the file bar in their place. Looking at this command, one might expect the d command to delete the new text as well. It doesn't: the r command doesn't copy the file to the output until sed is about to read the next input line, by which time the d command has already run. The -e option tells sed that the next argument is a command, rather than an input file. Although it is optional when there is only one command, if we have multiple commands, they must each be preceded with -e.
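Here is a small sketch of the replacement in action. The files foo and bar are created in a scratch directory, and their contents are invented for illustration.

```shell
# Scratch directory with a file to edit (foo) and replacement text (bar).
dir=$(mktemp -d)
printf '%s\n' 'header' 'START' 'old 1' 'old 2' 'END' 'footer' > "$dir/foo"
printf '%s\n' 'new 1' 'new 2' > "$dir/bar"

# Queue bar for output at the START line, then delete START through END.
result=$(cd "$dir" && sed -e '/START/r bar' -e '/START/,/END/d' foo)

printf '%s\n' "$result"
rm -rf "$dir"
```

The output keeps header and footer and shows the two lines from bar where the START to END block used to be. Note that the START and END lines themselves are removed along with the text between them.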


These examples have mostly been line oriented, but we are just as likely to want to deal with columns of data. The filter cut can select columns of data. For example, to list the real names of all the users on your system, you could type

cut -f5 -d: /etc/passwd

The 5 argument after -f tells cut to list the fifth column (where real names are stored), and the -d flag tells cut which character delimits the fields; in the case of the password file, it's a colon. To get both the username (which is in the first column) and the real name, we could use

cut -f1,5 -d: /etc/passwd
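Since every system's password file is different, the sketch below runs the same two cut commands on a small file laid out like /etc/passwd. The two entries in it are made up for illustration.

```shell
# A stand-in for /etc/passwd with two invented entries.
tmp=$(mktemp)
printf '%s\n' \
  'root:x:0:0:Super User:/root:/bin/bash' \
  'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash' > "$tmp"

# Fifth colon-separated field only: the real name.
names=$(cut -f5 -d: "$tmp")

# First and fifth fields: username and real name.
pairs=$(cut -f1,5 -d: "$tmp")

printf '%s\n' "$pairs"
rm -f "$tmp"
```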

Awk is also good at getting at columns of data. We could do these tasks with the following awk commands:

awk -F: '{print $5}' /etc/passwd


awk -F: '{print $1,$5}' /etc/passwd

where the -F flag tells awk what character the fields are delimited by. (Do you see the difference between using cut and using awk for printing more than one field? If not, try running the commands again and looking more closely.)
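For readers who want to check their answer, this sketch feeds one invented passwd-style line to both tools and captures the results. It also shows one possible way to make awk's output match cut's, by setting awk's output field separator with -v OFS=: (an addition that is not part of the commands above).

```shell
# One made-up line in /etc/passwd format.
line='alice:x:1000:1000:Alice Example:/home/alice:/bin/bash'

# cut joins the selected fields with the input delimiter.
with_cut=$(printf '%s\n' "$line" | cut -f1,5 -d:)

# awk joins print arguments with OFS, which is a single space by default.
with_awk=$(printf '%s\n' "$line" | awk -F: '{print $1,$5}')

# Setting OFS to a colon makes awk's output look like cut's.
with_ofs=$(printf '%s\n' "$line" | awk -F: -v OFS=: '{print $1,$5}')

printf '%s\n' "$with_cut" "$with_awk" "$with_ofs"
```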

One advantage of using awk is that we can perform operations on the columns.

For example, if we want to find out how much disk space the files in the current directory take up, we could total up the fifth column of the output of ls -l:

ls -l | grep -v '^d' | \
  awk '{s += $5} END {print s}'

In this command, we use grep to remove any lines that begin with d, so we don't count directories. We chose grep, but we could just as easily have used awk or sed to do this. One pure awk solution could be:

ls -l | awk '! /^d/ {s += $5} END {print s}'

where the awk program only totals the fifth column of lines that don't begin with a d. The exclamation mark before the pattern tells awk to select lines that don't match the regular expression /^d/.
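To confirm that the two pipelines agree, the sketch below builds a scratch directory with files of known size and totals them both ways. The file names and contents are invented for the example, and it assumes the usual ls -l layout in which the size is the fifth field.

```shell
# Scratch directory with two files of known size and one subdirectory.
dir=$(mktemp -d)
printf 'hello' > "$dir/a.txt"      # 5 bytes
printf 'goodbye' > "$dir/b.txt"    # 7 bytes
mkdir "$dir/sub"

# grep drops the directory lines, awk adds up the size column.
total_grep=$(ls -l "$dir" | grep -v '^d' | awk '{s += $5} END {print s}')

# awk alone: skip lines starting with d, add up the rest.
total_awk=$(ls -l "$dir" | awk '! /^d/ {s += $5} END {print s}')

printf '%s\n' "$total_grep" "$total_awk"
rm -rf "$dir"
```

The "total" line that ls -l prints first has no fifth field, so it adds nothing to the sum in either version.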

