Filters: Doing It Your Way

A look at several of the more flexible filters, programs that read some input, perform some operation on it, and write the altered data as output.

Double Spacing

Next, we'll think about doublespacing a text file. We can do this using sed's substitute command by replacing $ (the regexp for the end of a line) with a newline character (which we have to quote with a backslash):

sed 's/$/\
/' foo

Note that in this example, there isn't a g before the second quote, unlike all the earlier examples. The g is used to tell sed that the substitution applies to all matches on each line, not just the first match on each line, which is the default behaviour. In this case, since each line only has one end, we don't need the g.
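To see what the g actually changes, here is a small sketch. The sample text and the variable names are made up for illustration; only the two sed commands matter.

```shell
# Sample input: one line containing three occurrences of "a".
line='a a a'

# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/a/b/')

# With g, every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/a/b/g')

printf '%s\n' "$first_only"    # b a a
printf '%s\n' "$all_matches"   # b b b
```

None of this matters for doublespacing, where $ can only match once per line.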

Another way of doing this in sed would be:

sed G foo

If you look at the man page for sed, it says that G “appends a newline character followed by the contents of the hold space to the pattern space”. The pattern space is the sed term for the line currently being read, and we don't need to worry about the hold space for now (trust me, it will be empty), so this command does exactly what we want.

It's quite easy to doublespace in awk, using the print statement we saw earlier:

awk '{print $0; print ""}' foo

Here, the pattern is empty again, matching every line, and the action is to print the entire line, $0, then to print nothing, "". Each print statement ends its output with a newline, so the combined effect of the two statements is to doublespace the file.

Awk actions can (and often do) involve more than one command in this way, but it isn't strictly necessary here. Awk provides a formatted print statement that gives more control over the output than the basic print statement. So we could get the same result with:

awk '{printf("%s\n\n",$0)}' foo

The first argument to the printf statement is the format, a description of how the output should appear. The format can contain characters to be printed literally (none in this example), escape sequences (such as \n for a newline), and specifications. A specification is a sequence of characters beginning with a % that controls how the rest of the arguments are printed. For each of the second and subsequent arguments, there must be a specification. In this example, there is one specification, %s, which prints a character string. The value associated with that specification is $0, the entire line. Unlike print, printf doesn't automatically start a new line, so two \n's are needed: one to end the original line and one to insert a blank line.
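As a rough illustration of a format with more than one specification and some literal characters, the sketch below numbers each line. The sample input and the variable name are assumptions made for the example; %d is the specification for a decimal integer.

```shell
# Two specifications, two values: %d takes NR and %s takes $0.
# The ": " between them is printed literally, and \n ends each line.
numbered=$(printf 'alpha\nbeta\n' | awk '{printf("%d: %s\n", NR, $0)}')

printf '%s\n' "$numbered"
# 1: alpha
# 2: beta
```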

For this seemingly simple example—doublespacing a file—we came up with four different solutions. There is always more than one way of solving a problem, and it normally doesn't matter which one you take. The point is that you usually write an awk or sed program to do a particular task as the need arises, then discard it. You don't necessarily want the “best” solution (whatever that means), you just want something that works, and you want it quickly.
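If you'd like to convince yourself that the four commands really are interchangeable, something like the following sketch should do it. The temporary directory, the three-line sample file and the out1 to out4 names are all invented for the test, and mktemp -d is assumed to be available (it is on most current systems, but it is not part of POSIX).

```shell
# Build a small sample file in a scratch directory.
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/foo"

# The four doublespacing commands from this section.
sed 's/$/\
/' "$tmp/foo" > "$tmp/out1"
sed G "$tmp/foo" > "$tmp/out2"
awk '{print $0; print ""}' "$tmp/foo" > "$tmp/out3"
awk '{printf("%s\n\n",$0)}' "$tmp/foo" > "$tmp/out4"

# cmp -s prints nothing and succeeds only if the two files are identical.
same=yes
for f in out2 out3 out4; do
    cmp -s "$tmp/out1" "$tmp/$f" || same=no
done

# Three input lines should have become six output lines.
lines=$(awk 'END {print NR}' "$tmp/out1")

rm -rf "$tmp"
echo "identical: $same, lines: $lines"
```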

Being Selective

Another quite common task is to select just part of the input. Suppose we want the fifth line of the file foo. In awk, this would be:

awk 'NR==5' foo

which prints the line when NR, the number of lines read so far, equals 5. The sed equivalent is:

sed -n 5p foo

By default, sed prints every line of input after all commands have been applied. The -n option suppresses this behaviour, so we only get the line we specifically ask for with the p command. In this case, we asked for the fifth line, but we could just as easily have specified a range of lines, say the third to the fifth, with:

sed -n 3,5p foo

or, in awk:

awk 'NR>=3 && NR<=5' foo

In the awk version, the && means “and”, so we want the lines where NR>=3 and NR<=5, that is, the third through the fifth lines.
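When the line numbers aren't fixed, both versions can take them from shell variables. The sketch below is one way of doing that; the names first and last and the seven-line sample are assumptions made for the example. Note the double quotes around the sed script, so that the shell fills in the numbers, and awk's -v option, which hands the values to awk as variables.

```shell
first=3
last=5

# Seven-line sample input: l1 through l7, one per line.
sample=$(printf '%s\n' l1 l2 l3 l4 l5 l6 l7)

# sed: the shell expands the variables inside the double quotes.
with_sed=$(printf '%s\n' "$sample" | sed -n "${first},${last}p")

# awk: -v assigns each value to an awk variable before any input is read.
with_awk=$(printf '%s\n' "$sample" |
    awk -v first="$first" -v last="$last" 'NR>=first && NR<=last')

printf '%s\n' "$with_sed"
# l3
# l4
# l5
```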

Yet another approach would be to combine head and tail:

head -n 5 foo | tail -n 3

which uses the head program to get the first five lines of the file, and the tail program to pass only the last three of those through.
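The 3 given to tail isn't arbitrary: the third through the fifth lines make 5 - 3 + 1 = 3 lines. A sketch of the general case, with first, last and the sample input made up for illustration:

```shell
first=3
last=5

# Number of lines in the range, counting both ends.
count=$((last - first + 1))

# Keep the first $last lines, then the final $count of those.
with_head_tail=$(printf '%s\n' l1 l2 l3 l4 l5 l6 l7 |
    head -n "$last" | tail -n "$count")

printf '%s\n' "$with_head_tail"
# l3
# l4
# l5
```

One thing to watch: if the file has fewer than $last lines, tail still counts back from wherever the input ends, so the lines you get are no longer the ones you asked for. The sed and awk versions don't have that problem.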

Another common task is removing only the first line. Remember how the $ character means the end of the line when it is used in a regular expression? Well, when you use it as a line number, it means the last line, so the range 2,$ covers everything except the first line:

sed -n '2,$p' foo

In awk, you can use != or > to get the same result from either of these commands:

awk 'NR>1' foo
awk 'NR!=1' foo
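Here is a quick sketch that runs all three on the same input, assuming a made-up three-line sample with a header line to be dropped. The single quotes around the sed script matter: inside double quotes, the shell would try to expand $p as a variable before sed ever saw it.

```shell
# Sample input: a header line followed by two data lines.
sample=$(printf '%s\n' header row1 row2)

# Single quotes keep the shell from treating $p as a shell variable.
via_sed=$(printf '%s\n' "$sample" | sed -n '2,$p')
via_awk_gt=$(printf '%s\n' "$sample" | awk 'NR>1')
via_awk_ne=$(printf '%s\n' "$sample" | awk 'NR!=1')

printf '%s\n' "$via_sed"
# row1
# row2
```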