The awk Utility

An introduction to the Linux data manipulation tool called awk.
awk Invocation

At least two distinct methods can be used to invoke awk. The first places the awk script inline on the command line. The second allows the programmer to save the awk script in a file and refer to it on the command line.

Examine the two invocation styles below, formatted in the typical man page notation.

awk -Fc 'program text' [data-file-list ...]
awk -Fc -f script_file [data-file-list ...]

Notice that data-file-list is always optional, since by default awk reads from standard input. I almost always use the second invocation method, since most of my awk scripts are more than 10 lines. As a general rule, it is a good idea to maintain your awk script in a separate file if it is of any significant size. This is a more organized way to maintain source code and allows for separate revision control and readable comment statements. The -F option controls the input field-delimiter character, which I will cover in detail later. The following are all valid examples of invoking awk at a shell prompt:

$ ls -l | awk -f script_file
$ awk -f script_file data_file
$ awk -F: '{ print $2 }'
$ awk '{ print }' input_file
As you will see through examples, awk programming is a process of overriding levels of default actions. The last example above is perhaps the simplest example of invoking awk; it prints each line in the given input file to standard output.

The Language

If you acquire a thorough understanding of awk's behavior, the complexity of the language syntax won't appear to be so great. To provide a smooth introduction, I will avoid examples that take advantage of regular expressions (see “A Word About Regular Expressions”). awk offers a very well-defined and useful process model. The programmer is able to define groups of actions to occur in sequence before any data processing is performed, while each input record is processed, and after all input data has been processed.

With these groups in mind, the basic syntactical format of any awk script is as follows:

BEGIN {
        # pre-processing actions go here
}
{
        # main input loop: actions applied to each input record
}
END {
        # post-processing actions go here
}

The code within the BEGIN section is executed by awk before it examines any of its input data. This section can be used to initialize user-defined variables or change the value of a built-in variable. If your script is generating a formatted report, you might want to print out a heading in this section. The code within the END section is executed by awk after all of its input data has been processed. This section would obviously be suitable for printing report trailers or summaries calculated on the input data. Both the END and BEGIN sections are optional in an awk script. The middle section is the implicit main input loop of an awk script. This section must contain at least one explicit action. That action can be as simple as an unconditional print statement. The code in this section is executed each time a record is encountered in the input data set. By default, a record delimiter is a line-feed character. So by default, a record is a single line of text. The programmer can redefine the default value of the record delimiter.
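The three sections can be seen in action in a minimal, self-contained example that counts the records on standard input (the strings printed here are arbitrary):

```shell
printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN { n = 0; print "report header" }   # runs before any input is read
      { n++ }                            # runs once per input record
END   { print n " records processed" }   # runs after all input is consumed
'
```

Run as shown, this prints the header first, then "3 records processed".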

The following input data text will be assumed in each of the following examples. The content of the data is somewhat silly, but serves the exercise well. You can imagine it representing a produce inventory; each line defines a produce category, a particular item and an item count.

fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2

We will start off very simply and quickly work into something non-trivial. Notice that I make a habit of always defining each of the three sections, even if the optional sections are stubbed out. This serves as a good visual placeholder and reminds the programmer of the entire process model even if certain sections are not currently useful. Be aware that each of the examples could be collapsed into shorter scripts without any loss of functionality. My intent here is to demonstrate as many awk features as possible through these few examples.

Listing 1.
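A script consistent with the discussion of Listing 1 might look like the following sketch (the exact original listing may differ in detail). The sample inventory is written to a file first so the example is self-contained:

```shell
# Sample produce inventory from the article.
cat > input_data <<'EOF'
fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2
EOF

awk '
BEGIN {
}
# (1) -- implicit main input loop
{
        if ($0 == "")           # skip empty input lines
                next
        # (2) -- print records of category fruit:
        if ($1 == "fruit:")
                print $0
}
END {
}
' input_data
```

Saving the awk program to its own file and invoking it with `awk -f` works identically.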

Look at the example script in Listing 1 and try to relate it to its output:

fruit: oranges 10
fruit: peaches 11
fruit: plums 11
fruit: tomatoes 2

By default, an input record is a line-feed-terminated section of text, so if the input contains six lines, the implicit main loop marked by the # (1) comment executes six times. awk source-code comments begin with a # character; the interpreter ignores everything from the # to the end of the line (the same comment style as the UNIX shell). The built-in variable $0 always contains the entire current record (see the built-in variable table below). The line below the (1) marker checks whether the current input record is an empty line. If it is, awk goes on to read the next input record. Each field within a record is assigned to an ordered variable: $1 through $N, where N is the number of fields in the current record. What determines a field? By default, the field separator is any "white space" (a space or tab character), and the separator character can be redefined. The line below the # (2) comment prints the entire record if the first field is set to fruit:. So, in the output produced by Listing 1, all lines of type fruit: are displayed.
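The field variables are easy to see with a one-record example using the default white-space separator:

```shell
# NF holds the field count; $1..$NF hold the individual fields.
echo 'fruit: oranges 10' | awk '{ print NF; print $1; print $3 }'
```

This prints 3, then fruit:, then 10, one per line.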

Listing 2.
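A sketch consistent with the discussion of Listing 2 (again, the exact original listing may differ in detail). Note that with the six sample records shown earlier, the total comes out as 6 rather than the 5 in the output below; the counting logic, not the exact totals, is the point:

```shell
cat > input_data <<'EOF'
fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2
EOF

awk '
BEGIN {                         # (1) -- initialize variables
        FCOUNT = 0
        COUNT = 0
        TYPE = "fruit:"
}
{
        if ($0 == "")           # (2) -- skip empty input lines
                next
        COUNT++                 # (3) -- count every record
        if ($1 == TYPE) {       # (4) -- select and count fruit: records
                FCOUNT++
                print $0
        }
}
END {                           # (5) -- print the data summary
        printf("%d out of %d entries were of type %s.\n", FCOUNT, COUNT, TYPE)
}
' input_data
```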

Take a look at the example script in Listing 2 and try to relate it to its output below. The only noticeable enhancement is the data summary at the end, stating how many of the total entries were of type fruit:.

fruit: oranges 10
fruit: peaches 11
fruit: plums 11
fruit: tomatoes 2
4 out of 5 entries were of type fruit:.

This time, we made use of the two optional BEGIN and END sections of the awk script. The group of statements preceded by the # (1) comment initializes some programmer-defined variables: FCOUNT, COUNT and TYPE, representing the number of fruit: records encountered, the total number of records and the produce-category name. Notice that the line preceded by the # (3) comment unconditionally increments the record counter (the ++ syntax is borrowed from the C language). The section of code preceded by the # (4) comment now references the TYPE variable instead of a literal string, and increments the FCOUNT variable. The next section of code uses the printf built-in function (which works much as the C-library printf does, with slightly different syntax) to print out a sub-count and a total count.

Listing 3.
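A sketch consistent with the discussion of Listing 3 and its output. One detail here is an assumption: the short-supply threshold (fewer than 5 units). Also, because FILENAME is not yet set when the BEGIN section runs in most awk implementations, the heading is printed when the first record is read:

```shell
cat > input_data <<'EOF'
fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2
EOF

awk '
BEGIN {                                 # (1) -- initialize variables
        FCOUNT = 0; VCOUNT = 0; OTHER = 0
        COUNT = 0; BAD = 0; SHORT = 0
}
NR == 1 {                               # heading printed here because
                                        # FILENAME is empty inside BEGIN
        printf("Parsing inventory file \"%s\"\n", FILENAME)
}
{
        if ($0 == "")                   # (2) -- skip empty input lines
                next
        if (NF != 3) {                  # (3) -- flag and skip bad records
                printf("Bad data encountered: %s\n", $0)
                BAD++
                next
        }
        COUNT++                         # (4) -- count valid records
        if ($1 == "fruit:")             # (5) -- count by produce category
                FCOUNT++
        else if ($1 == "vegetable:")
                VCOUNT++
        else
                OTHER++
        if ($3 < 5) {                   # fewer than 5 left: short supply
                                        # (the threshold is an assumption)
                printf("Short on %s: %d left\n", $2, $3)
                SHORT++
        }
}
END {
        printf("%d out of %d entries were of type Fruit.\n", FCOUNT, COUNT)
        printf("%d out of %d entries were of type Vegetable.\n", VCOUNT, COUNT)
        printf("%d out of %d entries were of type Other.\n", OTHER, COUNT)
        printf("%d out of %d entries were flagged as bad data.\n", BAD, COUNT)
        printf("%d out of %d entries were flagged in short supply\n", SHORT, COUNT)
}
' input_data
```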

Look at the example script in Listing 3 and try to relate it to its output. Notice that the only records displayed are those which were flagged as an error and those indicating a supply shortage. The summarization at the end of the output now includes additional information. Output from Listing 3:

Parsing inventory file "input_data"
Bad data encountered: vegetable: carrots
Short on tomatoes: 2 left
4 out of 5 entries were of type Fruit.
1 out of 5 entries were of type Vegetable.
0 out of 5 entries were of type Other.
1 out of 5 entries were flagged as bad data.
1 out of 5 entries were flagged in short supply

In this third example, we make further use of the two optional BEGIN and END sections. Once again, the BEGIN section initializes some programmer-defined variables. It also prints out a heading that indicates the name of the input file (the built-in variable FILENAME is referenced). Notice the new code section preceded by the # (3) comment. The NF variable is a built-in that always contains the number of fields contained in the current record. Since white space is still our field delimiter, we would always expect three fields. This code section flags and displays a record that is deemed bad data. Also, a counter maintaining the number of errors is incremented. Since records deemed invalid are useless, the program then goes on to process the next input record. The code section preceded by the # (5) comment was altered to maintain additional counts based on the produce category type.

Now let's assume a system administrator is asked to determine what proportion of users run each of the standard login shells: the Bourne shell, the Korn shell and the C shell. The script should provide a breakdown of usage by total count and percentage, and flag the entries where a login shell was not applicable or not assigned to a system user. Examine the script in Listing 4; it satisfies our requirement. Relate the code to its output in Listing 5.

Listing 4.
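A sketch along the lines of Listing 4, run here against a hypothetical /etc/passwd excerpt so the example is self-contained. The shell paths tested (/bin/sh, /bin/ksh, /bin/csh) and the report wording are assumptions; on a real system you would point awk at /etc/passwd itself:

```shell
# Hypothetical passwd excerpt (real systems vary).
cat > passwd_sample <<'EOF'
root:x:0:0:root:/root:/bin/sh
lp:x:4:7:lp daemon:/var/spool/lpd:
alice:x:100:100:Alice:/home/alice:/bin/ksh
bob:x:101:100:Bob:/home/bob:/bin/csh
carol:x:102:100:Carol:/home/carol:/bin/sh
EOF

awk '
BEGIN {
        FS = ":"                        # passwd fields are colon-separated
        SH = 0; KSH = 0; CSH = 0; NONE = 0
}
{
        if ($7 == "") {                 # no login shell assigned
                printf("No shell assigned for user %s\n", $1)
                NONE++
        }
        else if ($7 == "/bin/sh")
                SH++
        else if ($7 == "/bin/ksh")
                KSH++
        else if ($7 == "/bin/csh")
                CSH++
}
END {                                   # usage breakdown with percentages
        printf("Bourne shell: %d (%.0f%%)\n", SH, 100 * SH / NR)
        printf("Korn shell:   %d (%.0f%%)\n", KSH, 100 * KSH / NR)
        printf("C shell:      %d (%.0f%%)\n", CSH, 100 * CSH / NR)
        printf("No shell:     %d (%.0f%%)\n", NONE, 100 * NONE / NR)
}
' passwd_sample
```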

Listing 5.

The first thing worth noticing in the Listing 4 script is the assignment to the built-in variable FS, the input field delimiter. Entries in the /etc/passwd file are made up of colon-separated fields, and field 7 indicates which program (shell) is run on behalf of that user at login time. Entries with an empty field 7 are printed out, then the summary report is printed.

Thus far, we have reviewed awk's behavior through several small examples of code. The features demonstrated provide a working foundation. You have seen the execution flow of an awk process. You have seen built-in and user-defined variables being manipulated. And you have seen a few built-in awk functions applied. As with any high-level language, one can be very creative with awk. Once you get comfortable, you will want to put it to more sophisticated use. Most Linux systems today offer the features of nawk (new awk), which was developed in the late 1980s. nawk and GNU's gawk make it possible to do the following within an awk script:

  • Include programmer-defined functions.

  • Execute external programs and process the results.

  • Manipulate command line arguments more easily.

  • Manage multiple I/O streams.
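For instance, the following sketch (run with nawk or gawk) defines a programmer-defined function and captures the output of an external program; the command and variable names are arbitrary:

```shell
awk '
function celsius(f) {                   # programmer-defined function
        return (f - 32) * 5 / 9
}
BEGIN {
        print celsius(212)              # prints 100
        "uname" | getline osname        # run an external program and
        close("uname")                  # read one line of its output
        print "running on " osname
}
'
```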

Table 1.

Table 2.

As a reference, Tables 1 and 2 define the most common built-in variables and functions. Also, note that the following operators each have the same meaning in awk as they do in C (refer to the awk man page):

* / % + - = ++ -- += -= *= /= %=
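For example, the C-style assignment and increment operators compose as you would expect:

```shell
awk 'BEGIN {
        x = 10
        x += 5          # x is now 15
        x *= 2          # x is now 30
        x++             # x is now 31
        print x % 7     # remainder: prints 3
}'
```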

