Introduction to Gawk

For many simple programming problems, awk is an excellent solution. Let Ian Gordon show you how to make your life easier.

How often have you thought to yourself, “I should write a program to do that!” only to realize that you will have to write more than just the code needed to solve the problem at hand? Your program will probably need to get the names of data files from the command line, open and read these files, and allocate and manage memory for data storage. This programming overhead can be a lot of effort to write and debug. To make this programming task even less appealing, what if you need this program “right now” and it may be used only once or twice? Does writing this program still seem worth all the effort? If you are using one of the more traditional languages, such as C or C++, perhaps not. However, the awk programming language may be just the right tool for writing the programs you need while minimizing the programming overhead.

gawk, the GNU version of the powerful awk programming language, lets you concentrate on writing the code to solve the problem at hand without worrying about all the overhead required to actually make your program do its job. gawk offers many features designed to help you quickly write useful and powerful programs. With features such as pattern-matching, associative arrays, automatic handling of command-line argument files, and no need for variable declarations, gawk is able to free you from many of the tiresome details that often get in the way of getting the job done.

gawk is suitable for a wide range of tasks, from simple one-line programs to complex applications that will be used on a regular basis. gawk is also a simpler, easier-to-use alternative to Perl. Although Perl programs will often run faster than comparable gawk programs, the syntax and features of gawk are (in my opinion) easier to read and less prone to obfuscation.

C programmers will find that parts of gawk are already quite familiar to them. In many ways, the syntax of gawk looks very much like the syntax of C, with constructs such as pre- and post-increment and decrement operators, nestable if-else blocks, for loops which look exactly like those in C, and even the familiar { and } delimiting sections of code. This close similarity to C is not such a surprise when you consider that one of the originators of the awk programming language, Brian Kernighan, was also closely involved with C as co-author of the classic book The C Programming Language.
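To make the resemblance concrete, here is a small sketch of C-style control flow in awk. It assumes the awk on your system is gawk or another POSIX-compatible awk (substitute gawk for awk if you prefer), and the variable names are invented for the illustration. The whole program sits in a BEGIN block, a construct explained later in this article, so it needs no input.

```shell
# A for loop and an if-else written exactly as a C programmer would expect.
# Note that the loop variable i is never declared.
result=$(awk 'BEGIN {
    for (i = 0; i < 3; i++) {
        if (i % 2 == 0) {
            print i, "is even"
        } else {
            print i, "is odd"
        }
    }
}')
printf '%s\n' "$result"
```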

However, beyond this similarity in syntax, awk is a language quite unlike the traditional languages in most common use today.

In this article I will describe the more basic features of working with gawk, the GNU version of awk. There will be many parts of this language that I cannot cover here—for these you will need to consult one of the sources listed in the reference section at the end. Although I will be describing gawk, the features discussed here should be applicable to most versions of the awk programming language. As such, the names gawk and awk are often used interchangeably.

In keeping with the tradition set by countless authors writing about a programming language, here is the ever-popular “Hello World” program written in awk:

BEGIN { print "Hello World" }

Before I explain how to run this program, I will describe how a gawk program, or script, works.

Pattern Matching

A major difference between gawk and most other languages is that gawk is a pattern-matching language. That is, gawk scans its input looking for patterns which have been specified in the gawk program, and executes the block of gawk code associated with that pattern. A gawk program, or script, consists of one or more patterns which the programmer wishes to match against each line of input, and the corresponding action blocks (enclosed between { and }) which are to be executed when that pattern is found in an input line. So a gawk program has the form:

pattern1 { action1 }
pattern2 { action2 }
patternN { actionN }

These patterns, which can consist of a simple expression, a regular expression, a combination of patterns, or even an empty pattern, can be as simple or as complex as needed. To print all lines in a file which contain the word “Linux”, the pattern is simply defined as /Linux/ and the action block is {print}. Thus, the complete gawk program can be written as:

/Linux/ { print }

Action blocks consist of one or more gawk statements enclosed between { and }. In this simple example, the print statement will print every line which contains the pattern “Linux”. However, this program will also match such words as “LinuxKernel”; the pattern does not have to be a discrete word. Also, since pattern matching is case-sensitive by default, it will not match the pattern “linux”.
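The behaviour described above can be checked with a short experiment. This is only a sketch: the four sample lines are invented for the illustration, and the program is passed to awk on the command line inside single quotes.

```shell
# Hypothetical sample input, one record per line.
sample='Linux is a kernel
LinuxKernel mailing list
I run linux at home
FreeBSD is not mentioned'

# Print every line that contains the characters "Linux".
result=$(printf '%s\n' "$sample" | awk '/Linux/ { print }')
printf '%s\n' "$result"
```

The first two lines are printed. The line with “LinuxKernel” matches because the pattern is not limited to whole words, and the line with lowercase “linux” is skipped because matching is case-sensitive.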

If you need to match both upper and lower case, the pattern can be changed to allow for this; it just becomes a more complex pattern. If you wanted the pattern to match both “linux” and “Linux”, you could write the pattern as /[Ll]inux/. In this case, you are telling gawk to look for groups of characters that begin with any one of the characters enclosed in the square brackets (here, either an upper or lower case “L”) followed by the lowercase letters “inux”. Other options for dealing with case sensitivity are to use the built-in functions tolower() or toupper() to change the case of the input line (or just parts of the line) before the pattern matching takes place, or to set the built-in variable IGNORECASE (in awk, built-in variables are always written in upper case) to any non-zero value at the start of your program. Note that IGNORECASE is a gawk extension which other versions of awk do not support.
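As a rough comparison of these approaches, the sketch below runs the bracket pattern and a tolower() version over the same invented sample lines. It uses two things not yet introduced in the text: $0, which is awk's name for the current input line, and the ~ operator, which tests a string against a pattern. The IGNORECASE variant appears only as a comment because it works in gawk alone.

```shell
# Hypothetical sample input.
sample='Linux is a kernel
I run linux at home
LINUX in capitals
FreeBSD here'

# Bracket expression: accepts "Linux" or "linux", but not "LINUX".
bracket=$(printf '%s\n' "$sample" | awk '/[Ll]inux/ { print }')

# tolower(): fold the whole line to lower case, then match, so any mix of case is accepted.
folded=$(printf '%s\n' "$sample" | awk 'tolower($0) ~ /linux/ { print }')

# gawk only (not run here):
#   gawk 'BEGIN { IGNORECASE = 1 } /linux/ { print }'

printf 'bracket:\n%s\n\nfolded:\n%s\n' "$bracket" "$folded"
```

The bracket version prints the first two lines, while the tolower() version also prints the line containing “LINUX”.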

Patterns in gawk can be as simple or as complex as needed to match the desired item in the input line. If you do not specify a pattern, the action block will be executed for every line of input. This is known as an empty pattern. So if you do not explicitly put a pattern into your program, gawk treats the lack of a pattern as a pattern that will match everything in the input.
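A minimal sketch of an empty pattern follows. The two input lines and the "saw:" label are made up for the example, and $0 is awk's name for the current input line.

```shell
# Hypothetical sample input.
sample='first line
second line'

# No pattern in front of the action block, so it runs for every input line.
result=$(printf '%s\n' "$sample" | awk '{ print "saw:", $0 }')
printf '%s\n' "$result"
```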

Alternatively, if you specify a pattern but no action, gawk will provide a default action, namely {print}, for you. So the simple program above can be rewritten as /Linux/, although it is usually better to define an action explicitly, since this results in more readable code.
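The equivalence of the two forms can be seen by running both over the same input. This is a small sketch with invented sample lines.

```shell
# Hypothetical sample input.
sample='Linux is a kernel
FreeBSD here
LinuxKernel mailing list'

# Explicit action.
with_action=$(printf '%s\n' "$sample" | awk '/Linux/ { print }')

# Pattern only: awk supplies the default action, which prints the matching line.
without_action=$(printf '%s\n' "$sample" | awk '/Linux/')

printf '%s\n' "$without_action"
```

Both variables end up holding the same two lines.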

gawk also defines several special patterns which do not match any input at all, the most commonly used being BEGIN and END. The action block associated with BEGIN will be executed only once, before gawk starts to read the input files, and allows you to take care of any setup and initialization details that may be needed. The action block for the END pattern will be executed after the processing of all input has been completed and is useful for printing any final results from your program. The BEGIN and END patterns are optional; you include them only when there is a need for them.
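Here is one way BEGIN and END might be combined with an ordinary pattern. It is a sketch only: the sample lines, the heading text and the variable name count are all invented for the illustration.

```shell
# Hypothetical sample input.
sample='Linux is a kernel
FreeBSD here
LinuxKernel mailing list'

result=$(printf '%s\n' "$sample" | awk '
BEGIN   { print "Lines mentioning Linux:" }   # runs once, before any input is read
/Linux/ { count++; print }                    # runs for each matching line
END     { print "Total:", count + 0 }         # runs once, after all input is read
')
printf '%s\n' "$result"
```

The variable count is never declared or initialized. Adding 0 in the END block simply makes sure a number is printed even when no line matched.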

However, if you wish to write a gawk script that takes no input at all (for example, the ever-popular “Hello World” program that was shown earlier), your gawk statements must be enclosed in the action block for the BEGIN pattern. Otherwise, gawk will see them as part of the main input loop block (described next) and wait for input, running them once for each line it reads and printing nothing at all if it receives only a Control-D, which is probably not what you want to happen in this case.
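The difference can be sketched as follows, assuming the program is given to awk directly on the command line. The two-line input in the second command is invented for the illustration.

```shell
# With BEGIN: awk prints the greeting and exits without reading any input.
hello=$(awk 'BEGIN { print "Hello World" }')

# Without BEGIN: the action is part of the main input loop,
# so it runs once for every line of input (two lines here).
per_line=$(printf 'a\nb\n' | awk '{ print "Hello World" }')

printf 'with BEGIN:\n%s\n\nwithout BEGIN:\n%s\n' "$hello" "$per_line"
```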



How Slow??

You say gawk is slower than Perl. Do you know how much slower? Are there any benchmarks? I've heard that there is an AWK compiler. Do you know anything about it?