Introduction to Gawk

For many simple programming problems, awk is an excellent solution. Let Ian Gordon show you how to make your life easier.
Control Structures

The control statements in the gawk language closely resemble those found in C, which makes gawk easier for C programmers to write and understand. gawk has the pre- and post-increment and decrement operators ++ and --, as well as an if-else statement that looks very much like the one found in C. Multi-line blocks of code are likewise grouped within { and }. Even the for loop seems to have been taken right out of a C programming book.
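As a rough illustration of the C-like syntax described above, the following sketch sums the integers 1 through 5 and then branches on the result. The variable names (total, i, control_demo) are made up for this example, and the command is written as awk on the assumption that it is available; nothing here is specific to gawk, so gawk can be substituted.

```shell
# Sum 1..5 with a C-style for loop, then classify the total with if-else.
control_demo=$(awk 'BEGIN {
    total = 0
    for (i = 1; i <= 5; i++) {      # three-part for loop and ++, as in C
        total += i
    }
    if (total > 10) {               # braces group multi-line blocks
        print "large: " total
    } else {
        print "small: " total
    }
}')
echo "$control_demo"
```

Apart from the missing declarations, the body of the BEGIN block would look at home in a C source file.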

This allows you to “mix and match” code which takes advantage of gawk's pattern matching with code that uses more traditional control structures. If patterns are not sufficient for your task (or you are not sure how to use them to accomplish it), you can use standard programming techniques as well. Conventional programming with gawk is not covered here; the gawk info page (run info gawk) documents this well, and the goal of this article is to demonstrate gawk's distinguishing features.
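A small sketch of such mixing might look like the following. The input (a name and a score on each line) and the variable names passed, best and top are hypothetical, chosen only for illustration. A pattern selects the lines of interest, and an ordinary if statement inside the action tracks the highest score.

```shell
# Hypothetical input: one name and one score per line.
mix_demo=$(printf 'alice 82\nbob 47\ncarol 91\n' | awk '
$2 >= 50 {                 # pattern: only lines whose second field is 50 or more
    passed++
    if ($2 > best) {       # conventional if statement inside the action
        best = $2
        top = $1
    }
}
END { print passed " passed, top score " best " by " top }')
echo "$mix_demo"
```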

Variables and Arrays

Another timesaving feature of gawk is that there is no need to declare a variable before using it. A variable can be a string, an integer, or a floating point number depending on the value assigned to it. gawk will handle conversions for you automatically. As a result, an expression such as total = 2 + "3" is valid and will give the expected result, 5. To make your job even easier, gawk will initialize each variable when it is used for the first time, setting it to 0 or "" depending on whether it is used as a number or a string, respectively. This takes away any worries about uninitialized variables.
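The conversion and initialization rules can be checked with a short sketch like this one. The names count and label are placeholders that are deliberately never assigned, and awk is assumed to be on the path (gawk behaves the same way here).

```shell
# Automatic string-to-number conversion and default values of unset variables.
var_demo=$(awk 'BEGIN {
    total = 2 + "3"            # the string "3" is converted to the number 3
    print total
    print count + 1            # count was never assigned, so it acts as 0 here
    print "[" label "]"        # label was never assigned, so it acts as "" here
}')
echo "$var_demo"
```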

gawk also carries this ease of use of variables to arrays. There is no need to declare an array before using it, or even to specify a maximum size for that array. To create an array, simply use it and gawk will allocate the required space for you. As you add more data to the array, its size will automatically expand to accommodate it.
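For instance, a sketch along these lines fills an array with a thousand elements without declaring it or stating its size anywhere. The array name squares and the limit of 1000 are arbitrary choices for the example.

```shell
# The array comes into existence on first assignment and grows as needed.
grow_demo=$(awk 'BEGIN {
    for (i = 1; i <= 1000; i++) {
        squares[i] = i * i       # no declaration and no maximum size
    }
    print squares[1000]
}')
echo "$grow_demo"
```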

However, the array indices in gawk differ from those in languages such as C, in that gawk indices are associative, rather than numeric.

In an associative array, the array index is associated with the value assigned to it. This means that you can write expressions such as theArray["text"]="this is a line". If you wish, you can still use an integer as the index, as in theArray[50] = "some value". It is also possible to use a mixture of strings, integers, and even floating point numbers as indices in the same array, since gawk treats all indices as strings. So the expression theArray[50] = "some value" is equivalent to theArray["50"] = "some value".
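The equivalence of numeric and string indices can be seen in a sketch such as the following, which stores elements under a number, a string and a floating point value and then reads them back using string indices only. The third element is an extra made up for this example.

```shell
# Every index is stored as a string, so 50 and "50" name the same element.
index_demo=$(awk 'BEGIN {
    theArray[50] = "some value"
    theArray["text"] = "this is a line"
    theArray[2.5] = "a floating point index"
    print theArray["50"]          # same element as theArray[50]
    print theArray["text"]
    print theArray["2.5"]         # same element as theArray[2.5]
}')
echo "$index_demo"
```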

To make working with arrays as easy as possible, gawk provides the programmer with several powerful array operators. To test whether an index is present in an array, you can use the in operator. For example:

if (someValue in theArray) {
   # action to take if someValue is an index of theArray
} else {
   # an alternate action if it is not present
}
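One detail worth seeing in action is that the in operator looks at the indices of the array, not at the values stored in it. The sketch below, with made-up contents, stores "red" under the index "apple" and then runs three membership tests.

```shell
# The in operator tests for an index; stored values are not searched.
in_demo=$(awk 'BEGIN {
    theArray["apple"] = "red"
    if ("apple" in theArray) {
        print "apple is an index"
    }
    if (!("red" in theArray)) {
        print "red is a value, not an index"
    }
    if (!("pear" in theArray)) {
        print "pear is absent"
    }
}')
echo "$in_demo"
```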

To perform an action on every element of an array, such as printing each value contained in it, you can use a variation of the for loop:

for (i in theArray) print theArray[i]

On each pass through the loop, gawk sets the variable i to the next index in theArray, and the print statement outputs the value stored under that index. The order in which the indices are visited is not specified.
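A complete, runnable sketch of this kind of loop follows. The loop variable receives each index of the array in turn, so the stored value is reached as theArray[i]. Because the traversal order is not guaranteed, the output is piped through sort to make it repeatable; the array contents are invented for the example.

```shell
# Visit every element; i holds the index, theArray[i] the stored value.
loop_demo=$(awk 'BEGIN {
    theArray["one"] = "uno"
    theArray["two"] = "dos"
    theArray["three"] = "tres"
    for (i in theArray) {
        print i " -> " theArray[i]
    }
}' | sort)
echo "$loop_demo"
```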

To remove an element from an array, simply use the delete statement. For example, delete theArray["word"] will remove the element whose index is "word" from theArray.
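The effect of delete can be confirmed with the in operator, as in this sketch. The second element, "other", is an addition for this example and shows that the rest of the array is left alone.

```shell
# Remove one element and check which indices remain.
delete_demo=$(awk 'BEGIN {
    theArray["word"] = 1
    theArray["other"] = 2
    delete theArray["word"]
    if ("word" in theArray) {
        print "word still present"
    } else {
        print "word removed"
    }
    if ("other" in theArray) {
        print "other still present"
    }
}')
echo "$delete_demo"
```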

With associative arrays, you can quickly build powerful applications without concern for the traditional overhead of declaring the array, allocating the memory, or searching for an item in the array. And size is not a factor—the following gawk program easily read and stored all 45,101 words from the file /usr/dict/words into an associative array (in this case, using the number of the current line as the array index):

{ words[NR] = $1 }
END { print NR " words read" }
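Since /usr/dict/words is not present on every system, the same program can be tried on a few words supplied on standard input. This is only a sketch: the three sample words are made up, and the extra print in the END block is an addition showing that the stored words can be retrieved by line number afterwards.

```shell
# Same program as above, fed three sample words instead of /usr/dict/words.
words_demo=$(printf 'aardvark\nabacus\nabalone\n' | awk '
{ words[NR] = $1 }
END {
    print NR " words read"
    print "last word: " words[NR]    # look up the word stored for the final line
}')
echo "$words_demo"
```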

Such a task would be much more involved in C, as you would need to decide how to store all the words: an array declared with a size sufficient for all 45,101 character strings, a linked list, or a binary tree? You may argue that with C you are free to choose a data structure which will provide much more efficient memory allocation and faster access speed than is possible with an associative array. While this may be true, it does not tell the whole story: it will certainly take you some time to write and test this C program (and very likely, more time to debug it). The power of the associative arrays and the simple, transparent memory management built into gawk mean that you are free from dealing with such concerns. Just tell gawk what you want and it handles much of the hard work behind the scenes.



How Slow??

You say gawk is slower than Perl. Do you know how much slower? Are there any benchmarks? I've heard that there is an AWK compiler. Do you know anything about it?