Introduction to Gawk

For many simple programming problems, awk is an excellent solution. Let Ian Gordon show you how to make your life easier.

Main Input Loop

Work in almost any programming language and you will have to write code to get the names of any files from the command line, open these files, and read their contents. For most file access, gawk lets you skip these steps entirely. If you pass one or more file names on the command line, then after executing the code in the BEGIN block (if present), gawk will automatically get each name from the command line, open the file, read its contents line by line, try to match any patterns you have defined against these lines, close the file when it is finished, and move on to the next file listed. If the input is coming from standard input (i.e., you are piping the output of another program to your gawk program), the input process is equally transparent. If you find that you need to handle file input in some different manner, gawk provides you with all the tools necessary to do this. But for most of the file handling you will need, it is better to let gawk's input loop do the work for you.
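To see the input loop at work, here is a minimal sketch run from the shell. The two data files and their contents are made up for the illustration, and FILENAME (a built-in variable holding the name of the file currently being read) is not otherwise used in this article. The first line is only a convenience that falls back to plain awk if gawk is not installed.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk
# (only standard awk features are used here).
AWK=$(command -v gawk || command -v awk)

# Two throwaway data files (hypothetical sample data).
tmpdir=$(mktemp -d)
printf 'Linux is fun\nso is awk\n' > "$tmpdir/a.txt"
printf 'one more Linux line\n' > "$tmpdir/b.txt"

# BEGIN runs before any input is read, the /Linux/ rule is tried
# against every line of every file, and END runs after the last file.
loop_out=$(cd "$tmpdir" && "$AWK" '
BEGIN   { print "starting" }
/Linux/ { print FILENAME ": " $0 }
END     { print NR " lines read" }
' a.txt b.txt)

echo "$loop_out"
rm -r "$tmpdir"
```

Notice that the program never opens, reads or closes a file itself; it only says what to do with each line.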

Running a gawk Program

Now that we have seen how a gawk program works, the next step is to see how to make your program run. With gawk on Linux, we have three ways to do this. For those truly quick-and-dirty tasks, an entire gawk program can be written and executed on the command line, although this is really only practical for very small programs. Using our simple example from above, we can run it with the command:

gawk '/Linux/ {print}' file.txt

When running a gawk script from the command line, you must enclose the awk statements in single quotes and list any data files after the closing quote. If you need to use more than one gawk statement in an action block, simply separate the statements with semicolons. For example, if you wanted to print each line that contains “Linux” and keep a count of how many input lines contain the pattern /Linux/, you could write:

gawk '/Linux/{ print; count=count+1 }
END { print count " lines" }' file.txt

You can list any number of data files on the command line and gawk will automatically open and read them, looking for any lines which match the pattern defined.
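One detail worth knowing about the counting example: a variable that has never been assigned prints as an empty string, so if no line matches you get just " lines". The sketch below shows one common way around this, adding 0 to force a number. The sample input is invented, and count++ is simply shorthand for count=count+1.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk.
AWK=$(command -v gawk || command -v awk)

# Two of the three input lines contain "Linux".
match_out=$(printf 'Linux\nUnix\nGNU/Linux\n' |
    "$AWK" '/Linux/ { count++ } END { print count+0 " lines" }')

# No line matches here, so count is never assigned;
# count+0 makes it print as 0 instead of an empty string.
no_match_out=$(printf 'foo\nbar\n' |
    "$AWK" '/Linux/ { count++ } END { print count+0 " lines" }')

echo "$match_out"
echo "$no_match_out"
```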

You can also use your favourite editor to write your gawk program and pass the name of the file to gawk using the -f option to tell gawk to try to execute the contents of that file. (For convenience, I like to use the extension “.awk” on these files, although this is not necessary.) So if the file linux.awk contains the pattern-action block:

/Linux/ {
    count = count + 1
    print count " lines found."
}

It can be executed by the command:

gawk -f linux.awk file.txt anotherfile.txt

Under Linux (and other versions of Unix) there is another, easier way to run your gawk program—simply put the line

#!/usr/bin/gawk -f

at the top of the program to indicate the path to the gawk interpreter. Make the file executable using the chmod command (chmod +x linux.awk). Then we can execute the gawk program by typing its name and any parameters, for example ./linux.awk file.txt if the script is in the current directory. (Note: you will need to check the actual location of the gawk interpreter on your system and put this path in the first line.)
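As a rough sketch of these steps, the shell commands below look up where the interpreter actually lives, write that path into the first line of a small script, make it executable and run it. The script body and data file are invented for the example, and it assumes the temporary directory allows programs to be executed from it.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk.
AWK=$(command -v gawk || command -v awk)

tmpdir=$(mktemp -d)
script="$tmpdir/linux.awk"

# Put the interpreter's real path in the #! line instead of guessing it.
{
    printf '#!%s -f\n' "$AWK"
    printf '/Linux/ { count++ }\n'
    printf 'END { print count+0 " lines found." }\n'
} > "$script"
chmod +x "$script"

# Hypothetical data file, then run the script by name.
printf 'Linux\nUnix\nGNU/Linux\n' > "$tmpdir/file.txt"
shebang_out=$("$script" "$tmpdir/file.txt")

echo "$shebang_out"
rm -r "$tmpdir"
```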

Input Fields

Another powerful and time-saving feature of gawk is its ability to automatically separate each input line into fields, each referred to by number. The entire line is referred to as $0 and each field within the current line is $1, $2, and so forth. So if the input line is “This is a line”, then:

$0 = This is a line
$1 = This
$2 = is
$3 = a
$4 = line

In this case the built-in variable NF, which contains the number of fields in the current input line, will be set to 4. If you try to refer to fields beyond NF, their value will be NULL (the empty string). Another built-in variable, NR, contains the total number of input lines that awk has read so far.
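A quick way to get a feel for NF and NR is to print them for a couple of lines. In this sketch the input is made up, $NF (the last field on the line) is an addition not used elsewhere in this article, and $(NF+1) deliberately refers to a field that does not exist.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk.
AWK=$(command -v gawk || command -v awk)

# For each line print its number, its field count, its last field,
# and (between brackets) the empty field just past the end.
fields_out=$(printf 'This is a line\nshort one\n' |
    "$AWK" '{ print NR ": NF=" NF ", last=" $NF ", beyond=[" $(NF+1) "]" }')

echo "$fields_out"
```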

As an example of the use of these fields, if you needed to take the contents of a file and print it out, one word per line (useful if you want to pipe each word in a file to a spell checker), simply run this script:

{ for (i=1;i<=NF;i++) print $i }

To separate the line into fields, gawk uses another built-in variable, FS (for “field separator”). The default value of FS is " " so fields are separated by white space: any number of consecutive spaces or tabs. Setting FS to any other single character means that fields are separated by exactly one occurrence of that character. So if there are two occurrences of that character in a row, gawk will present you with an empty field.
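The difference between the default separator and a single-character separator is easy to check for yourself. This is only a small sketch with invented input lines.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk.
AWK=$(command -v gawk || command -v awk)

# Default FS: runs of spaces and tabs count as one separator, and
# leading or trailing white space is ignored, so there are 3 fields.
ws_count=$(printf '  a   b\tc  \n' | "$AWK" '{ print NF }')

# FS=":" splits on every single colon, so "a::b" has 3 fields
# and the middle one is empty.
colon_count=$(printf 'a::b\n' | "$AWK" 'BEGIN { FS=":" } { print NF }')
empty_field=$(printf 'a::b\n' | "$AWK" 'BEGIN { FS=":" } { print "[" $2 "]" }')

echo "$ws_count $colon_count $empty_field"
```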

To get a better idea of how FS works with input lines, suppose we wanted to print the full names of all users listed in /etc/passwd, where the fields are separated by :. You would need to set FS=":". If the file names.awk contains the following gawk statements:

{
    FS=":"
    print $5
}

and you run it with gawk -f names.awk /etc/passwd, the program will separate each line into fields and print field 5, which in this case is the full name of the user. However, the line FS=":" will be executed for every line in the data file, which is hardly efficient. In gawk, a new value of FS also takes effect only from the next input line, so the first line of the file is still split on white space. If you are setting FS, it is usually best to make use of the BEGIN pattern, which is run only once, and rewrite our program as:

BEGIN {
    FS=":"
}
{
    print $5
}

Now the line FS=":" will be executed only once, before gawk starts to read the file /etc/passwd.
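If all you need is a different field separator, awk's standard -F command-line option is an alternative to a BEGIN block. It is not covered elsewhere in this article, so treat the sketch below as a side note. The two passwd-style lines are invented sample data, not read from a real /etc/passwd.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk.
AWK=$(command -v gawk || command -v awk)

# Hypothetical passwd-style entries.
sample='root:x:0:0:Super User:/root:/bin/bash
ian:x:1000:1000:Ian Example:/home/ian:/bin/csh'

# Setting FS in BEGIN ...
begin_out=$(printf '%s\n' "$sample" | "$AWK" 'BEGIN { FS=":" } { print $5 }')

# ... and setting it with -F on the command line give the same result.
flag_out=$(printf '%s\n' "$sample" | "$AWK" -F: '{ print $5 }')

echo "$begin_out"
```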

This automatic splitting of input lines into fields can be used to make patterns more powerful by allowing you to restrict the pattern matching to a single field. Still using /etc/passwd as an example, if you wanted to see the full name of all users on your Linux system (field 5 of /etc/passwd) who prefer to use csh rather than bash as their chosen shell (field 7 of /etc/passwd), you could run the following gawk program:

# (in awk, anything after the # is a comment)
# change the field separator so we can separate
# each line of the file /etc/passwd and access
# the name and shell fields
BEGIN { FS=":" }
$7 ~ /csh/ {print $5}

The gawk operator ~ means “matches”, so we are testing whether the contents of the seventh field match csh. If a match is found, the action block will be executed and the name will be printed. Remember that since patterns match substrings, this will also print the names of tcsh users. If a particular input line does not contain a seventh field, no problem: no match will be found for this pattern. Similarly, the pattern $7 !~ /bash/ will run its action block if the contents of the seventh field do not match the pattern bash. (Unlike the match operator, this pattern will match if $7 does not exist in the current input line. Recall that if we try to access a field beyond NF, its value will be NULL, the empty string, and the empty string does not match /bash/, so the action block for this pattern will be executed.)
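Here is a small sketch of the three cases just described, run against invented passwd-style lines rather than a real /etc/passwd. The anchored pattern /\/csh$/ is one possible way (an addition, not from the program above) to pick out csh users without also catching tcsh.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk.
AWK=$(command -v gawk || command -v awk)

# Hypothetical entries; the last line is deliberately too short.
sample='root:x:0:0:Super User:/root:/bin/bash
carol:x:1000:1000:Carol Example:/home/carol:/bin/csh
ted:x:1001:1001:Ted Example:/home/ted:/bin/tcsh
broken:x'

# Substring match: catches both csh and tcsh.
loose=$(printf '%s\n' "$sample" | "$AWK" -F: '$7 ~ /csh/ { print $5 }')

# Anchored match: the field must end in "/csh".
strict=$(printf '%s\n' "$sample" | "$AWK" -F: '$7 ~ /\/csh$/ { print $5 }')

# !~ also succeeds on the short line, where $7 is empty.
not_bash=$(printf '%s\n' "$sample" |
    "$AWK" -F: '$7 !~ /bash/ { n++ } END { print n+0 }')

echo "$loose"
echo "$strict"
echo "$not_bash"
```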

To further demonstrate the power of fields and pattern matching, let's go back to the problem of dealing with case sensitivity in pattern matching. By using a built-in function, toupper() or tolower(), we can change the case of all or selected parts of the input line. Suppose we have a data file containing names (the first field) and phone numbers (the second field), but some names are all lower case, some are all upper case and some are mixed. We could simplify the matching by modifying the pattern to:

toupper($1) ~ /LINUX/ {print $0}

This will cause the name in field 1 to be converted to upper case before awk tries to match it against the pattern. No other parts of the input line will be compared against the pattern.
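To try this out without a real data file, the sketch below feeds a few made-up name and phone number lines to the same pattern. The last line has "linux" in the second field only, so it should not be printed.

```shell
# Assumption: use gawk if it is installed, otherwise the system awk.
AWK=$(command -v gawk || command -v awk)

# Hypothetical data: name in field 1, phone number in field 2.
upper_out=$(printf 'linux 555-0100\nLINUX 555-0101\nLinux 555-0102\nlinus 555-0103\nbob linux\n' |
    "$AWK" 'toupper($1) ~ /LINUX/ { print $0 }')

echo "$upper_out"
```

The lines themselves are printed unchanged; toupper() only affects the copy of field 1 used for the comparison.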




How Slow??


You say gawk is slower than Perl. Do you know how much slower? Are there any benchmarks? I've heard that there is an AWK compiler. Do you know anything about it?