Introduction to Gawk
Work in almost any programming language and you will have to write code to get the names of any files from the command line, open these files, and read their contents. For most file access, gawk lets you skip these steps entirely. If you pass one or more file names on the command line, then after executing the code in the BEGIN block (if present), gawk will automatically get each name from the command line, open the file, read its contents line by line, try to match any pattern you have defined against these lines, close the file when it is finished, and move on to the next file listed. If the input is coming from standard input (i.e., you are piping the output of another program to your gawk program), the input process is equally transparent. However, if you find that you need to handle this file input in some different manner, gawk provides you with all the tools necessary to do this. But for most of the file handling you will need, it is better to let gawk's input loop do the work for you.
Now that we have seen how a gawk program works, the next step is to see how to make your program run. With gawk on Linux, we have three ways to do this. For those truly quick-and-dirty tasks, an entire gawk program can be written and executed on the command line, although this is really only practical for very small programs. Using our simple example from above, we can run it with the command:
gawk '/Linux/ {print}' file.txt
When running a gawk script from the command line, you must enclose the awk statements in single quotes and list any data files after the closing quote. If you need to use more than one gawk statement in an action block, simply separate the statements with semicolons. For example, if you wanted to print each line that contains “Linux” and keep a count of how many input lines contain the pattern /Linux/, you could write:
gawk '/Linux/{ print; count=count+1 }
END { print count " lines" }' file.txt
You can list any number of data files on the command line and gawk will automatically open and read them, looking for any lines which match the pattern defined.
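As a rough illustration of the input loop handling several files, here is a small sketch. The sample files a.txt and b.txt are invented for the example, FILENAME is a standard awk built-in variable that this article has not introduced, and the first line simply falls back to plain awk if gawk is not installed.

```shell
# Prefer gawk; fall back to any POSIX awk so the sketch still runs.
AWK=$(command -v gawk || command -v awk)

# Hypothetical sample data, created only for this illustration.
dir=$(mktemp -d)
cd "$dir"
printf 'Linux is free\nBSD is free too\n' > a.txt
printf 'I run Linux\nLinux Journal\nnothing here\n' > b.txt

# Both files pass through the same implicit input loop.
# FILENAME holds the name of the file currently being read.
result=$("$AWK" '/Linux/ { print FILENAME ": " $0; count = count + 1 }
END { print count " lines" }' a.txt b.txt)
echo "$result"

# The same program reading from standard input instead of named files.
piped=$(cat a.txt b.txt | "$AWK" '/Linux/ { count = count + 1 }
END { print count " lines" }')
echo "$piped"
```

The count carries across files because gawk runs one program over all of the input, whichever way the input arrives.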
You can also use your favourite editor to write your gawk program and pass the name of the file to gawk using the -f option to tell gawk to try to execute the contents of that file. (For convenience, I like to use the extension “.awk” on these files, although this is not necessary.) So if the file linux.awk contains the pattern-action block:
/Linux/ {
    print
    count = count + 1
}
END {
    # note the leading space, so the number and the text do not run together
    print count " lines found."
}
It can be executed by the command:
gawk -f linux.awk file.txt anotherfile.txt
Under Linux (and other versions of Unix) there is another, easier way to run your gawk program—simply put the line
#!/usr/bin/gawk -f
at the top of the program to indicate the path to the gawk interpreter. Make the file executable using the chmod command: chmod +x linux.awk. Then we can execute the gawk program by typing its name (as ./linux.awk, unless its directory is in your PATH) followed by any parameters. (Note: you will need to check the actual location of the gawk interpreter on your system and put this path in the first line.)
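The steps above might be put together roughly as follows. This is only a sketch: rather than assuming /usr/bin/gawk, it looks up the interpreter with command -v (falling back to plain awk if gawk is missing), and file.txt is sample data made up for the example.

```shell
# Look up the interpreter path instead of assuming /usr/bin/gawk;
# fall back to plain awk if gawk is not installed.
awk_path=$(command -v gawk || command -v awk)

dir=$(mktemp -d)
cd "$dir"
# Hypothetical sample data for the illustration.
printf 'Linux rocks\nsomething else\nmore Linux\n' > file.txt

# Write the script with the discovered path on its first line.
{
  printf '#!%s -f\n' "$awk_path"
  printf '/Linux/ { print; count = count + 1 }\n'
  printf 'END { print count " lines found." }\n'
} > linux.awk

# Make it executable, then run it by name with a data file as parameter.
chmod +x linux.awk
result=$(./linux.awk file.txt)
echo "$result"
```

The ./ prefix is needed because the current directory is usually not in the shell's search path.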
Another powerful and time-saving feature of gawk is its ability to automatically separate each input line into fields, each referred to by number. The entire line is referred to as $0 and each field within the current line is $1, $2, and so forth. So if the input line is “This is a line”,
$0 = This is a line
$1 = This
$2 = is
$3 = a
$4 = line
Likewise, the built-in variable NF, which contains the number of fields in the current input line, will be set to 4. If you try to refer to fields beyond NF, their value will be the null (empty) string. Another built-in variable, NR, contains the total number of input lines that awk has read so far.
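To see these variables in action, here is a minimal sketch using two made-up input lines. The square brackets around $4 are there only to make an empty value visible, and the first line falls back to plain awk if gawk is not installed.

```shell
# Prefer gawk; fall back to any POSIX awk so the sketch still runs.
AWK=$(command -v gawk || command -v awk)

# For each line, show the line number, the field count, the first field,
# and the fourth field (which does not exist on the second line).
result=$(printf 'This is a line\nshort one\n' |
  "$AWK" '{ print NR ": NF=" NF ", $1=" $1 ", $4=[" $4 "]" }')
echo "$result"
```

On the second line NF is 2, so $4 lies beyond NF and comes back empty rather than causing an error.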
As an example of the use of these fields, if you needed to take the contents of a file and print it out, one word per line (useful if you want to pipe each word in a file to a spell checker), simply run this script:
{ for (i=1;i<=NF;i++) print $i }
To separate the line into fields, gawk uses another built-in variable, FS (for “field separator”). The default value of FS is " ", so fields are separated by white space: any number of consecutive spaces or tabs. Setting FS to any other single character means that fields are separated by exactly one occurrence of that character. So if there are two occurrences of that character in a row, gawk will present you with an empty field.
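The difference between the default separator and a single-character separator can be sketched like this. The input strings are invented for the example, FS is set in a BEGIN block so that it is in place before any input is read, and the first line falls back to plain awk if gawk is not installed.

```shell
# Prefer gawk; fall back to any POSIX awk so the sketch still runs.
AWK=$(command -v gawk || command -v awk)

# Default FS: runs of spaces or tabs count as a single separator.
default_nf=$(printf 'a   b\t\tc\n' | "$AWK" '{ print NF }')

# FS set to a comma: two commas in a row enclose an empty field.
comma_out=$(printf 'a,,c\n' |
  "$AWK" 'BEGIN { FS = "," } { print NF ": [" $2 "]" }')

echo "default FS: $default_nf fields"
echo "comma FS:   $comma_out"
```

Both inputs yield three fields, but with the comma separator the second field is empty.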
To get a better idea of how FS works with input lines, suppose we wanted to print the full names of all users listed in /etc/passwd, where the fields are separated by :. You would need to set FS=":". If the file names.awk contains the following gawk statements:
{
FS=":"
print $5
}
and you run it with gawk -f names.awk /etc/passwd, the program will print field 5 of each line, which in this case is the full name of the user. However, the line FS=":" will be executed for each line in the data file, which is hardly efficient. Worse, a new value of FS takes effect only when the next line is read, so the first line of the file is still split on white space and its field 5 is not the user's name. If you are setting FS, it is usually best to make use of the BEGIN pattern, which is run only once, and rewrite our program as:
BEGIN {
FS=":"
}
{
print $5
}
Now the line FS=":" will be executed only once, before gawk starts to read the file /etc/passwd.
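The two versions can be compared side by side with a small sketch. The passwd-style records below are invented rather than read from the real /etc/passwd, and the first line falls back to plain awk if gawk is not installed. With gawk, the first version should also miss the name on the first record, because a newly assigned FS applies only from the next record onward.

```shell
# Prefer gawk; fall back to any POSIX awk so the sketch still runs.
AWK=$(command -v gawk || command -v awk)

# Hypothetical passwd-style records, invented for this illustration.
dir=$(mktemp -d)
printf 'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash\n' > "$dir/passwd"
printf 'bob:x:1001:1001:Bob Example:/home/bob:/bin/csh\n' >> "$dir/passwd"

# FS assigned in the main action: it is re-assigned for every record.
in_action=$("$AWK" '{ FS = ":"; print $5 }' "$dir/passwd")

# FS assigned once in BEGIN, before any input is read.
in_begin=$("$AWK" 'BEGIN { FS = ":" } { print $5 }' "$dir/passwd")

echo "in action: $in_action"
echo "in BEGIN:  $in_begin"
```

Only the BEGIN version reliably prints both full names.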
This automatic splitting of input lines into fields can be used to make patterns more powerful by allowing you to restrict the pattern matching to a single field. Still using /etc/passwd as an example, if you wanted to see the full name of all users on your Linux system (field 5 of /etc/passwd) who prefer to use csh rather than bash as their chosen shell (field 7 of /etc/passwd), you could run the following gawk program:
# (in awk, anything after the # is a comment)
# change the field separator so we can separate
# each line of the file /etc/passwd and access
# the name and shell fields
BEGIN { FS=":" }
$7 ~ /csh/ {print $5}
The gawk operator ~ means “matches”, so we are testing whether the contents of the seventh field match csh. If a match is found, the action block is executed and the name is printed. Remember that since patterns match substrings, this will also print the names of tcsh users. If a particular input line does not contain a seventh field, no problem: no match will be found for this pattern. Similarly, the pattern $7 !~ /bash/ will run its action block if the contents of the seventh field do not match the pattern bash. (Unlike the match operator, this pattern will match if $7 does not exist in the current input line. Recall that if we try to access a field beyond NF, its value is the null (empty) string, which does not match /bash/, so the action block for this pattern will be executed.)
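Here is a sketch of both operators at work. The records are made up (the last one deliberately has no shell field), the first line falls back to plain awk if gawk is not installed, and the third command, which anchors the pattern to exclude tcsh, is just one possible refinement that the article itself does not use.

```shell
# Prefer gawk; fall back to any POSIX awk so the sketch still runs.
AWK=$(command -v gawk || command -v awk)

# Hypothetical passwd-style records; the last one has only five fields.
dir=$(mktemp -d)
cat > "$dir/passwd" <<'EOF'
alice:x:1000:1000:Alice Example:/home/alice:/bin/bash
bob:x:1001:1001:Bob Example:/home/bob:/bin/csh
carol:x:1002:1002:Carol Example:/home/carol:/bin/tcsh
dave:x:1003:1003:Dave Example
EOF

# Substring match: catches both csh and tcsh, skips the short record.
csh_users=$("$AWK" 'BEGIN { FS = ":" } $7 ~ /csh/ { print $5 }' "$dir/passwd")

# Negated match: also selects the record with no seventh field.
not_bash=$("$AWK" 'BEGIN { FS = ":" } $7 !~ /bash/ { print $5 }' "$dir/passwd")

# One way to exclude tcsh: require a slash before csh and the end after it.
only_csh=$("$AWK" 'BEGIN { FS = ":" } $7 ~ /\/csh$/ { print $5 }' "$dir/passwd")

printf 'csh match:\n%s\n\nnot bash:\n%s\n\ncsh only:\n%s\n' \
  "$csh_users" "$not_bash" "$only_csh"
```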
To further demonstrate the power of fields and pattern matching, let's go back to the problem of dealing with case sensitivity in pattern matching. By using a built-in function, toupper() or tolower(), we can change the case of all or selected parts of the input line. Suppose we have a data file containing names (the first field) and phone numbers (the second field), but some names are all lower case, some are all upper case and some are mixed. We could simplify the matching by modifying the pattern to:
toupper($1) ~ /LINUX/ {print $0}
This will cause the name in field 1 to be converted to upper case before awk tries to match it against the pattern. No other parts of the input line will be compared against the pattern.
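A quick sketch with made-up names and phone numbers shows the effect. The last record has LINUX only in its second field, and the first line of the script falls back to plain awk if gawk is not installed.

```shell
# Prefer gawk; fall back to any POSIX awk so the sketch still runs.
AWK=$(command -v gawk || command -v awk)

# Hypothetical name/phone records in mixed case.
result=$(printf 'linux 555-0100\nLinux 555-0101\nLINUX 555-0102\npenguin 555-LINUX\n' |
  "$AWK" 'toupper($1) ~ /LINUX/ { print $0 }')
echo "$result"
```

The first three records are printed in their original case, since toupper() returns a converted copy and leaves $1 itself alone. The last record is skipped because only field 1 is tested.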
Comments
How Slow??
Hi,
You say gawk is slower than Perl. Do you know how much slower? Are there any benchmarks? I've heard that there is an AWK compiler. Do you know anything about it?