An Introduction to awk

Not everyone learns or uses awk these days, so here's a quick review of what the language can do and some of its features.

The awk programming language often gets overlooked in favor of Perl, which is a more capable language. Out in the real world, however, awk is found even more ubiquitously than Perl. It also has a gentler learning curve than Perl does, and awk can be used almost everywhere in system monitoring scripts, where efficiency is key. This brief tutorial is designed to help you get started in awk programming.

The Basics

The awk language is a small, C-style language designed for the processing of regularly formatted text. This usually includes database dumps and system log files. It's built around regular expressions and pattern handling, much like Perl is. In fact, Perl is considered to be a grandchild of the awk language.

awk's funny name comes from the names of its original authors, Alfred V. Aho, Brian W. Kernighan and Peter J. Weinberger. Most of you probably recognize the Kernighan name; he co-wrote The C Programming Language with Dennis Ritchie and is a major force in the UNIX world.

Using awk in a One Liner

I began using awk to print specific fields in output. This worked surprisingly well, but the efficiency went through the floor when I wrote large scripts that took minutes to complete. Here, however, is an example of my early awk code:


ls -l /tmp/foobar | awk '{print $1"\t"$9}'

This code takes some input, such as this:


-rw-rw-rw-   1 root     root            1 Jul 14  1997 tmpmsg

and generates output like this:


-rw-rw-rw-      tmpmsg

As shown, the code outputs only the first and ninth fields from the original input, which is why awk is so popular for one-line data extraction. Now, let's move on to a full-fledged awk program.
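The same field extraction can also be written with the output field separator instead of hand-placed tab strings. The following is a minimal sketch rather than the author's code: it feeds awk a copy of the sample line above through printf so that it runs without /tmp/foobar, and the variable names line and out are only illustrative.

```shell
# Sample line in the same shape as the ls -l output shown above.
line='-rw-rw-rw-   1 root     root            1 Jul 14  1997 tmpmsg'

# With OFS set to a tab, the comma in the print statement inserts the tab,
# so the fields do not have to be glued together with "\t" by hand.
out=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "\t" } { print $1, $9 }')
printf '%s\n' "$out"
```

Setting the separator once in a BEGIN block tends to pay off as soon as more than two fields are printed.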

An awk Program Structure

One of my favorite things about awk is its amazing readability, especially as compared to Perl or Python. An awk program is a list of pattern-action rules, and a typical program has three parts: a BEGIN block, which is executed once before any input is read; a main loop, which is executed for every line of input; and an END block, which is executed after all of the input is read. All three parts are optional. It's quite intuitive, something I often say about awk.
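To make the three parts concrete, here is a small sketch that sums a column of numbers. The input values and the variable names total and out are made up for illustration; only the BEGIN, main and END structure comes from the description above.

```shell
out=$(printf '3\n4\n5\n' | awk '
BEGIN { print "start"; total = 0 }     # runs once, before any input is read
      { total += $1 }                  # runs for every line of input
END   { print "lines:", NR             # runs once, after all input is read
        print "total:", total }
')
printf '%s\n' "$out"
```

The rule in the middle has no pattern in front of its braces, which is what makes it apply to every input line.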

Here is a simple awk program that highlights some of the language's features. See if you can pick out what is happening before we dissect the code:


#!/usr/bin/awk -f
#
# check the sulog for failures..
# copyright 2001 (c) jose nazario
#
# works for Solaris, IRIX and HPUX 10.20
BEGIN {
  print "--- checking sulog"
  failed=0
  }
{
  if ($4 == "-") {
    print "failed su:\t"$6"\tat\t"$2"\t"$3
    failed=failed+1
    }
}
END {
  print "---------------------------------------"
  printf("\ttotal number of records:\t%d\n", NR)
  printf("\ttotal number of failed su's:\t%d\n",failed)
}

Have you figured it out yet? Would it help to know the format of a typical line in the input file--sulog, from, say, IRIX? Here's a typical pair of lines:


        SU 01/30 13:15 - ttyq1 jose-root
        SU 01/30 13:15 + ttyq1 jose-root

Now read the script again and see if you can figure it out. The BEGIN block sets everything up, printing out a header and initializing our one variable--in this case, failed--to zero. The main loop then reads each line of input--the sulog file, a log of su attempts--and compares field four against the minus sign. If they match, it means the attempt failed, so we increment the counter by one and note which attempt failed and when. At the end, final tallies are presented that show the total number of input lines as the number of records--NR, an internal awk variable--and the number of failed su attempts, as we noted. Output looks like this:


failed su:      jose-root       at      01/30   13:15
        ---------------------------------------
        total number of records:        272
        total number of failed su's:    73

You also should be able to see how printf works here, which is almost exactly the way printf works in C. In short, awk is a rather intuitive language.
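The test inside the main loop can also be written as a pattern in front of the action, which is the more common awk idiom. This is a sketch of that alternative, not a replacement for the script above; it reads the two sample sulog lines from printf instead of a real log file, and it shortens the summary to a single line.

```shell
out=$(printf 'SU 01/30 13:15 - ttyq1 jose-root\nSU 01/30 13:15 + ttyq1 jose-root\n' | awk '
# The pattern selects failed attempts, so no explicit if statement is needed.
$4 == "-" { print "failed su:\t" $6 "\tat\t" $2 "\t" $3; failed++ }
END       { printf("records: %d, failed: %d\n", NR, failed) }
')
printf '%s\n' "$out"
```

When rules like these are saved in a script file, awk reads the files named on the command line, or standard input if none are given, so running the script without a file argument simply waits for input.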

By default, the field separator is whitespace, but you can tweak that. I set it to be a colon in password files, for example. The following small script prints the users that have an ID of 0 (root equivalent) and the users that have no password:


#!/usr/bin/awk -f
BEGIN { FS=":" }
{
  if ($3 == 0) print $1
  if ($2 == "") print $1
}

Other awk internals you should know and use are "RS" for record separator, which defaults to a newline or \n; "OFS" for output field separator, which defaults to a single space; and "ORS" for output record separator, which defaults to a newline. All of these can be set within the script, of course.
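Here is a short sketch of FS and OFS working together. The account records are invented for illustration and are not taken from a real password file; the " | " separator is likewise an arbitrary choice to make the effect of OFS visible.

```shell
# Hypothetical passwd-style records, made up for this example.
sample='root:x:0:0:root:/root:/bin/sh
toor::0:0:spare root:/root:/bin/sh
guest::1001:1001:guest:/home/guest:/bin/sh
alice:x:1000:1000:alice:/home/alice:/bin/sh'

out=$(printf '%s\n' "$sample" | awk '
BEGIN    { FS = ":"; OFS = " | " }       # split input on colons, join output with " | "
$3 == 0  { print $1, "uid 0" }           # root-equivalent accounts
$2 == "" { print $1, "empty password" }  # accounts with no password field
')
printf '%s\n' "$out"
```

An account that matches both rules is reported twice, once per rule, because every rule is tried against every record.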

Regular Expressions

The awk language matches the regular expressions that you have come to know and love, and unlike grep, it lets you attach an arbitrary action to each match. For instance, I use the following awk search pattern to look for the presence of a likely exploit on Intel Linux systems:


#!/usr/bin/awk -f
{ if ($0 ~ /\x90/) print "exploit at line " NR }

It is awkward to make plain grep look for hex value 0x90, but 0x90 is popular in Intel exploits. It's the NOP instruction, which is used as padding in shellcode.

You can use awk, though, to look for hex values by using \xdd, where dd is the hex number to look for. gawk accepts this escape, but POSIX does not require it, so check other implementations before relying on it. You also can look for octal values by using \ddd, where ddd is the octal number. Regular expressions based on text work too.
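As a small illustration of the octal form, the sketch below looks for a tab character, which is octal 011. The two input lines are made up, and the tab is used here only because it is easy to generate and check; the same pattern syntax applies to other byte values.

```shell
# The second input line contains a tab (written as \t for printf).
out=$(printf 'plain line\nhas\ta tab\n' | awk '/\011/ { print "tab at line " NR }')
printf '%s\n' "$out"
```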

Random awk Bits

Random numbers in awk are readily generated, but there is an interesting caveat. The rand() function does exactly what you would expect it to--it returns a random number, in this case, from 0 up to but not including 1. You can scale it, of course, to get larger values. Here's some example code to show you how, as well as an interesting bit of behavior:


#!/usr/bin/awk -f
# Everything runs in BEGIN, so the script needs no input.
BEGIN {
  for (i = 1; i <= 10; i++)
    print rand()
}

Run that a couple of times, and you soon see a problem: the random numbers are hardly random--they repeat every time you run the code!

What's the problem? Well, we didn't seed the random number generator. Normally, we're used to our random number generator pulling entropy from a good source, such as, in Linux, /dev/random. However, awk doesn't do this. To really get random numbers, we should seed our random number generator. The improved code below does this:


#!/usr/bin/awk -f
# Everything runs in BEGIN, so the script needs no input.
BEGIN {
  srand()
  for (i = 1; i <= 10; i++)
    print rand()
}

The seeding of the random number generator in the BEGIN block is what does the trick. The function srand() can take an argument, and in the absence of one, the current date and time is used to seed the generator. Note that the same seed always produces the same "random" sequence.
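The sketch below illustrates both points: scaling rand() to whole numbers and the repeatability of a fixed seed. The shell function roll, the seed value 42 and the choice of a six-sided die are all assumptions made for this example.

```shell
# roll SEED: print five die rolls (integers from 1 to 6) for the given seed.
roll() {
  awk -v seed="$1" 'BEGIN {
    srand(seed)                            # explicit seed instead of the clock
    for (i = 1; i <= 5; i++)
      printf "%d ", int(rand() * 6) + 1    # scale [0,1) to 1..6
    print ""
  }'
}

first=$(roll 42)
second=$(roll 42)
printf '%s\n%s\n' "$first" "$second"
```

Both lines come out identical because the seed is the same. The actual numbers depend on the awk implementation, so none are shown here.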

Conclusion

This isn't the most detailed introduction to awk that you can find, but I hope it is clearer to you how to use awk in a program setting. Myself, I'm quite happy programming in awk, and I've got a lot more to learn. We haven't even touched on arrays, self-built functions or other complex language features. Suffice it to say, awk is hardly Perl's little brother.

Resources

Kernighan's home page contains a list of good awk books as well as the source for the "one true awk", aka nawk. The page also contains a host of other interesting links and information from Kernighan.

The standard awk implementation, nawk (for "new awk", as opposed to old awk, sometimes found as "oawk" for compatibility), is based on the POSIX awk definitions. It contains a few functions that were introduced by two other awk implementations, gawk and mawk. I usually keep this one around as nawk and use it to test the portability of my awk scripts. nawk usually is found on commercial UNIX machines, where I often don't have gawk installed.

The GNU project's awk, gawk, also is based on the POSIX awk standard, but it adds a significant number of useful features as well. These include command-line features such as "lint" checking and reversion to strict POSIX mode. My favorite features in gawk are the line breaks, using \, and the extended regular expressions. The gawk documentation has a complete discussion of GNU extensions to the awk language. This is also the standard awk version found on Linux and BSD systems.

sed & awk is perhaps the most popular book available on these two small languages, and it is highly regarded. It contains, among other things, a discussion of popular awk implementations--gawk, nawk, mawk--a great selection of functions and the usual O'Reilly readability. The awk Home Page lists several other books on the awk programming language, but this one remains my favorite.

Copyright (c) 2001, Jose Nazario. Originally published in Linux Gazette issue 67. Copyright (c) 2001, Specialized Systems Consultants, Inc.

______________________

Comments

help in awk

Gil12

I need help using awk to print out my name in all the different possible forms, like JOhn, John, joHn...
10x

Aargh.. can't figure this out...

Anonymous

Following is a script file snippet which is intended to replace a keyword with another string, which contains a numeric (line #) index into a keyword file of the following form:

keyword;replacetext
...
keyword_n;replacetext_n

It processes all *.htm files in a directory and creates equivalent *.htp files. Parsing and replacement work fine.

#!/bin/sh

INEXT="htm"
OUTEXT="htp"
...
ParseKeys()
{
  # $1 = KeyWordFile
  find . \( ! -name . -prune \) -type f -name "*.$INEXT" | while read FILE
  do
    awk 'BEGIN{ FS=";"}
    FNR==NR{ s[$1]=$2; next }
    {
      def=0;
      for( i in s ) {
        system("echo hello alldef: " def " i: " i " >>trash.txt")
        if( $0 ~ i ){
          system("echo hello def: " def " >>trash.txt")
          system("echo hello i: " i " >>trash.txt")
          system("echo hello s[i]: " s[i] " >>trash.txt")
          gsub(i,s[i])
        }
        def=def+1;
      }
      print $0
    }' $1 $FILE > temp
    # $FILE is form ./*.htm
    outfile=`echo "$FILE" | cut -f2 -d "/" | cut -f1 -d "."`
    echo "$outfile"
    mv -f temp $outfile.$OUTEXT
  done
  echo hello
}

The problem is that my line counter (def) for the keyfile is out of sync with the keyword (iterator - i) from "for( i in s )"

I get output like:

hello alldef: 0 i: blue_pill

which should be:

hello alldef: 32 i: blue_pill

since blue_pill is on the 32nd line.

does the "for( i in s )" not start iterating at first line, or am I seriously missing something? Is there a better way to do this?

Thanks for looking...

Bill

An awk script is required

Anonymous

Hi,

I need a script which can compare the names from a file "Names" and give the matching output from the /etc/passwd file.

EG:

There is a file named "Names", which contains some user IDs. The script must compare the user IDs from the file with the /etc/passwd file and print only the names which were present in the file.

didn't help

Anonymous

hi,

I wanted to write a script that does the same thing, that is, monitoring sulog, but this one didn't work; it hangs forever.

please let me know if there's anything I am doing wrong.

I am new to unix, so please also explain everything in detail.

thank you in advance,

nomsa

Big AWK program

Anonymous

For those that wanted a larger actual AWK program:

The Linux Documentation Project has an AWK program to parse Apache web logs to determine actual web statistics and order of reading. There is a lot more to it, details (including manual and sample runs) are here

The actual AWK program is here.

Fine

cocozz

Hey fine tutorial there ;-) I'm starting to learn sed&awk and I'm loving them more and more, very useful.

Re: Tutorial-Search

Viktor Chuballa

Of course I can search on the net...

> http://www.google.com/search?q=tutorial+awk
> Google results 1-10 of about 23,600 for tutorial awk.

Re: Tutorial

Peer Schwarzer

However 23.000 tutorials is too much... (and Google gave many Perl,
C, and other tutorials for 'tutorial+awk' search..., hmm).

I need a _recommended tutorial_ from people who use and know AWK...

Many so called tutorials jump from the elementary examples to the
most complicated AWK examples, and in-between is a vast, empty
knowledge field... :(

Ah, the memories...

Roger Rohrbach

I just moved, and unpacked a box of technical books I'd kept solely for sentimental reasons. Two of the books were the first edition of Winston and Horn's LISP and Kernighan's The Awk Programming Language. It reminded me of how I used to love playing with offbeat languages (anyone remember Icon?).

Oh, wait. I still love playing with offbeat languages.

Anyway: I once wrote a Lisp interpreter in (old) awk. This should illustrate that it is indeed a powerful little language.

patterns?

Hawhill

First: Nice introduction, it at least points out the things awk can do. In addition, I definitely recommend gawk's man page (man awk): it's very well written.

But I also have a criticism of this article: it's wrong about the "parts an awk program consists of". In fact, an AWK program consists of patterns (of which only BEGIN/END and the empty pattern that matches all lines are mentioned here) with their corresponding code on the one hand, and functions on the other hand. In fact, those "if ($3 == 0) print ..." lines don't look very awk'ish to me. It is more common to write single matching patterns like "($3 == 0) { print ... }" instead of combining them in an empty catch-all pattern.

Thanks for this, very

Anonymous

Thanks for this, very welcome and needed. awk is one of the great underrated resources in Linux. It takes a while, but it is so powerful and so economical it's amazing. The number of times you have to do string manipulation which is beyond the use of regular expressions in a text editor is many, and with awk you just get it done in a flash, without breaking out some heavy duty program writing stuff.

Re: AWK-Task

Philipp John

Up to now, the best tutorial remains the book:

The AWK programming language
Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger

Other than that, some comprehensive docs:

http://www.softlab.ece.ntua.gr/facilities/documentation/unix/docs/

Useful but..

Varun Khaneja

Hi there,

I found the tutorial useful but I think it was way too basic. Could you yourself suggest some good material.. maybe your source of reference.

Thanks.

For me it was not too

Kaffee

For me it was not too basic... ;)

More Linux Journal Articles...

Jerry Siebe

I knew I had read about awk on Linux Journal before. :D I didn't know much about it at the time, then reading a previous article here led me to learn a lot more about it. I find awk is a quick and easy tool for some tasks once you're familiar with it.

Introduction to Gawk
http://www.linuxjournal.com/node/1156

The awk Utility
http://www.linuxjournal.com/node/2533

Network Administration with AWK
http://www.linuxjournal.com/article/3132

Real Programming with AWK
http://www.linuxjournal.com/article/6677

Quick and Dirty Data Extraction in AWK
http://www.linuxjournal.com/article/8627

awk tutorial

Anonymous

Hi Varun,

Have you seen this?

http://www.faqs.org/docs/air/tsawk.html

awk references

Anonymous

As someone already mentioned, the gawk man page is a great resource. There are a few good books available, too. The ones I like are, in no particular order -
