Real Programming with AWK

by Alan Bradley

The purpose of this article is to demonstrate that real programs can be and are written using the AWK programming language. When I first started working with UNIX, going back 10 years now, no one I knew used anything but shell scripts or AWK to write programs. Even now, I still frequently use AWK to create backup scripts and the like. Often people think that AWK can be used for only basic text manipulation tasks, such as cutting a column from a file, or for system administration tasks, such as killing a process. This is not the case; AWK is effective for a number of programming applications. The examples in this article make use of the standard GNU AWK that ships with Red Hat Linux.

Let's use a scenario in which your boss comes to you complaining that the accounting department needs a program that prints invoices using the Sales and Client files from the accounting system. The problem is the C programmer and the Perl guru both are out sick.

"Not a problem", you say, "I can do that for you."

"What?", says the boss. "You don't know C or Perl".

"Not to worry, I know AWK."

The boss looks on disbelieving, but being in a bit of a fix he hands you the specs and says, "Surprise me".

Take a look at the specs first. We have a sample Sales file (Listing 1), which contains columns for Stock Code, Client Code, Quantity of Item purchased, Item Description and Unit Price of Item. We also have a sample from the Client file (Listing 2), with the columns Client Name, Client Code, Address Line 1, Address Line 2, Address Line 3, Postal/Zip Code and Telephone Number. Notice that the Sales file records use commas as the field delimiter, while the Client file uses a tilde as the field delimiter. The specs ask for invoices to be printed with the A P Building Supplies address in the top right-hand corner, followed by the Client name and address on the left-hand side of the invoice. Underneath that we want a heading consisting of the Stock Code, Item Description, Quantity, Unit Price and Total. Then, for each item a customer bought, we want a separate line with the item's Stock Code, Description, Quantity Purchased, Unit Price and Total. At the end of the invoice we want a grand total in the Total column. The Sales file and the Client file are provided sorted in order of Client Code.

Listing 1. Sample from the Sales File
S2,1362,5,Hammer - Ball Pein,2
S4,1372,3,Pliers,2
S3,1372,1,Screwdriver - Phillips,1.5
S6,1372,2,Ruler,1
S1,1380,1,Acrylic Paint - 5L,20

Listing 2. Sample from the Client File
John Penguin~1351~10 Linux Lane~Linuxville~~1103~(012)1345451
Ray Ram~1362~11 Hard Drive~Platter Heights~Suburbia~1497~(014)2352345
Cliff Keys~1371~2 Dump Lane~Backupville~~3546~(042)2345165
Gill Bates~1372~7 XWindows Way~Dizzy Heights~Richville~7945~(085)3003021
Jim Smith~1380~25 USB Road~Port Harbour~~7407~(022)5473486

The first thing to do is think about the logic we are going to use. The idea is to read a Sales file record and then a Client file record. If the Client Code in the Sales file record does not match the Client Code in the Client file record, then read another Client file record. Repeat this until the Sales file Client Code equals the Client file Client Code. Once they match, write the invoice headers to the invoice file and the first detail line, then read the next Sales file record. If the Client Codes still match, write another detail line and read the next Sales file record. If they don't match, write the invoice total and read the next Client file record. Repeat this until all lines in the Sales file have been processed.

Normally when using AWK to write a program like this, you would write the entire program within AWK's BEGIN { } block, which runs before any input is read. This allows you to control when the files are accessed and read, to access multiple files for reading if necessary, to specify the rules for reading them and so on.

We want to make it possible for the user to run the program and, at run time, specify the names of the Sales file, the Client file and the invoice file to be written out. Also, if the user runs the program without the required parameters, we want to print a message that gives the user the correct syntax.

We use AWK's built-in variables ARGC and ARGV to determine the parameters being passed. ARGC holds the number of elements in ARGV, while ARGV is an array that contains the actual parameters passed, starting from ARGV[0]. Element 0 contains the name of the command, awk, itself and is included in the count, so we are interested in elements 1 to ARGC-1. Try the following for a basic example of how ARGC and ARGV work.

#!/bin/awk -f
BEGIN {
       print ARGC;
       for (f = 0 ; f < ARGC ; f++)
           print ARGV[f];
      }

Try running this program with different parameters, and take a look at the output. You can run the above program by saving it in a file called testing.sh and making it executable with chmod +x testing.sh. Then run it with a few parameters, e.g. ./testing.sh 1 q 2 w.

Throughout the program we are going to make use of functions to perform various routines. You have the option of passing values to functions when they are called if necessary. The basic syntax to define a function within AWK is as follows:

function funcname(optional arguments)
{
 statements
 within the function
 return x
}

The arguments would be a comma-separated list of variable names. If we are not going to pass arguments to the function, the parentheses () still need to be present. When a user-defined function is called, there must be no space between the function name and the opening parenthesis, and it is good practice to follow the same rule in the definition. One also can use the return x statement from within a function to return a particular value, in this case x. The use of return within the function also is optional.
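As a minimal sketch of this syntax, the following defines a function that takes two arguments and returns a value. The function name LineTotal and its parameter names are hypothetical, chosen only for illustration; they are not part of MkInvoice.sh. The AWK program is wrapped in a small shell function so it can be tried straight from the command line.

```shell
# LineTotal() and its parameter names are hypothetical, for illustration only.
line_total() {
    awk '
    function LineTotal(qty, price)
    {
        return qty * price      # the value handed back to the caller
    }
    BEGIN {
        # No space between the function name and "(" in the call.
        printf "%.2f\n", LineTotal(ARGV[1], ARGV[2])
    }' "$1" "$2"
}

line_total 3 1.5    # prints 4.50
```

Because the program consists of a BEGIN block only, AWK does not try to open the two arguments as input files; they are simply available in ARGV.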

We are going to call our program MkInvoice.sh. The first step is to create a function called Usage() to explain to the user how the program should be called.

function Usage()
{
 print "Usage: MkInvoice <SalesFile> <ClientFile> <InvoiceFile>";
 exit(1);
}

In the main body of the program we can check how many arguments were passed. Because ARGC also counts ARGV[0], three file names on the command line give ARGC a value of 4. If that is the case, we assume they are the names of the three files needed to run the program and continue. If not, we call the Usage function, which prints the correct syntax and exits the program with error code 1. The error code you want to exit with is given after the exit; the parentheses around it are optional.
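The check might look something like the sketch below. It assumes the test is written as ARGC != 4 (three file names plus ARGV[0]); the file names used in the two calls are placeholders and are never opened.

```shell
# A sketch of the argument check; 4 is three file names plus ARGV[0].
check_args() {
    awk '
    function Usage()
    {
        print "Usage: MkInvoice <SalesFile> <ClientFile> <InvoiceFile>"
        exit 1
    }
    BEGIN {
        if (ARGC != 4)
            Usage()
        print "arguments look fine"
    }' "$@"
}

# Too few arguments: the usage message is printed and awk exits with status 1.
check_args sales.txt || echo "exit status: $?"

# Three arguments: the check passes.
check_args sales.txt clients.txt invoices.txt
```

Note that exit works from inside a function as well, so nothing after the call to Usage() runs when the argument count is wrong.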

We are going to create functions for reading the Sales file and the Client file and for storing the contents of each record in variables after it is read. These two functions are named ReadSales() and ReadClient().

Remember that the specifications mentioned the field separator was a comma in the case of the Sales file and a tilde in the case of the Client file. Before we read each file, we need to tell AWK what the field separator is. This is done by using the built-in awk variable FS.

FS = ","    # The separator of the Sales file

FS = "~"    # The separator of the Client file

Setting this before a file is read tells AWK how to split the fields of the record being read correctly. If both files used the same field separator, you would need to set FS only once in your program. Because our files use different field separators, we set FS each time before reading from either the Sales or the Client file.

When a record is read in AWK, each field is assigned to a variable, starting with $1. If one looks at the first line of the Sales file in Listing 1, $1 would have the value of S2, $2 would have the value of 1362, $3 would have the value of 5 and so on. A built-in AWK variable called NF contains the number of fields in the record just read. In this case NF would be 5. The value of $0 would be the entire record, that is, S2,1362,5,Hammer - Ball Pein,2.
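To see the field variables in action, here is a small sketch that splits a single Sales record and prints each field. It uses the -F, command-line option, which has the same effect as setting FS = "," before the record is read; the shell function name show_fields is just a convenience for this example.

```shell
# Split one Sales record on commas and list NF, each field and $0.
show_fields() {
    printf '%s\n' "$1" | awk -F, '{
        print "NF = " NF
        for (i = 1; i <= NF; i++)
            print "$" i " = " $i
        print "$0 = " $0
    }'
}

show_fields "S2,1362,5,Hammer - Ball Pein,2"
```

For the first record of Listing 1 this reports NF = 5, and $4 comes out as Hammer - Ball Pein: the spaces inside the description are kept because only the comma separates fields.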

The command getline is used to read the records. Two variables, ClientStat and SaleStat, are assigned to determine if the end of file has been reached. This is all put together as follows:

FS = ",";
SaleStat = getline < SalesFile;

This command causes the first record in the Sales file to be read and its fields to be split into variables $1 to $5. If the Sales file cannot be opened, for example because it does not exist, the value of SaleStat is -1. If the read of the record is successful, the value of SaleStat is 1, and if the end of the Sales file is reached, the value of SaleStat is 0. Within a program these values can be checked to provide the user with meaningful error messages.
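The three return values can be observed with a short experiment. This is only a sketch: it passes the file name in with -v rather than through ARGV, and the temporary file stands in for a real Sales file.

```shell
# Show the getline return values: 1 (record read), 0 (end of file)
# and -1 (file could not be opened).
getline_status() {
    awk -v SalesFile="$1" 'BEGIN {
        FS = ","
        while ((SaleStat = (getline < SalesFile)) > 0)
            print "read ok, status " SaleStat ", stock code " $1
        print "final status " SaleStat
    }'
}

tmp=$(mktemp)
printf 'S2,1362,5,Hammer - Ball Pein,2\n' > "$tmp"

getline_status "$tmp"                   # one record, then end of file
getline_status "$tmp.does-not-exist"    # cannot be opened
rm -f "$tmp"
```

The first call reads one record with status 1 and then stops with a final status of 0; the second call never enters the loop and reports -1. Putting parentheses around getline < SalesFile keeps the intent clear when the result is assigned and compared in one expression.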

We then assign the various fields to variables.

SCode = $1;
SCId = $2;
SQty = $3;
SDesc = $4;
SPrice = $5;

A similar method (bear in mind the setting of FS) is used to read the Client file and place its fields into variables.
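One possible shape for ReadClient() is sketched below. Only CId, ClientStat and EndClient are names that appear in this article; ClientFile and the remaining variable names (CName, CAddr1 and so on) are assumptions made for the example, and the file name is passed with -v to keep the sketch short.

```shell
# Hypothetical ReadClient(); variable names other than CId, ClientStat
# and EndClient are assumptions.
read_clients() {
    awk -v ClientFile="$1" '
    function ReadClient()
    {
        FS = "~"                            # Client file separator
        ClientStat = (getline < ClientFile)
        if (ClientStat <= 0) {              # end of file or read error
            EndClient = 1
            return
        }
        CName = $1; CId = $2
        CAddr1 = $3; CAddr2 = $4; CAddr3 = $5
        CZip = $6; CTel = $7
    }
    BEGIN {
        EndClient = 0
        ReadClient()
        while (EndClient == 0) {
            print CId ": " CName ", " CZip
            ReadClient()
        }
    }'
}

tmp=$(mktemp)
printf '%s\n' \
    'Ray Ram~1362~11 Hard Drive~Platter Heights~Suburbia~1497~(014)2352345' \
    'Jim Smith~1380~25 USB Road~Port Harbour~~7407~(022)5473486' > "$tmp"
read_clients "$tmp"
rm -f "$tmp"
```

Note the empty third address line in the second record: the two adjacent tildes still count as a field, so the postal code stays in $6.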

We then create a few functions, PrintHeader(), PrintClient() and PrintInvHead(), to print the A P Building Supplies address, the Client's name and address and the invoice item headings, respectively. Here is the PrintHeader() function:

function PrintHeader()
{
 printf "\t\t\t\t\t\tA P Building Supplies\n"  > InvFile;
 printf "\t\t\t\t\t\t59 Hardware Avenue\n" > InvFile;
 printf "\t\t\t\t\t\tHammerville\n"  > InvFile;
 printf "\t\t\t\t\t\t2439\n\n"  > InvFile;
}

By using the printf function within AWK, we can add special characters to control how the output is printed. "\t" is the tab character, and using so many of them ensures that we print the A P Building Supplies address on the right-hand side of the invoice, per the specifications. Similarly, "\n" indicates a newline character. Those of you familiar with C programming should be familiar with the various options available for use with printf. For those not familiar with C programming, here is an example.

printf  "I am %d years old", $1

If $1 contained the value 7, the output would be: I am 7 years old.

The %d indicates that an integer is printed at that position in the string. Commonly used options are %s, %d and %f. The %s indicates that a string is to be printed. Width can be controlled by using %-10s, which prints the string in a field at least 10 characters wide. A string longer than 10 characters is not truncated by the width alone; to cut it off at 10 characters, add a precision, as in %-10.10s. The minus sign before the 10 indicates the string must be left justified; by default a string is right justified. The %f option could be used as %4.2f, which prints a number with two digits to the right of the decimal point in a field at least four characters wide.
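The effect of width, justification and precision is easiest to see side by side. The sketch below prints each value between square brackets so the padding is visible; the sample strings are taken from Listing 1.

```shell
# Width, justification and precision in printf, shown between brackets.
printf_demo() {
    awk 'BEGIN {
        s = "Screwdriver - Phillips"      # 22 characters long
        printf "[%-10s]\n", "Pliers"      # left justified, padded to 10
        printf "[%10s]\n", "Pliers"       # right justified, padded to 10
        printf "[%-10s]\n", s             # longer than 10: printed in full
        printf "[%-10.10s]\n", s          # precision .10 cuts it to 10
        printf "[%4.2f]\n", 1.5           # two decimal places
        printf "[%8.2f]\n", 1.5           # same, padded to 8 characters
    }'
}

printf_demo
```

The third line comes out as [Screwdriver - Phillips] and the fourth as [Screwdrive], which shows that only the precision shortens a string.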

We need to create a function to print the invoice line items, and it is called PrintInvLine(). This function does a calculation of the invoice item's total by multiplying the quantity of the item (SQty) by the item unit price (SPrice). A variable called RTotal is used to keep a running total of the total value of the invoice. The function looks like this:

function PrintInvLine()
{
 STotal = 0;
 STotal = SPrice * SQty;
 RTotal = RTotal + STotal;
 printf "%10s\t%-19s\t%d\t%1.2f\t\t%5.2f\n",SCode,SDesc,SQty,SPrice,STotal > InvFile;
}

A function called PrintInvTotal() then is created to print the total value of the invoice.

function PrintInvTotal()
{
 printf "\t\t\t\t\t\t\t\t------\n" > InvFile;
 printf "\t\t\t\t\t\t\t\t%5.2f\n\n",RTotal > InvFile;
 printf "\f"  > InvFile;
 RTotal = 0;
 InvTotFlag = 0;
 HeaderFlag = 0;
}

As we can see, the tabs again are used to print the invoice's total value on the right-hand side of the invoice, underneath a dashed line. In addition, we create a flag variable called HeaderFlag to indicate if we are printing an additional invoice line item for a customer or whether we are starting a new invoice and, as such, need to print the headers again. A flag variable called InvTotFlag is used to determine when the end of the invoice has been reached and the invoice total needs to be printed.

Now, to be able to construct the program by using the logic explained above, we need to be aware of a few additional AWK control statements: if, while, break and exit. Within AWK, the if statement has the following syntax:

if (expression)
   {
    statement if expression is true
   }
else
   {
    statement if expression is false
   }

So, to check if the Sales file exists, we would say:

if (SaleStat == -1)
   {
    print "Sales file does not exist!!";
    exit(1);
   }

The while statement has this syntax:

while (expression)
    statement if expression is true

So, to read every record in the Sales file until we reach the end of file, we would say:

while (EndSales == 0) 
        {
           ReadSales();
        }

The exit statement, as mentioned earlier, causes the program to exit, and we can exit the program with an error code by specifying exit(x), where x is the error code we wish to return. break, on the other hand, causes an exit from the enclosing while or for loop only; the program carries on with the statement after the loop.
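The difference between the two can be seen in a short sketch. The loop bounds and the exit status 3 are arbitrary values picked for the example.

```shell
# break leaves only the loop; exit ends the whole program.
loop_demo() {
    awk 'BEGIN {
        for (i = 1; i <= 10; i++) {
            if (i == 4)
                break                 # leaves the for loop only
            print "in loop, i = " i
        }
        print "after loop, i = " i    # still runs after the break
        exit 3                        # ends the program with status 3
        print "never printed"
    }'
}

loop_demo || echo "exit status: $?"
```

The loop prints three lines, the line after the loop reports i = 4, and the shell then shows the exit status of 3 that was passed to exit.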

If we want to evaluate two expressions within a while loop--for example, Name = Bill AND Age = 10--we would use the representation && to indicate the AND.

while (( SCId == CId )  &&  ( EndSales == 0))

If the example had been Name = Bill OR Age = 10, then you would use the representation || to indicate the OR.

Comments in AWK can be inserted by using a # followed by your comment. For example:

if (name == "Alan")   # if the value of name is Alan

The main processing is as follows:

while (EndSales == 0)                 # While it is NOT the EOF Sales file
   {
    # While the Sales file Client Code and the Client file Client Code are
    # equal AND it is NOT the EOF Sales file
    while ((SCId == CId) && (EndSales == 0))
       {
        if (HeaderFlag == 0)          # The first time SCId=CId, so it is a new invoice
           {
            PrintHeader();
            PrintClient();
            PrintInvHead();
            HeaderFlag = 1;           # Headers have been printed for this invoice
            InvTotFlag = 1;           # An invoice total will need to be printed at
                                      # the end of this invoice
           }
        else
           {
            PrintInvLine();           # Print the invoice line item
            ReadSales();              # Read the next Sales file record
           }
       }
    if (InvTotFlag == 1)
        PrintInvTotal();              # The Sales file Client Code and the Client file
                                      # Client Code are no longer equal or the EOF Sales
                                      # file has been reached, so print the invoice total.
    HeaderFlag = 0;                   # Prepare to create a new invoice
    ReadClient();                     # Read the next Client file record
    if (EndClient == 1)
        break;                        # If the Client file EOF is reached, break from the loop
   }
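To try the matching logic end to end, here is a condensed, runnable sketch. It is not the article's full listing: the invoice layout is reduced to one heading line, the detail lines and a total, and a single line counter (Lines, a name invented here) stands in for HeaderFlag and InvTotFlag. The SalesFile and ClientFile variable names are likewise assumptions. Both input files must be sorted by Client Code, as the specs state.

```shell
# Condensed sketch of the Sales/Client matching loop (simplified layout).
mk_invoice() {
    awk '
    function ReadSales()
    {
        FS = ","
        if ((getline < SalesFile) <= 0) { EndSales = 1; return }
        SCode = $1; SCId = $2; SQty = $3; SDesc = $4; SPrice = $5
    }
    function ReadClient()
    {
        FS = "~"
        if ((getline < ClientFile) <= 0) { EndClient = 1; return }
        CName = $1; CId = $2
    }
    BEGIN {
        if (ARGC != 4) {
            print "Usage: MkInvoice <SalesFile> <ClientFile> <InvoiceFile>"
            exit 1
        }
        SalesFile = ARGV[1]; ClientFile = ARGV[2]; InvFile = ARGV[3]
        EndSales = 0; EndClient = 0
        ReadSales(); ReadClient()
        while (EndSales == 0 && EndClient == 0) {
            RTotal = 0
            Lines = 0       # detail lines written for the current client
            while (EndSales == 0 && SCId == CId) {
                if (Lines == 0)
                    printf "Invoice for %s (%s)\n", CName, CId > InvFile
                STotal = SPrice * SQty
                RTotal = RTotal + STotal
                printf "%-4s %-24s %3d %8.2f %8.2f\n", SCode, SDesc, SQty, SPrice, STotal > InvFile
                Lines++
                ReadSales()
            }
            if (Lines > 0)
                printf "Total %.2f\n\n", RTotal > InvFile
            ReadClient()    # move on to the next client
        }
        close(InvFile)
    }' "$@"
}

# Try it on a few records from Listings 1 and 2.
dir=$(mktemp -d)
printf '%s\n' 'S2,1362,5,Hammer - Ball Pein,2' \
              'S4,1372,3,Pliers,2' \
              'S3,1372,1,Screwdriver - Phillips,1.5' > "$dir/sales"
printf '%s\n' 'John Penguin~1351~10 Linux Lane~Linuxville~~1103~(012)1345451' \
              'Ray Ram~1362~11 Hard Drive~Platter Heights~Suburbia~1497~(014)2352345' \
              'Gill Bates~1372~7 XWindows Way~Dizzy Heights~Richville~7945~(085)3003021' > "$dir/clients"
mk_invoice "$dir/sales" "$dir/clients" "$dir/invoices"
cat "$dir/invoices"
rm -rf "$dir"
```

Clients without sales, such as John Penguin here, produce no invoice at all, because the total is written only when at least one detail line has been printed.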

The complete program can be seen in Listing 3.

Listing 3. The Complete AWK Program

Conclusion

If you have experience with C programming, you probably will find it quite easy to generate programs in the AWK programming language. I hope this has given you some idea about what AWK is capable of doing. The next time you sit down to write a program, give AWK a chance--it may surprise you yet.

Acknowledgements

Thanks to Ian, Lailaa, Darren and Vinesh for proofreading this article. The references used were the GNU AWK Manual and The AWK Programming Language by Aho, Weinberger and Kernighan.

Alan Bradley works as a Senior Systems Engineer for a large ICT company in South Africa. In between writing AWK programs, he plays chess, fly fishes and makes knives.
