Real Programming with AWK

AWK may not be the first language that comes to mind when you need to write a program, but its flexibility may surprise you.

The purpose of this article is to demonstrate that real programs can be and are written using the AWK programming language. When I first started working with UNIX, going back 10 years now, no one I knew used anything but shell scripts or AWK to write programs. Even now, I still frequently use AWK to create backup scripts and the like. Often people think that AWK can be used for only basic text manipulation tasks, such as cutting a column from a file, or for system administration tasks, such as killing a process. This is not the case; AWK is effective for a number of programming applications. The examples in this article make use of the standard GNU AWK that ships with Red Hat Linux.

Let's use a scenario in which your boss comes to you complaining that the accounting department needs a program that prints invoices using the Sales and Client files from the accounting system. The problem is the C programmer and the Perl guru both are out sick.

"Not a problem", you say, "I can do that for you."

"What?", says the boss. "You don't know C or Perl".

"Not to worry, I know AWK."

The boss looks on disbelieving, but being in a bit of a fix he hands you the specs and says, "Surprise me".

Take a look at the specs first. We have a sample Sales file (Listing 1), which contains columns for Stock Code, Client Code, Quantity of Item purchased, Item Description and Unit Price of Item. We also have a sample from the Client file (Listing 2), with the columns Client Name, Client Code, Address Line 1, Address Line 2, Address Line 3, Postal/Zip Code and Telephone Number. Notice that the Sales file records use commas as the field delimiter, while the Client file uses a tilde as the field delimiter. The specs ask for invoices to be printed using the A P Building Supplies address in the top right-hand corner, followed by the Client name and address on the left-hand side of the invoice. Underneath that we want a heading consisting of the Stock Code, Item Description, Quantity, Unit Price and Total. Then, for each item a customer bought, we want a separate line with the item's Stock Code, Description, Quantity Purchased, Unit Price and Total. At the end of the invoice we want a grand total in the Total column. The Sales file and the Client file are provided sorted in order of Client Code.

Listing 1. Sample from the Sales File
S2,1362,5,Hammer - Ball Pein,2
S4,1372,3,Pliers,2
S3,1372,1,Screwdriver - Phillips,1.5
S6,1372,2,Ruler,1
S1,1380,1,Acrylic Paint - 5L,20

Listing 2. Sample from the Client File
John Penguin~1351~10 Linux Lane~Linuxville~~1103~(012)1345451
Ray Ram~1362~11 Hard Drive~Platter Heights~Suburbia~1497~(014)2352345
Cliff Keys~1371~2 Dump Lane~Backupville~~3546~(042)2345165
Gill Bates~1372~7 XWindows Way~Dizzy Heights~Richville~7945~(085)3003021
Jim Smith~1380~25 USB Road~Port Harbour~~7407~(022)5473486

The first thing to do is think about the logic we are going to use. The idea is to read a Sales file record and then a Client file record. If the Client Code in the Sales file record does not match the Client Code in the Client file record, then read another Client file record. Repeat this until the Sales file Client Code equals the Client file Client Code. Once they match, write the invoice headers to the invoice file and the first detail line, then read the next Sales file record. If the Client Codes still match, write another detail line and read the next Sales file record. If they don't match, write the invoice total and read the next Client file record. Repeat this until all lines in the Sales file have been processed.

Normally when using AWK to write a program like this, you would write the entire program within the BEGIN { } phase of AWK. This allows you to control when the files are accessed and read, to access multiple files for reading if necessary, to specify the rules for reading them and so on.

We want to make it possible for the user to run the program and, at run time, specify the names of the Client file, the Sales file and the invoice file to be written out. Also, if the user runs the program without the required parameters, we want to print a message that gives the user the correct syntax.

We use the built-in AWK variables ARGC and ARGV to determine the parameters being passed. ARGC tells us the number of parameters passed to the program, including the command name itself, while ARGV is an array that contains the actual parameters passed, starting from ARGV[0]. Parameter number 0 contains the command awk itself, so we are interested in parameter numbers 1 to ARGC-1. Try the following for a basic example of how ARGC and ARGV work.

#!/bin/awk -f
BEGIN {
       print ARGC;
       for (f = 0 ; f < ARGC ; f++)
           print ARGV[f];
      }

Try running this program with different parameters, and take a look at the output. You can run the above program by saving it in a file called testing.sh and making it executable with chmod +x testing.sh. Then run it with a few parameters, e.g. ./testing.sh 1 q 2 w. In that case ARGC is 5, because ARGV[0] is counted as well.

Throughout the program we are going to make use of functions to perform various routines. You have the option of passing values to functions when they are called if necessary. The basic syntax to define a function within AWK is as follows:

function funcname(optional arguments)
{
 statements
 within the function
 return x
}

The arguments would be a comma-separated list of variable names. If we are not going to pass arguments to the function, the parentheses () still need to be present. When a function is called, there must be no space between the function name and the opening parenthesis, and it is good practice to follow the same rule in the definition. One also can use the return x statement from within a function to return a particular value, in this case x. The use of return within the function also is optional.

We are going to call our program MkInvoice.sh. The first step is to create a function called Usage() to explain to the user how the program should be called.

function Usage()
{
 print "Usage: MkInvoice  <SalesFile>  <ClientFile> <InvoiceFile>";
 exit(1);
}

In the main body of the program we can check how many arguments were passed. Because ARGC also counts ARGV[0], three file names on the command line give an ARGC of 4. If that is the case, we assume the arguments are the names of the three files needed to run the program and continue. If not, we call the Usage function, which prints the correct syntax and exits the program with error code 1. The error code you want to exit with is specified after exit; the parentheses around it are optional.
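As a minimal sketch of that check, the following wraps a BEGIN-only AWK program in a shell function. The wrapper name check_args and the final print line are illustrative assumptions, not part of the article's program, and the assignment of ARGV[1] to ARGV[3] simply follows the order given in the Usage message.

```shell
# check_args is a hypothetical wrapper used only for this illustration.
check_args() {
    awk '
    function Usage()
    {
        print "Usage: MkInvoice  <SalesFile>  <ClientFile> <InvoiceFile>";
        exit(1);
    }
    BEGIN {
        # ARGC counts ARGV[0] (the awk command itself), so three
        # file names on the command line give ARGC == 4.
        if (ARGC != 4)
            Usage();
        SalesFile  = ARGV[1];
        ClientFile = ARGV[2];
        InvFile    = ARGV[3];
        print "Sales:", SalesFile, "Client:", ClientFile, "Invoice:", InvFile;
    }' "$@"
}

check_args sales.txt client.txt invoice.txt
check_args sales.txt || echo "exit status: $?"
```

Because the program consists only of a BEGIN block, AWK does not try to open the named files itself; they are just strings in ARGV until the program reads them explicitly.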

We are going to create functions for reading the Sales file and the Client file and for storing the contents of each record in variables after it is read. These two functions are named ReadSales() and ReadClient().

Remember that the specifications mentioned the field separator was a comma in the case of the Sales file and a tilde in the case of the Client file. Before we read each file, we need to tell AWK what the field separator is. This is done by using the built-in awk variable FS.

FS = ",";   # The separator of the Sales file

FS = "~";   # The separator of the Client file

Setting this before a file is read tells AWK how to split the fields of the record being read correctly. If you had the same field separator in both files, you would need to set FS only once in your program. Because our files have different field separators, we set FS before each read from either the Sales or the Client file.

When a record is read in AWK, each field is assigned a variable, starting with $1. If one looks at the first line of the Sales file in Listing 1, $1 would have the value of S2, $2 would have the value of 1362, $3 would have the value of 5 and so on. A built-in AWK variable called NF contains the number of fields in the record just read. In this case NF would be 5. The value of $0 would be the entire record, that is, S2,1362,5,Hammer - Ball Pein,2 .
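To see these values for yourself, the short sketch below feeds the first Sales record from Listing 1 to AWK with a comma as the field separator. The shell function name show_fields is an assumption made for the example, and -F, is used here as a command-line shortcut for setting FS.

```shell
# show_fields is a hypothetical helper used only for this illustration.
show_fields() {
    printf '%s\n' 'S2,1362,5,Hammer - Ball Pein,2' |
    awk -F, '{
        print "NF =", NF;    # number of fields in the record
        print "$1 =", $1;    # Stock Code
        print "$4 =", $4;    # Item Description, spaces included
        print "$0 =", $0;    # the entire record
    }'
}

show_fields
```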

The command getline is used to read the records. Two variables, ClientStat and SaleStat, are assigned to determine if the end of file has been reached. This is all put together as follows:

FS = ",";
SaleStat = getline < SalesFile;

This command causes the first record in the Sales file to be read and its fields to be split into variables $1 to $5. If the Sales file does not exist, the value of SaleStat is -1. If the read of the record is successful, the value of SaleStat is 1, and if the end of the Sales file is reached, the value of SaleStat is 0. Within a program these values can be checked to provide the user with meaningful error messages.

We then assign the various fields to variables.

SCode = $1;
SCId = $2;
SQty = $3;
SDesc = $4;
SPrice = $5;

A similar method (bear in mind the setting of FS) is used to read the Client file and place its fields into variables.
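The article does not list the body of ReadClient(), so the following is only a sketch of what it might look like, modelled on the Sales file code above. The variable names CName, CAddr1 and so on, the EndClient handling and the shell wrapper read_client_demo are all assumptions; only CId, ClientStat and EndClient appear in the article.

```shell
# read_client_demo is a hypothetical wrapper; it writes two records from
# Listing 2 to a temporary file and reads them back with getline.
read_client_demo() {
    tmp=$(mktemp)
    cat > "$tmp" <<'EOF'
John Penguin~1351~10 Linux Lane~Linuxville~~1103~(012)1345451
Ray Ram~1362~11 Hard Drive~Platter Heights~Suburbia~1497~(014)2352345
EOF
    awk '
    # Assumed body: set FS, read one record, store its fields.
    function ReadClient()
    {
        FS = "~";
        ClientStat = (getline < ClientFile);
        if (ClientStat == -1) {
            print "Client file does not exist!!";
            exit(1);
        }
        if (ClientStat == 0) {      # end of file reached
            EndClient = 1;
            return;
        }
        CName = $1; CId = $2; CAddr1 = $3; CAddr2 = $4;
        CAddr3 = $5; CZip = $6; CTel = $7;
    }
    BEGIN {
        ClientFile = ARGV[1];
        EndClient = 0;
        ReadClient();
        while (EndClient == 0) {
            print CId ": " CName " (" NF " fields)";
            ReadClient();
        }
    }' "$tmp"
    rm -f "$tmp"
}

read_client_demo
```

Note that the empty Address Line 3 in the first record still counts as a field, so NF is 7 for both records.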

We then create a few functions, PrintHeader(), PrintClient() and PrintInvHead(), to print the A P Building Supplies address, the Client's name and address and the invoice item headings, respectively. Here is the PrintHeader() function:

function PrintHeader()
{
 printf "\t\t\t\t\t\tA P Building Supplies\n"  > InvFile;
 printf "\t\t\t\t\t\t59 Hardware Avenue\n" > InvFile;
 printf "\t\t\t\t\t\tHammerville\n"  > InvFile;
 printf "\t\t\t\t\t\t2439\n\n"  > InvFile;
}

By using the printf function within AWK, we can add special characters to control how the output is printed. "\t" is the tab character, and using so many of them ensures that we print the A P Building Supplies address on the right-hand side of the invoice, per the specifications. Similarly, "\n" indicates a new line or return character. Those of you familiar with C programming should be familiar with the various options available for use with printf. For those not familiar with C programming, here is an example.

printf  "I am %d years old", $1

If $1 contained the value 7, the output would be: I am 7 years old .

The %d indicates that an integer is printed at that position in the string. Commonly used options are %s, %d and %f. The %s indicates that a string is to be printed. Width can be controlled by using %-10s, which prints the string in a field at least 10 characters wide. The width is a minimum: a string longer than 10 characters is printed in full, and a precision must be added, as in %-10.10s, to truncate it to 10 characters. The negative sign before the 10 indicates the string must be left justified; by default a string is right justified. The %f option can be used as %4.2f, which prints a number with two digits to the right of the decimal point in a field at least four characters wide.
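The following sketch shows a handful of printf formats side by side. The square brackets are there only to make the field edges visible, and the shell function name printf_demo and the variable desc are assumptions made for the example.

```shell
# printf_demo is a hypothetical helper used only for this illustration.
printf_demo() {
    awk 'BEGIN {
        desc = "Screwdriver - Phillips";     # 22 characters long
        printf "[%-10s]\n", "Pliers";        # padded to 10, left justified
        printf "[%10s]\n", "Pliers";         # padded to 10, right justified
        printf "[%-10s]\n", desc;            # longer than 10: printed in full
        printf "[%-10.10s]\n", desc;         # precision .10 cuts it to 10
        printf "[%4.2f]\n", 1.5;             # two decimals, exactly 4 wide
        printf "[%4.2f]\n", 20;              # needs 5 characters, so the field grows
        printf "[I am %d years old]\n", 7;
    }'
}

printf_demo
# [Pliers    ]
# [    Pliers]
# [Screwdriver - Phillips]
# [Screwdrive]
# [1.50]
# [20.00]
# [I am 7 years old]
```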

We need to create a function to print the invoice line items, and it is called PrintInvLine(). This function does a calculation of the invoice item's total by multiplying the quantity of the item (SQty) by the item unit price (SPrice). A variable called RTotal is used to keep a running total of the total value of the invoice. The function looks like this:

function PrintInvLine()
{
 STotal = 0;
 STotal = SPrice * SQty;
 RTotal = RTotal + STotal;
 printf "%10s\t%-19s\t%d\t%1.2f\t\t%5.2f\n",SCode,SDesc,SQty,SPrice,STotal > InvFile;
}

A function called PrintInvTotal() then is created to print the total value of the invoice.

function PrintInvTotal()
{
 printf "\t\t\t\t\t\t\t\t------\n" > InvFile;
 printf "\t\t\t\t\t\t\t\t%5.2f\n\n",RTotal > InvFile;
 printf "\f"  > InvFile;
 RTotal = 0;
 InvTotFlag = 0;
 HeaderFlag = 0;
}

As we can see, the tabs again are used to print the invoice's total value on the right-hand side of the invoice, underneath a dashed line. In addition, we create a flag variable called HeaderFlag to indicate if we are printing an additional invoice line item for a customer or whether we are starting a new invoice and, as such, need to print the headers again. A flag variable called InvTotFlag is used to determine when the end of the invoice has been reached and the invoice total needs to be printed.

Now, to be able to construct the program by using the logic explained above, we need to be aware of a few additional AWK control statements: if, while, break and exit. Within AWK, the if statement has the following syntax:

if (expression)
   {
    statement if expression is true
   }
else
   {
    statement if expression is false
   }

So, to check if the Sales file exists, we would say:

if (SaleStat == -1)
   {
    print "Sales file does not exist!!";
    exit(1);
   }

The while statement has this syntax:

while (expression)
    Statement if expression is true

So, to read every record in the Sales file until we reach the end of file, we would say:

while (EndSales == 0) 
        {
           ReadSales();
        }

The exit statement, as mentioned earlier, causes the program to exit, and we can exit the program with an error code by specifying exit(x), where x is the error code we wish to return. break, on the other hand, causes an exit from either a while or a for loop.

If we want to evaluate two expressions within a while loop (for example, Name = Bill AND Age = 10), we use the representation && to indicate the AND.

while (( SCId == CId )  &&  ( EndSales == 0))

If the example had been Name = Bill OR Age = 10, then you would use the representation || to indicate the OR.

Comments in AWK can be inserted by using a # followed by your comment. For example:

if (name == "Alan")   # If the value of name is Alan

The main processing is as follows:

while  (EndSales == 0)              # While it is NOT the EOF Sales file  
        {
           while (( SCId == CId )  &&  ( EndSales == 0))  
                     # While the Sales file Client Code and the Client file Client Code are 
                     # equal AND it is NOT the EOF Sales File
                 {
                   if (HeaderFlag == 0)  # The first time SCId=CId, so it is a new invoice
                      {
                         PrintHeader();
                         PrintClient();
                         PrintInvHead();
                         HeaderFlag = 1;  # Headers have been printed for this Invoice
                         InvTotFlag = 1;  # An invoice total will need to be printed at
                                          # the end of this invoice
                      }
                    else
                         {
                           PrintInvLine(); # Print the invoice line item
                           ReadSales();    # Read the next Sales file record
                         }
                  }
            if (InvTotFlag == 1)
                PrintInvTotal();  # The Sales file Client Code and the Client file Client Code
                                  # are no longer equal or the EOF Sales file has been reached,
                                  # so print the total for the invoice.
            HeaderFlag = 0;       # Prepare to create a new invoice
            ReadClient();         #  Read the next Client file record
            if (EndClient == 1)
                break;            # If the Client file EOF is reached, break from the loop
        }
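To show the main loop running end to end, here is a condensed sketch that applies it to the sample data from Listings 1 and 2. It is not the author's Listing 3: the bodies of ReadSales() and ReadClient() are assumptions, the three header functions are collapsed into a single "Invoice for" line, the output layout is simplified, and the shell wrapper mk_invoice_demo and its temporary files exist only for the demonstration.

```shell
# mk_invoice_demo is a hypothetical wrapper used only for this illustration.
mk_invoice_demo() {
    dir=$(mktemp -d)
    cat > "$dir/sales" <<'EOF'
S2,1362,5,Hammer - Ball Pein,2
S4,1372,3,Pliers,2
S3,1372,1,Screwdriver - Phillips,1.5
S6,1372,2,Ruler,1
S1,1380,1,Acrylic Paint - 5L,20
EOF
    cat > "$dir/client" <<'EOF'
John Penguin~1351~10 Linux Lane~Linuxville~~1103~(012)1345451
Ray Ram~1362~11 Hard Drive~Platter Heights~Suburbia~1497~(014)2352345
Cliff Keys~1371~2 Dump Lane~Backupville~~3546~(042)2345165
Gill Bates~1372~7 XWindows Way~Dizzy Heights~Richville~7945~(085)3003021
Jim Smith~1380~25 USB Road~Port Harbour~~7407~(022)5473486
EOF
    awk '
    function ReadSales()            # assumed body
    {
        FS = ",";
        SaleStat = (getline < SalesFile);
        if (SaleStat <= 0) { EndSales = 1; return; }
        SCode = $1; SCId = $2; SQty = $3; SDesc = $4; SPrice = $5;
    }
    function ReadClient()           # assumed body
    {
        FS = "~";
        ClientStat = (getline < ClientFile);
        if (ClientStat <= 0) { EndClient = 1; return; }
        CName = $1; CId = $2;
    }
    function PrintInvLine()
    {
        STotal = SPrice * SQty;
        RTotal = RTotal + STotal;
        printf "%s\t%s\t%d\t%.2f\t%.2f\n", SCode, SDesc, SQty, SPrice, STotal > InvFile;
    }
    function PrintInvTotal()
    {
        printf "Total: %.2f\n\n", RTotal > InvFile;
        RTotal = 0; InvTotFlag = 0; HeaderFlag = 0;
    }
    BEGIN {
        SalesFile = ARGV[1]; ClientFile = ARGV[2]; InvFile = ARGV[3];
        ReadSales();                # prime both files with one record each
        ReadClient();
        while (EndSales == 0) {
            while ((SCId == CId) && (EndSales == 0)) {
                if (HeaderFlag == 0) {
                    # Stands in for PrintHeader, PrintClient and PrintInvHead
                    printf "Invoice for %s (%s)\n", CName, CId > InvFile;
                    HeaderFlag = 1;
                    InvTotFlag = 1;
                } else {
                    PrintInvLine();
                    ReadSales();
                }
            }
            if (InvTotFlag == 1)
                PrintInvTotal();
            HeaderFlag = 0;
            ReadClient();
            if (EndClient == 1)
                break;
        }
    }' "$dir/sales" "$dir/client" "$dir/invoice"
    cat "$dir/invoice"
    rm -rf "$dir"
}

mk_invoice_demo
```

With the sample data, clients 1351 and 1371 have no sales and so get no invoice, while the invoices for 1362, 1372 and 1380 total 10.00, 9.50 and 20.00.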

The complete program can be seen in Listing 3.

Listing 3. The Complete AWK Program

______________________

Comments


AWK Cheat Sheet

Peteris Krumins

Hey, this is a great post for beginners on AWK.

For AWK learners I'd like to suggest the AWK cheat sheet I have made and put on my blog:
AWK Cheat Sheet.

This cheat sheet contains:
* Predefined Variable Summary, which lists all the predefined variables and which awk versions (original awk, nawk or gawk) have it built in,
* GNU awk’s command line argument summary,
* I/O statements,
* Numeric functions,
* Bit manipulation functions,
* I18N (internationalization) functions,
* String functions, and finally,
* Time functions.

For more advanced users, I made a YouTube video downloader in GAWK (which supports networking).
Read how it was made (download link on the page) on my blog:
GAWK Youtube Video Downloader

Sincerely,
Peteris Krumins

Re: Real Programming with AWK

Anonymous

Awk does bring out the nostalgic old man in me.

And it's a great resource for small tasks, including those small
tasks that might need to be managed in another language afterwards (an almost zero learning curve *is* a boost for
project jumpstarting).

Awk is also more available in every kind of setup.

I used to write notes at university on my calculator, then
transferred them to my system and used a pseudoheader
in my notes, analyzed by AWK, to redirect them to
the proper file in my system. Clever, compact and thus
quite Unix-like.

Yet I didn't start doing it in Unix, but rather in Windows
using Cygwin: That distribution (b20) lacked perl.

Even at my first fulltime job in a Telephone Company,
AIX and SunOS installations lacked perl, but had awk
readily available.

E.G.
E-mail: garciag@ieee.org
Cellphone: MEX-55-9198-7119
Web: http://garcia.d2g.biz/garcia/
Messenger: ernestogarciagossio
PGP: http://pgpkeys.mit.edu/ - garcia@prousa.net

Just use Perl, please...

Anonymous

As someone who used to write a lot of shell and awk (I once wrote a 4GL compiler using this...) before Perl was available, I just have to urge anyone reading this article to not bother with awk.

Perl is very easy to get used to, and unlike awk can be used for virtually any programming task, as well as being good for whipping up simple scripts. You can even use its bundled 'a2p' tool to convert scripts from awk to Perl. Try taking some awk scripts and feeding them into a2p to get an idea of how Perl works.

Perl?

Anonymous

Perl is a horrid language; it's bloated, extremely unintuitive and very hard to 'code-read'. Even good Perl programmers find themselves saying WTF when coming back to their own code after a couple of months. Bash, Python, PHP: there are lots of other, better choices.

Choice

KP

The keywords are choice, time, knowledge and availability. If you have a few minutes, some small Linux distribution or old *nix and only awk, the choice is determined: awk. awk is a standard programming language; it's small and practical. Perl has more features; it's more than a programming language, it's a technology. So, you can make a choice, if you can.

KP.

Re: Just use Perl, please...

Anonymous

Please...

I used a screwdriver to mine marble. It wasn't fun. Therefore, all screwdrivers suck in all contexts.

BigGiantClue: Awk isn't a particularly good tool to implement "a 4GL compiler", whatever that buzzword means this week...

Awk is ruthlessly efficient in its problem space; it's inefficient in all others. Perl tries to be all things to all people, with all the apropos tradeoffs, with the bonus of appropriately byzantine syntax!

Those who can't tell the difference have bigger issues.

Re: Real Programming with AWK

Anonymous

I can't agree more about how powerful AWK can be. I've just used awk to complement a shell script to automate duplicating files from a list and setting the properties as needed. It is also very simple to learn (for the record, I am a three-week-old Linux user, and I wrote the script two weeks ago). It is simple and usable for freshies.

Re: Real Programming with AWK

Anonymous

I use it to grind log files and generate tabular data (spam rejects, naughty hosts from ids) for display as web pages. BEGIN and END are great for appending html headers and footers.

Re: Real Programming with AWK

Anonymous

Well, awk is a fairly opaque language. Why bother learning awk when you can learn perl and do so much more? And please don't tell me that perl is a resource hog compared to awk! That doesn't wash any more. Forget awk and sed, use perl for goodness sake!

Re: Real Programming with AWK

Anonymous

Absolute crap. Awk is an order of magnitude faster than Perl for anything other than the most complicated awk scripts. Sed is vastly more efficient than perl for what it does.

And awk is "opaque" but perl isn't...oh, I get it...this is a troll! Sorry...I got completely taken...

Re: Real Programming with AWK

Anonymous

- for the same reason that sometimes it is better to do something in perl rather than C or C++.
- for the same reason that other times it is better to do something in a shell script rather than awk
- because AWK is learnt in 5 minutes (totally), so there is no cost
I recommend you learn AWK and you will find several other reasons.

I'm learning Awk at the moment

Anonymous

Hi, thanks for the article. I'm learning awk at the moment and it's a very nice language to learn; it's great for admin and for programming with fields. It's also a nice way to get your feet wet in programming.

Re: I'm looking for Alan Bradley

Anonymous

I am not a programmer and I need a script (possibly with awk) to read a .txt file and convert it to html or xml.
The .txt file has a structure which must be known, starting with the headlines and titles of the file.
To html the file has to have a file with the title, the headline and the content.
The same with xml files.
Best regards.
Marcio@ViaBsb.com
