The awk Utility

An introduction to the Linux data manipulation tool called awk.

Partly tool and partly programming language, awk has had a reputation for being overly complex and difficult to use. This column demonstrates its usefulness without getting hung up on the complexity.

Scripting languages such as the UNIX shell and specialty tools like awk and sed have been a standard part of the UNIX landscape since UNIX became commercially available. In 1982, “real programmers” used C for everything. Tools such as sed and awk were viewed as slow, large programs that “hogged” the CPU. Even applications that performed structured data processing and report-generation tasks were implemented in fast, compiled languages like C.

Part of my motivation for writing this article comes from observing that, even today, most system administrators and developers are either uninformed about or intimidated by utilities like awk and sed. As a result, tasks that should be automated continue to be performed manually (or not at all), or less suitable tools are used instead.

Admittedly, both awk and sed are rather peculiar tools/languages. Both recognize traditional UNIX “regular expressions”—powerful, but not trivial to learn. Both tools seem to offer too many features—quite often providing several ways of performing the same task. Therefore, mastering all the features of awk and sed and confidently applying them can take a while—or so it may seem. First impressions notwithstanding, you can quickly and effectively apply these tools once you understand their general usefulness and become familiar with a subset of their most useful features. My intent is to provide you with enough information and example code for getting jump-started with awk. You can read about sed in April's “Take Command: Good Ol' sed” by Hans de Vreught.

sed and awk are two of the most productive tools I have ever used. I rely on them quite heavily to implement a wide range of tasks, the implementation of which would take considerably longer using other tools/languages.

I will assume you have heard of or worked with some of the more significant sub-systems of Linux and that you have an understanding of how to use the basic features of the shell command line, such as file I/O and piping. Familiarity with a standard editor such as vi and a working knowledge of regular expressions would also be useful. Many Linux commands, including grep, awk and sed, accept regular expressions as part of their invocation, so you should at least learn the basics.

A Word about Regular Expressions

My coverage of the awk tool is limited to an introductory foundation. Many advanced features are offered by awk (gawk and nawk) but will not be covered here.

General Overview

The meaning behind the name of this tool is not terribly interesting, but I'll include an explanation to solve the mystery of its rather uncommon name. awk was named after its original developers: Aho, Weinberger and Kernighan. awk scripts are readily portable across all flavors of UNIX/Linux.

awk is typically used to process structured textual data. It can easily be used as part of a command-line filter sequence, since by default, it expects its input from the standard input stream (stdin) and writes its output to the standard output stream (stdout). In some of the most effective applications, awk is used in concert with sed—complementing each other's strengths.
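
As a minimal illustration of this filter role (the sample passwd-style lines here are made up for demonstration), awk can sit in a pipeline, reading stdin and writing stdout:

```shell
# awk as a pipeline filter: read colon-delimited records from stdin,
# print only the first field (the user name) to stdout.
printf 'root:x:0:0\ndaemon:x:1:1\n' |
    awk -F: '{print $1}'
# prints:
#   root
#   daemon
```

The -F: option sets the field separator to a colon; nothing is read from or written to any file unless you ask for it.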

The following shell command scans the contents of a file called oldfile, changing all occurrences of the word “UNIX” to “Linux” and writing the resulting text to a file called newfile.

$ awk '{gsub(/UNIX/, "Linux"); print}' oldfile > newfile

Obviously, awk does not change the contents of the original file. That is, it behaves as a stream editor should—passively writing new content to an output stream. This example barely demonstrates anything useful, but it does show that simple tasks can be implemented simply. Although awk is commonly invoked from a parent shell script covering a grander scope, it can be (and often is) used directly from the command line to perform a single straightforward task as just shown.

Although awk has been employed to perform a variety of tasks, it is most suitable for parsing and manipulating textual data and generating formatted reports. A typical (and tangible) example application for awk is one where a lengthy system log file needs to be examined and summarized into a formatted report. Consider the log files generated by the sendmail daemon or the uucp program. These files are typically lengthy, boring and generally hard on a system administrator's eyes. An awk script can be employed to parse each entry, produce a set of category counts and flag those entries which represent suspicious activity.
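
A sketch of that kind of summarization follows. The log format here is hypothetical (assume the third whitespace-separated field of each entry is a status keyword); a real sendmail or uucp log would need a different field layout:

```shell
# Tally log entries by status keyword (hypothetical format:
# "time host STATUS"). An associative array accumulates the counts,
# and the END block prints the summary once all input is read.
printf '10:01 host1 OK\n10:02 host2 FAIL\n10:03 host1 OK\n' |
    awk '{count[$3]++}
         END {for (s in count) print s, count[s]}'
```

The order in which `for (s in count)` visits the keys is unspecified, so pipe the output through sort if a stable report order matters.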

The most significant characteristics of awk are:

  • It views its input as a set of records and fields.

  • It offers programming constructs that are similar (but not identical) to the C language.

  • It offers built-in functions and variables.

  • Its variables are typeless.

  • It performs pattern matching through regular expressions.
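
The first and third points above can be seen in a one-liner (the sample input is invented for illustration). Each input line is a record, each whitespace-separated word a field, and the built-in variables NR and NF report the record number and field count:

```shell
# Records and fields, plus built-in variables:
# NR = current record (line) number, NF = number of fields,
# $1 = the first field of the current record.
printf 'alpha beta\ngamma delta epsilon\n' |
    awk '{print NR, NF, $1}'
# prints:
#   1 2 alpha
#   2 3 gamma
```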

awk scripts can be very expressive and are often several pages in length. The awk language offers the typical programming constructs expected in any high-level programming language. It has been described as an interpreted version of the C language, but although there are similarities, awk differs from C both semantically and syntactically. A host of default behaviors, loose data typing, and built-in functions and variables make awk preferable to C for quick-prototyping tasks.
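
To make the C comparison concrete, here is a small sketch using C-like constructs. Note the loose typing: $1 arrives as the string "3" but is used numerically in the loop condition without any conversion:

```shell
# C-like for loop and if statement inside an awk action.
# Sums the odd integers from 1 up to the value on each input line.
printf '3\n' |
    awk '{ total = 0
           for (i = 1; i <= $1; i++)
               if (i % 2 == 1)
                   total += i
           print total }'
# prints: 4   (1 + 3)
```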

