Book Excerpt: A Practical Guide to Linux Commands, Editors, and Shell Programming

This article is an excerpt from the new 2nd Ed. of Mark Sobell's book, A Practical Guide to Linux Commands, Editors, and Shell Programming, published Nov. 2009 by Prentice Hall Professional, ISBN 0131367366, Copyright 2010 Mark G. Sobell. For additional sample content from a selection of chapters, please visit the publisher site: www.informit.com/title/0131367366

Chapter 12: The AWK Pattern Processing Language

AWK is a pattern-scanning and processing language that searches one or more files for records (usually lines) that match specified patterns. It processes lines by performing actions, such as writing the record to standard output or incrementing a counter, each time it finds a match. Unlike procedural languages, AWK is data driven: You describe the data you want to work with and tell AWK what to do with the data once it finds it.

You can use AWK to generate reports or filter text. It works equally well with numbers and text; when you mix the two, AWK usually comes up with the right answer. The authors of AWK (Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan) designed the language to be easy to use. To achieve this end they sacrificed execution speed in the original implementation.

AWK takes many of its constructs from the C programming language. It includes the following features:

  • A flexible format

  • Conditional execution

  • Looping statements

  • Numeric variables

  • String variables

  • Regular expressions

  • Relational expressions

  • C’s printf

  • Coprocess execution (gawk only)

  • Network data exchange (gawk only)

Syntax

A gawk command line has the following syntax:

gawk [options] [program] [file-list]
gawk [options] -f program-file [file-list]

The gawk utility takes its input from files you specify on the command line or from standard input. An advanced command, getline, gives you more choices about where input comes from and how gawk reads it (page 558). Using a coprocess, gawk can interact with another program or exchange data over a network (page 560; not available under awk or mawk). Output from gawk goes to standard output.

Arguments

In the preceding syntax, program is a gawk program that you include on the command line. The program-file is the name of the file that holds a gawk program. Putting the program on the command line allows you to write short gawk programs without having to create a separate program-file. To prevent the shell from interpreting the gawk commands as shell commands, enclose the program within single quotation marks. Putting a long or complex program in a file can reduce errors and retyping.

The file-list contains the pathnames of the ordinary files that gawk processes. These files are the input files. When you do not specify a file-list, gawk takes input from standard input or as specified by getline or a coprocess.
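The two forms of the command line can be sketched as follows. The sample data, the file names (fruit.txt, first.awk), and the shell variable names are made up for illustration. The commands call awk rather than gawk on the assumption that awk is installed under that name; the programs use only features common to gawk, awk, and mawk.

```shell
workdir=$(mktemp -d)                         # scratch directory for the sample files
printf 'apple 3\npear 7\n' > "$workdir/fruit.txt"

# Form 1: the program is on the command line, within single quotation marks
# so the shell does not interpret $1.
inline=$(awk '{ print $1 }' "$workdir/fruit.txt")

# Form 2: the same program is stored in a program-file and read with -f.
printf '{ print $1 }\n' > "$workdir/first.awk"
fromfile=$(awk -f "$workdir/first.awk" "$workdir/fruit.txt")

# No file-list: the program takes its input from standard input.
piped=$(printf 'kiwi 2\n' | awk '{ print $1 }')

printf '%s\n' "$inline" "$fromfile" "$piped"
rm -r "$workdir"
```

Both forms run the same program, so inline and fromfile hold the same text.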

Options

Options preceded by a double hyphen (--) work under gawk only. They are not available under awk and mawk.

--field-separator fs
-F fs
Uses fs as the value of the input field separator (FS variable).

--file program-file
-f program-file
Reads the gawk program from the file named program-file instead of the command line. You can specify this option more than once on a command line. See examples.

--help
-W help
Summarizes how to use gawk (gawk only).

--lint
-W lint
Warns about gawk constructs that may not be correct or portable (gawk only).

--posix
-W posix
Runs a POSIX-compliant version of gawk. This option introduces some restrictions; see the gawk man page for details (gawk only).

--traditional
-W traditional
Ignores the new GNU features in a gawk program, making the program conform to UNIX awk (gawk only).

--assign var=value
-v var=value
Assigns value to the variable var. The assignment takes place prior to execution of the gawk program and is available within the BEGIN pattern (page 535). You can specify this option more than once on a command line.
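A short sketch of the two options that work under all implementations, -F and -v, follows. The input lines and the variable names (who, a, b) are hypothetical, and the commands call awk on the assumption that it is installed under that name.

```shell
# -F sets the input field separator; here the fields are separated by colons.
users=$(printf 'root:x:0\nsam:x:1000\n' | awk -F: '{ print $1 }')

# -v assigns a variable before the program runs, so BEGIN can already use it.
greeting=$(awk -v who=world 'BEGIN { print "hello, " who }')

# -v can appear more than once on a command line.
sum=$(awk -v a=2 -v b=3 'BEGIN { print a + b }')

printf '%s\n' "$users" "$greeting" "$sum"
```

A program that holds only a BEGIN pattern does not read any input, which is why the last two commands need no file-list.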

Notes

See the tip on the previous page for information on AWK implementations.

For convenience many Linux systems provide a link from /bin/awk to /bin/gawk or /bin/mawk. As a result you can run the program using either name.

Language Basics

A gawk program (from program on the command line or from program-file) consists of one or more lines containing a pattern and/or action in the following format:

pattern { action }

The pattern selects lines from the input. The gawk utility performs the action on all lines that the pattern selects. The braces surrounding the action enable gawk to differentiate it from the pattern. If a program line does not contain a pattern, gawk selects all lines in the input. If a program line does not contain an action, gawk copies the selected lines to standard output.

To start, gawk compares the first line of input (from the file-list or standard input) with each pattern in the program. If a pattern selects the line (if there is a match), gawk takes the action associated with the pattern. If the line is not selected, gawk does not take the action. When gawk has completed its comparisons for the first line of input, it repeats the process for the next line of input. It continues this process of comparing subsequent lines of input until it has read all of the input.

If several patterns select the same line, gawk takes the actions associated with each of the patterns in the order in which they appear in the program. It is possible for gawk to send a single line from the input to standard output more than once.
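The selection rules described above can be tried with a few lines of made-up data. The sketch assumes awk is installed under that name; the labels A: and B: exist only to show which action produced each line.

```shell
data='alpha 1
beta 2
alpha 3'

# A pattern with no action: the selected lines are copied to standard output.
no_action=$(printf '%s\n' "$data" | awk '/alpha/')

# An action with no pattern: every line is selected.
no_pattern=$(printf '%s\n' "$data" | awk '{ print $2 }')

# Two patterns select the line "alpha 3", so it is written twice,
# once by each action, in the order the actions appear in the program.
twice=$(printf '%s\n' "$data" | awk '
/alpha/ { print "A:", $0 }
$2 > 2  { print "B:", $0 }')

printf '%s\n' "$no_action" "$no_pattern" "$twice"
```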

Patterns

~ and !~

You can use a regular expression (Appendix A), enclosed within slashes, as a pattern. The ~ operator tests whether a field or variable matches a regular expression (examples on page 543). The !~ operator tests for no match. You can perform both numeric and string comparisons using the relational operators listed in Table 12-1. You can combine any of the patterns using the Boolean operators || (OR) or && (AND).

Table 12-1 Relational operators

Relational operator    Meaning
<                      Less than
<=                     Less than or equal to
==                     Equal to
!=                     Not equal to
>=                     Greater than or equal to
>                      Greater than
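The following sketch exercises each kind of pattern against a small, invented data set (name, amount, department). It assumes awk is installed under that name and uses only patterns described in this section.

```shell
data='sam 1200 sales
max 800 sales
zach 1500 support'

# A regular expression within slashes is tested against the whole record.
whole=$(printf '%s\n' "$data" | awk '/support/ { print $1 }')

# ~ tests one field against a regular expression; !~ tests for no match.
starts_s=$(printf '%s\n' "$data" | awk '$1 ~ /^s/ { print $1 }')
not_sales=$(printf '%s\n' "$data" | awk '$3 !~ /sales/ { print $1 }')

# Relational operators compare numbers or strings; && and || combine patterns.
both=$(printf '%s\n' "$data" | awk '$2 >= 1000 && $3 == "sales" { print $1 }')
either=$(printf '%s\n' "$data" | awk '$2 < 1000 || $1 == "zach" { print $1 }')

printf '%s\n' "$whole" "$starts_s" "$not_sales" "$both" "$either"
```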

BEGIN and END

Two special patterns, BEGIN and END, let you run commands outside the main processing of the input. The gawk utility executes the actions associated with the BEGIN pattern before it starts processing the input and the actions associated with the END pattern after it finishes processing all the input. See examples.
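As a minimal sketch, the following program prints a heading from BEGIN, adds up a column of made-up numbers, and prints the sum from END. The variable name total is arbitrary, and the command assumes awk is installed under that name.

```shell
report=$(printf '4\n6\n10\n' | awk '
BEGIN { print "values:" }          # runs once, before any input is read
      { total += $1; print $1 }    # runs for each line of input
END   { print "total", total }     # runs once, after the last line
')

printf '%s\n' "$report"
```

Because total is not assigned before its first use, it starts at 0, so the program needs no explicit initialization.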

, (comma)

The comma is the range operator. If you separate two patterns with a comma on a single gawk program line, gawk selects a range of lines, beginning with the first line that matches the first pattern. The last line gawk selects is the next line that matches the second pattern. If no line matches the second pattern, gawk selects every line through the end of the input. After gawk finds the second pattern, it starts the process over by looking for the first pattern. See examples.
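The behavior of the range operator, including the restart and the case in which the second pattern never matches, can be seen with a short made-up input. The markers START and STOP are hypothetical, and the command assumes awk is installed under that name.

```shell
# Input lines, one per argument: a START b STOP c START d
ranges=$(printf '%s\n' a START b STOP c START d | awk '/START/,/STOP/')

printf '%s\n' "$ranges"
```

The first range runs from START through STOP. The line c falls outside any range. The second START opens a new range that has no matching STOP, so it runs through the end of the input and includes d.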

Actions

The action portion of a gawk command tells gawk what to do with each record the pattern selects. When you do not specify an action, gawk performs the default action, which is the print command (explicitly represented as {print}). This action copies the record (normally a line; see “Record separators”) from the input to standard output.

When you follow a print command with arguments, gawk displays only the arguments you specify. These arguments can be variables or string constants. You can send the output from a print command to a file (use > within the gawk program), append it to a file (>>), or send it through a pipe to the input of another program ( | ). A coprocess (|&) is a two-way pipe that exchanges data with a program running in the background (available under gawk only).

Unless you separate items in a print command with commas, gawk catenates them. Commas cause gawk to separate the items with the output field separator (OFS, normally a SPACE).

You can include several actions on one line by separating them with semicolons.
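The sketch below shows the print variations just described: items without commas, items with commas, two actions separated by a semicolon, output redirected to a file, and output sent through a pipe. The data, the file name names.txt, and the variable out are made up, and the commands assume awk is installed under that name.

```shell
workdir=$(mktemp -d)
line='sam 1200'

# Without a comma the items are catenated; with a comma they are separated by OFS.
joined=$(printf '%s\n' "$line" | awk '{ print $1 $2 }')
spaced=$(printf '%s\n' "$line" | awk '{ print $1, $2 }')

# Two actions on one line, separated by a semicolon.
labeled=$(printf '%s\n' "$line" | awk '{ print "user:", $1; print "pay:", $2 }')

# > within the program sends the output of print to a file.
printf '%s\n' "$line" | awk -v out="$workdir/names.txt" '{ print $1 > out }'
saved=$(cat "$workdir/names.txt")

# | within the program sends the output of print to another command.
sorted=$(printf 'b\na\n' | awk '{ print $1 | "sort" }')

printf '%s\n' "$joined" "$spaced" "$labeled" "$saved" "$sorted"
rm -r "$workdir"
```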

Comments

The gawk utility disregards anything on a program line following a pound sign (#). You can document a gawk program by preceding comments with this symbol.

Variables

Although you do not need to declare gawk variables prior to their use, you can assign initial values to them if you like. Unassigned numeric variables are initialized to 0; string variables are initialized to the null string. In addition to supporting user variables, gawk maintains program variables. You can use both user and program variables in the pattern and action portions of a gawk program. Table 12-2 lists a few program variables.

Table 12-2 Variables

Variable    Meaning
$0          The current record (as a single variable)
$1–$n       Fields in the current record
FILENAME    Name of the current input file (null for standard input)
FS          Input field separator (default: SPACE or TAB)
NF          Number of fields in the current record
NR          Record number of the current record
OFS         Output field separator
ORS         Output record separator (default: NEWLINE)
RS          Input record separator (default: NEWLINE)

In addition to initializing variables within a program, you can use the --assign (-v) option to initialize variables on the command line. This feature is useful when the value of a variable changes from one run of gawk to the next.
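A few of these variables can be seen at work in the following sketch. The input text and the user variables (n, s, limit) are made up, and the commands assume awk is installed under that name. The expression $NF is not listed in the table; it is simply the field whose number is NF, that is, the last field.

```shell
# NR is the record number, NF the number of fields, $NF the last field.
numbered=$(printf 'one two\nthree four five\n' | awk '{ print NR, NF, $NF }')

# Unassigned variables act as 0 in a numeric context and as the null string
# in a string context.
unset_vars=$(awk 'BEGIN { print n + 0, "[" s "]" }')

# A user variable set on the command line and used in a pattern.
first_two=$(printf 'a\nb\nc\n' | awk -v limit=2 'NR <= limit')

printf '%s\n' "$numbered" "$unset_vars" "$first_two"
```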

Record separators

By default the input and output record separators are NEWLINE characters. Thus gawk takes each line of input to be a separate record and appends a NEWLINE to the end of each output record. By default the input field separators are SPACEs and TABs; the default output field separator is a SPACE. You can change the value of any of the separators at any time by assigning a new value to its associated variable either from within the program or from the command line by using the --assign (-v) option.
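The following sketch changes each separator in turn, some from within the program (in a BEGIN action) and one from the command line. The data and the separator characters are arbitrary choices, and the commands assume awk is installed under that name; the short -v form is used because it works under all implementations.

```shell
# RS: records are separated by semicolons instead of NEWLINEs.
records=$(printf 'a b;c d;e f' | awk 'BEGIN { RS = ";" } { print NR ": " $1 }')

# FS: fields are separated by commas instead of SPACEs and TABs.
fields=$(printf 'x,y,z\n' | awk 'BEGIN { FS = "," } { print $2 }')

# OFS: items separated by commas in print are joined with a hyphen.
joined=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { print $1, $2, $3 }')

# ORS, assigned on the command line: each output record ends with a semicolon.
ended=$(printf 'a\nb\n' | awk -v ORS=';' '{ print }')

printf '%s\n' "$records" "$fields" "$joined" "$ended"
```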

Functions

Table 12-3 lists a few of the functions gawk provides for manipulating numbers and strings.

Table 12-3 Functions

Function               Meaning
length(str)            Returns the number of characters in str; without an argument, returns the number of characters in the current record (page 545)
int(num)               Returns the integer portion of num
index(str1,str2)       Returns the index of str2 in str1 or 0 if str2 is not present
split(str,arr,del)     Places elements of str, delimited by del, in the array arr[1]...arr[n]; returns the number of elements in the array (page 556)
sprintf(fmt,args)      Formats args according to fmt and returns the formatted string; mimics the C programming language function of the same name
substr(str,pos,len)    Returns the substring of str that begins at pos and is len characters long
tolower(str)           Returns a copy of str in which all uppercase letters are replaced with their lowercase counterparts
toupper(str)           Returns a copy of str in which all lowercase letters are replaced with their uppercase counterparts
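Each function in the table can be tried from a BEGIN action, which needs no input. The strings and the variable names (s, n, parts) in this sketch are made up, and the command assumes awk is installed under that name.

```shell
funcs=$(awk 'BEGIN {
    s = "Hello, World"
    print length(s)                   # number of characters in s
    print int(7.9)                    # integer portion of 7.9
    print index(s, "World")           # position of "World" in s
    n = split("a:b:c", parts, ":")    # n is the number of elements
    print n, parts[1], parts[3]
    print sprintf("%.2f", 3.14159)    # two digits after the decimal point
    print substr(s, 1, 5)             # five characters starting at position 1
    print tolower(s)
    print toupper(s)
}')

printf '%s\n' "$funcs"
```

String positions start at 1, which is why index returns 8 for "World" and substr(s, 1, 5) returns the first five characters.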
