Book Excerpt: A Practical Guide to Linux Commands, Editors, and Shell Programming
This article is an excerpt from the new 2nd Ed. of Mark Sobell's book, A Practical Guide to Linux Commands, Editors, and Shell Programming, published Nov. 2009 by Prentice Hall Professional, ISBN 0131367366, Copyright 2010 Mark G. Sobell. For additional sample content from a selection of chapters, please visit the publisher site: www.informit.com/title/0131367366
Chapter 12: The AWK Pattern Processing Language
AWK is a pattern-scanning and processing language that searches one or more files for records (usually lines) that match specified patterns. It processes lines by performing actions, such as writing the record to standard output or incrementing a counter, each time it finds a match. Unlike procedural languages, AWK is data driven: You describe the data you want to work with and tell AWK what to do with the data once it finds it.
You can use AWK to generate reports or filter text. It works equally well with numbers and text; when you mix the two, AWK usually comes up with the right answer. The authors of AWK (Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan) designed the language to be easy to use. To achieve this end they sacrificed execution speed in the original implementation.
AWK takes many of its constructs from the C programming language. It includes the following features:
A flexible format
Conditional execution
Looping statements
Numeric variables
String variables
Regular expressions
Relational expressions
C’s printf
Coprocess execution (gawk only)
Network data exchange (gawk only)
Syntax
A gawk command line has the following syntax:
gawk [options] [program] [file-list]
gawk [options] -f program-file [file-list]
The gawk utility takes its input from files you specify on the command line or from standard input. An advanced command, getline, gives you more choices about where input comes from and how gawk reads it (page 558). Using a coprocess, gawk can interact with another program or exchange data over a network (page 560; not available under awk or mawk). Output from gawk goes to standard output.
Arguments
In the preceding syntax, program is a gawk program that you include on the command line. The program-file is the name of the file that holds a gawk program. Putting the program on the command line allows you to write short gawk programs without having to create a separate program-file. To prevent the shell from interpreting the gawk commands as shell commands, enclose the program within single quotation marks. Putting a long or complex program in a file can reduce errors and retyping.
The file-list contains the pathnames of the ordinary files that gawk processes. These files are the input files. When you do not specify a file-list, gawk takes input from standard input or as specified by getline or a coprocess.
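The three ways of supplying a program and its input described above can be sketched as follows. The file names and sample data are hypothetical, and the sketch calls the interpreter as awk on the assumption that this name is available (substitute gawk if you prefer).

```shell
# Hypothetical sample data: a name and a quantity, one record per line.
tmpdir=$(mktemp -d)
printf 'apple 3\npear 7\n' > "$tmpdir/fruit.txt"

# 1. Program on the command line, protected from the shell by single quotes.
inline_out=$(awk '{ print $1 }' "$tmpdir/fruit.txt")

# 2. The same program stored in a program-file and loaded with -f.
printf '{ print $1 }\n' > "$tmpdir/first.awk"
file_out=$(awk -f "$tmpdir/first.awk" "$tmpdir/fruit.txt")

# 3. No file-list: input comes from standard input.
stdin_out=$(printf 'kiwi 2\n' | awk '{ print $1 }')

printf '%s\n' "$inline_out" "$file_out" "$stdin_out"
rm -r "$tmpdir"
```

The first two commands should produce identical output, because only the place where the program is stored differs.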
AWK has many implementations
The AWK language was originally implemented under UNIX as the awk utility. Most Linux distributions provide gawk (the GNU implementation of awk) or mawk (a faster, stripped-down version of awk). Mac OS X provides awk. This chapter describes gawk. All the examples in this chapter work under awk and mawk except as noted; the exceptions make use of coprocesses. You can easily install gawk on most Linux distributions. See gawk.darwinports.com if you are running Mac OS X. For a complete list of gawk extensions, see GNU EXTENSIONS in the gawk man page or see the gawk info page.
Options
Options preceded by a double hyphen (--) work under gawk only. They are not available under awk and mawk.
--field-separator fs
-F fs
Uses fs as the value of the input field separator (FS variable).
--file program-file
-f program-file
Reads the gawk program from the file named program-file instead of the command line. You can specify this option more than once on a command line. See examples.
--help
-W help
Summarizes how to use gawk (gawk only).
--lint
-W lint
Warns about gawk constructs that may not be correct or portable (gawk only).
--posix
-W posix
Runs a POSIX-compliant version of gawk. This option introduces some restrictions; see the gawk man page for details (gawk only).
--traditional
-W traditional
Ignores the new GNU features in a gawk program, making the program conform to UNIX awk (gawk only).
--assign var=value
-v var=value
Assigns value to the variable var. The assignment takes place prior to execution of the gawk program and is available within the BEGIN pattern (page 535). You can specify this option more than once on a command line.
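As a minimal sketch of the two options you are likely to use most often, the following command combines the short forms of the field separator and assignment options. The records and the variable name min are made up for illustration, and the interpreter is called as awk on the assumption that this name is available.

```shell
# Hypothetical colon-separated records in the style of /etc/passwd.
records='root:x:0:0
daemon:x:1:1
alice:x:1000:1000'

# -F sets FS to a colon; -v assigns min before the program starts running.
selected=$(printf '%s\n' "$records" | awk -F: -v min=1000 '$3 >= min { print $1 }')
echo "$selected"
```

Only the record whose third field is at least 1000 is selected, so the command prints alice.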
Notes
See the tip "AWK has many implementations" for information on AWK implementations.
For convenience many Linux systems provide a link from /bin/awk to /bin/gawk or /bin/mawk. As a result you can run the program using either name.
Language Basics
A gawk program (from program on the command line or from program-file) consists of one or more lines containing a pattern and/or action in the following format:
pattern { action }
The pattern selects lines from the input. The gawk utility performs the action on all lines that the pattern selects. The braces surrounding the action enable gawk to differentiate it from the pattern. If a program line does not contain a pattern, gawk selects all lines in the input. If a program line does not contain an action, gawk copies the selected lines to standard output.
To start, gawk compares the first line of input (from the file-list or standard input) with each pattern in the program. If a pattern selects the line (if there is a match), gawk takes the action associated with the pattern. If the line is not selected, gawk does not take the action. When gawk has completed its comparisons for the first line of input, it repeats the process for the next line of input. It continues this process of comparing subsequent lines of input until it has read all of the input.
If several patterns select the same line, gawk takes the actions associated with each of the patterns in the order in which they appear in the program. It is possible for gawk to send a single line from the input to standard output more than once.
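The following sketch illustrates how one input line can reach standard output more than once. The input and the labels are invented for the example; the program has two pattern-action lines that both select the same record, plus one pattern with no action. It calls the interpreter as awk, assuming that name is available.

```shell
# Both of the first two patterns select "beta 20", so it is printed twice.
# The third line has a pattern but no action, so the record is copied as is.
# No pattern selects "alpha 5", so it is not printed at all.
out=$(printf 'alpha 5\nbeta 20\ngamma 1\n' | awk '
/beta/   { print "first rule:  " $0 }
$2 > 10  { print "second rule: " $0 }
/gamma/')
echo "$out"
```

The actions run in the order in which their patterns appear in the program, so the "first rule" line precedes the "second rule" line.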
Patterns
~ and !~
You can use a regular expression (Appendix A), enclosed within slashes, as a pattern. The ~ operator tests whether a field or variable matches a regular expression (examples on page 543). The !~ operator tests for no match. You can perform both numeric and string comparisons using the relational operators listed in Table 12-1. You can combine any of the patterns using the Boolean operators || (OR) or && (AND).
Table 12-1 Relational operators
| Relational operator | Meaning |
|---|---|
| < | Less than |
| <= | Less than or equal to |
| == | Equal to |
| != | Not equal to |
| >= | Greater than or equal to |
| > | Greater than |
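The pattern types described above can be sketched with a small, made-up inventory (item, color, quantity, price). The data and variable names are hypothetical, and the interpreter is called as awk on the assumption that this name is available.

```shell
data='pen blue 12 150
ink black 40 900
pad white 8 60
pen red 25 300'

# A regular expression by itself is matched against the whole record.
whole=$(printf '%s\n' "$data" | awk '/ink/')

# ~ tests one field against a regular expression; !~ tests for no match.
field_match=$(printf '%s\n' "$data" | awk '$1 ~ /^p/ { print $2 }')
field_nomatch=$(printf '%s\n' "$data" | awk '$1 !~ /^p/ { print $2 }')

# Two relational comparisons combined with && (AND).
combined=$(printf '%s\n' "$data" | awk '$3 >= 12 && $4 < 500 { print $2 }')

printf '%s\n' "$whole" "$field_match" "$field_nomatch" "$combined"
```

Replacing && with || in the last command would select every record that satisfies either comparison instead of both.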
BEGIN and END
Two unique patterns, BEGIN and END, execute commands before gawk starts processing the input and after it finishes processing the input. The gawk utility executes the actions associated with the BEGIN pattern before, and with the END pattern after, it processes all the input. See examples.
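A short sketch of BEGIN and END, using invented input: the BEGIN action prints a header before any input is read, the middle line runs for every record, and the END action prints a total after the last record. The interpreter is called as awk, assuming that name is available.

```shell
report=$(printf '4\n6\n10\n' | awk '
BEGIN { print "values:" }          # runs before any input is read
      { sum += $1; print "  " $1 } # no pattern: runs for every record
END   { print "total: " sum }      # runs after the last record
')
echo "$report"
```

The last line of output is total: 20, the sum of the three input values.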
, (comma)
The comma is the range operator. If you separate two patterns with a comma on a single gawk program line, gawk selects a range of lines, beginning with the first line that matches the first pattern. The last line selected by gawk is the next subsequent line that matches the second pattern. If no line matches the second pattern, gawk selects every line through the end of the input. After gawk finds the second pattern, it begins the process again by looking for the first pattern again. See examples.
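The range behavior can be sketched with made-up input in which the markers START and STOP are arbitrary strings chosen for the example. The first range is closed by STOP; the second has no closing match and so runs to the end of the input. The interpreter is called as awk, assuming that name is available.

```shell
lines='one
START
two
STOP
three
START
four'

# Select from each line matching /START/ through the next line matching /STOP/.
ranged=$(printf '%s\n' "$lines" | awk '/START/,/STOP/')
echo "$ranged"
```

The lines one and three fall outside both ranges and are not printed.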
Actions
The action portion of a gawk command causes gawk to take that action when it matches a pattern. When you do not specify an action, gawk performs the default action, which is the print command (explicitly represented as {print}). This action copies the record (normally a line; see “Record separators”) from the input to standard output.
When you follow a print command with arguments, gawk displays only the arguments you specify. These arguments can be variables or string constants. You can send the output from a print command to a file (use > within the gawk program), append it to a file (>>), or send it through a pipe to the input of another program ( | ). A coprocess (|&) is a two-way pipe that exchanges data with a program running in the background (available under gawk only).
Unless you separate items in a print command with commas, gawk catenates them. Commas cause gawk to separate the items with the output field separator (OFS, normally a SPACE).
You can include several actions on one line by separating them with semicolons.
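The print behaviors described above can be sketched as follows, using invented input. The sketch calls the interpreter as awk and pipes to the sort utility, assuming both are available.

```shell
# Without commas the items are catenated; with commas OFS separates them.
joined=$(printf 'a b\n' | awk '{ print $1 $2 }')
spaced=$(printf 'a b\n' | awk '{ print $1, $2 }')

# Two actions on one line, separated by a semicolon; OFS is set to a hyphen.
dashed=$(printf 'a b\n' | awk '{ OFS = "-"; print $2, $1 }')

# Output of print sent through a pipe to another program.
sorted=$(printf 'pear\napple\n' | awk '{ print $1 | "sort" }')

printf '%s\n' "$joined" "$spaced" "$dashed" "$sorted"
```

The first three commands print ab, a b, and b-a respectively; the last prints apple before pear because sort reorders the lines that print sends it.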
Comments
The gawk utility disregards anything on a program line following a pound sign (#). You can document a gawk program by preceding comments with this symbol.
Variables
Although you do not need to declare gawk variables prior to their use, you can assign initial values to them if you like. Unassigned numeric variables are initialized to 0; string variables are initialized to the null string. In addition to supporting user variables, gawk maintains program variables. You can use both user and program variables in the pattern and action portions of a gawk program. Table 12-2 lists a few program variables.
Table 12-2 Variables
| Variable | Meaning |
|---|---|
| $0 | The current record (as a single variable) |
| $1–$n | Fields in the current record |
| FILENAME | Name of the current input file (null for standard input) |
| FS | Input field separator (default: SPACE or TAB) |
| NF | Number of fields in the current record |
| NR | Record number of the current record |
| OFS | Output field separator |
| ORS | Output record separator (default: NEWLINE) |
| RS | Input record separator (default: NEWLINE) |
In addition to initializing variables within a program, you can use the --assign (-v) option to initialize variables on the command line. This feature is useful when the value of a variable changes from one run of gawk to the next.
By default the input and output record separators are NEWLINE characters. Thus gawk takes each line of input to be a separate record and appends a NEWLINE to the end of each output record. By default the input field separators are SPACEs and TABs; the default output field separator is a SPACE. You can change the value of any of the separators at any time by assigning a new value to its associated variable either from within the program or from the command line by using the --assign (-v) option.
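A minimal sketch of program and user variables working together, with invented input: FS and OFS are set from the command line, NR and NF are read in the action, and the user variable count is never initialized, so it starts at 0. The interpreter is called as awk, assuming that name is available.

```shell
table=$(printf 'a:b:c\nd:e\n' | awk -v FS=: -v OFS=' | ' '
{ print NR, NF, $NF; count++ }   # $NF is the last field of the record
END { print "records", count }')
echo "$table"
```

Because the items in each print command are separated by commas, the new OFS value appears between them, including in the END action.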
Functions
Table 12-3 lists a few of the functions gawk provides for manipulating numbers and strings.
Table 12-3 Functions
| Function | Meaning |
|---|---|
| length(str) | Returns the number of characters in str; without an argument, returns the number of characters in the current record (page 545) |
| int(num) | Returns the integer portion of num |
| index(str1,str2) | Returns the index of str2 in str1 or 0 if str2 is not present |
| split(str,arr,del) | Places elements of str, delimited by del, in the array arr[1]...arr[n]; returns the number of elements in the array (page 556) |
| sprintf(fmt,args) | Formats args according to fmt and returns the formatted string; mimics the C programming language function of the same name |
| substr(str,pos,len) | Returns the substring of str that begins at pos and is len characters long |
| tolower(str) | Returns a copy of str in which all uppercase letters are replaced with their lowercase counterparts |
| toupper(str) | Returns a copy of str in which all lowercase letters are replaced with their uppercase counterparts |
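The functions in Table 12-3 can be tried out in a BEGIN action, which needs no input. The strings and variable names below are made up for illustration, and the interpreter is called as awk on the assumption that this name is available.

```shell
funcs=$(awk 'BEGIN {
    s = "Hello, World"
    print length(s)                 # number of characters in s
    print int(3.9)                  # integer portion of 3.9
    print index(s, "World")         # position of "World" within s
    print substr(s, 1, 5)           # five characters starting at position 1
    print toupper(s)
    print tolower(s)
    n = split("a:b:c", parts, ":")  # fills parts[1] through parts[3]
    print n, parts[2]
    print sprintf("%05.1f", 3.14159)
}')
echo "$funcs"
```

String positions start at 1, so index reports 8 for "World" and substr(s, 1, 5) returns Hello.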