Quick and Dirty Data Extraction in AWK
Many years ago, probably close to 20, a point regularly made on the comp.* Usenet newsgroups was to use the minimum tool that gets the job done. Someone would ask for a quick and dirty way to do something, and the followups often included a C solution, followed by an AWK solution, followed by a sed solution and so on.
Today, I still try to use this philosophy when addressing a problem. In this particular case, explained below, I picked AWK. If any of you old-timers are reading this article, though, I expect you will come up with a sed-based solution.
I signed up for a daily summary of currency exchange rates. It's a free service, and you can subscribe too at http://www.xe.com/cus/. Most days, I take a quick look at how the US dollar is doing against the Euro and then save the e-mail. Some days I simply save the message. I save them because I have always thought that, someday, I would write a program to show the exchange rate trend. But doing so has been a low priority.
A while ago, as I was looking at a few of the saved e-mail messages, I realized that although writing a fancy graphing program to show trends was a low-priority task, writing a quick-and-dirty hack was not. Writing this kind of hack would take less time than I was already spending on spot-checking random messages.
What I wanted to do was track dates and numbers and then produce a minimalist graphical display of the trend. The first step was to look at the data. Here is an extract of part of a message:
>From list@en.ucc.xe.net Wed Sep 10 12:22:53 2003
...
XE.com's Currency Update Service writes:
Here is today's Currency Update, a service of XE.com. Please read the
copyright, terms of use agreement, and information sections at the end of
this message.
CUS5D0B3D5C16D9
____________________________________________________________________________
If you find our free currency e-mail updates useful, please forward this
message to a friend! Subscribe for free at: http://www.xe.com/cus/
____________________________________________________________________________
<PRE>
Rates as of 2003.09.09 20:46:35 UTC (GMT). Base currency is EUR.
Currency Unit                          EUR per Unit         Units per EUR
================================   ===================   ===================
USD United States Dollars               0.890585              1.12286
EUR Euro                                1.00000               1.00000
GBP United Kingdom Pounds               1.41659               0.705920
CAD Canada Dollars                      0.651411              1.53513
...
</PRE>
For help reading this mailout, refer to: http://www.xe.com/cus/sample.htm
...
The ... lines indicate that I tossed out a lot of uninteresting lines.
I need three things from these e-mail messages to produce my report:
The "Rates as of" line to get the date
The "USD" line to get the actual conversion rate
The </PRE> line to tell me to print the info and clear my variables. I don't really have to clear them if the data is good, but it seemed like a good way to detect bad data. This is a quick hack, yes, but not a disgustingly quick hack.
The numeric part of the solution is really easy. Simply grab the date information and the rate information. When I get to the </PRE> line, print it out.
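To see why fields 4 and 6 are the ones to grab, it can help to run the two interesting lines from the sample message through AWK on their own. The following is only a sketch: the function name show_fields is made up for illustration, and the input is just the two lines copied from the extract above.

```shell
# Hypothetical helper (not part of the original script): print the two
# fields the report is built from.
show_fields() {
    awk '
        /Rates as of/ { print "date:", $4 }   # Rates(1) as(2) of(3) 2003.09.09(4)
        /^USD/        { print "rate:", $6 }   # USD(1) United(2) States(3) Dollars(4) 0.890585(5) 1.12286(6)
    '
}

printf '%s\n' \
    'Rates as of 2003.09.09 20:46:35 UTC (GMT). Base currency is EUR.' \
    'USD United States Dollars 0.890585 1.12286' | show_fields
```

Because AWK splits on whitespace by default, the three-word currency name pushes the "Units per EUR" value out to field 6. A currency with a shorter or longer name would land in a different field, which is fine for a one-currency hack but worth remembering.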
The graphical portion is accomplished by printing a number of plus signs that corresponds to the rate. To get decent resolution, I would need either a wide printout or some sort of offset. I went for the offset, assuming the Euro would not drop below $.90, which was pretty safe considering the direction it had been going.
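The arithmetic behind the bar can be checked in isolation. This sketch assumes the same offset (.895) and scale factor (100) that the script uses; the name bar_len is hypothetical and exists only for this illustration.

```shell
# Hypothetical helper: how many plus signs does a given rate produce?
bar_len() {
    awk -v rate="$1" 'BEGIN {
        rc = (rate - .895) * 100          # cents above the .895 offset
        n = 0
        for (i = 0; i < rc; i++) n++      # same loop bound as the report
        print n
    }'
}

bar_len 0.90
bar_len 1.036
bar_len 1.123
```

The three calls print 1, 15 and 23. In other words, each plus sign is worth one cent, and starting the offset half a cent below $.90 means a rate is effectively rounded to the nearest cent before it is drawn.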
Finally, I wanted a heading. Using AWK's BEGIN block, I put in a couple of print statements. I don't like to count characters, so I defined the variable over to hold the run of spaces that needs to be placed before the title information in order to align everything. Doing so simply meant I had to run the program, see how far off I was and adjust the variable. Here is the code:
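BEGIN {
    # "over" pads the heading so the scale starts where the bars do:
    # 17 spaces here plus the space that print puts between its arguments
    # match the 18-character "date rate " prefix printed below.
    over = "                 "
    print over, "        Cost of Euros in $ by date"
    print over, ".9        1.0       1.1       1.2       1.3"
    print over, "|         |         |         |         |"
}
/Rates as of/ { date = $4 }    # e.g. 2003.09.09
/^USD/        { rate = $6 }    # the "Units per EUR" column
/^<\/PRE>/ {
    printf "%s %6.3f ", date, rate
    rc = (rate - .895) * 100    # one plus sign per cent above the offset
    for (i = 0; i < rc; i++) printf "+"
    printf "\n"
    date = "xxx"                # reset so a message with missing data stands out
    rate = 0
}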
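Resetting date and rate after each message means an incomplete message shows up in the report as an "xxx" date or a 0.000 rate with no bar. If you would rather have such messages reported separately, one possible variation is sketched below. The function name check_records and the wording of the warning are my own assumptions, and writing to "/dev/stderr" relies on your AWK or your system providing that name, which common AWK implementations on Linux do.

```shell
# Hypothetical variation: print good records, complain about bad ones.
check_records() {
    awk '
        /Rates as of/ { date = $4 }
        /^USD/        { rate = $6 }
        /^<\/PRE>/    {
            # An unset or reset date, or a zero rate, means a line was missing.
            if (date == "" || date == "xxx" || rate + 0 == 0)
                print "incomplete record near input line " NR > "/dev/stderr"
            else
                printf "%s %6.3f\n", date, rate
            date = "xxx"
            rate = 0
        }
    '
}

# Second message has no USD line, so only the first is printed.
printf '%s\n' \
    'Rates as of 2003.09.09 20:46:35 UTC (GMT). Base currency is EUR.' \
    'USD United States Dollars 0.890585 1.12286' \
    '</PRE>' \
    'Rates as of 2003.09.10 20:46:35 UTC (GMT). Base currency is EUR.' \
    '</PRE>' | check_records
```

For a quick hack the original approach is arguably better, since the bad line stays in the report where you will notice it.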
Running the program with the mail file as input prints all the result lines, but the order is that of the data in the mail file. So it was the sort program to the rescue. The first field in the output is the date, and some careful choosing of the first character of the title lines means everything sorts correctly, with no options. Thus, to run the AWK program, use:
awk -f cc.as messages | sort
and you get your fancy report. Pipe the result through more if you have a lot of lines to look at.
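The trick of letting sort put the heading first depends on plain byte order, where a space sorts before a period, a period before a digit, and a digit before a vertical bar. Some locales collate punctuation and spaces differently, so if the heading ends up in the wrong place, forcing the C locale for the sort is a reasonable thing to try. The sketch below uses made-up, shortened report lines and a hypothetical function name, sorted_report.

```shell
# Hypothetical wrapper: sort the report in plain byte order.
sorted_report() {
    LC_ALL=C sort
}

# Shortened stand-ins for report lines, deliberately out of order.
printf '%s\n' \
    '2003.09.09  1.123 +++' \
    '                  |         |' \
    '2003.01.02  1.036 ++' \
    '                  .9        1.0' \
    '                          Cost of Euros in $ by date' | sorted_report
```

In the C locale this prints the title, the scale labels and the tick marks first, in that order, followed by the data lines in date order. With the real script the equivalent command would be awk -f cc.as messages | LC_ALL=C sort.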
Here is a sample of the output from the AWK script:
                          Cost of Euros in $ by date
                  .9        1.0       1.1       1.2       1.3
                  |         |         |         |         |
2003.01.02  1.036 +++++++++++++++
...
2003.08.28  1.087 ++++++++++++++++++++
2003.08.29  1.098 +++++++++++++++++++++
2003.08.31  1.099 +++++++++++++++++++++
2003.09.01  1.097 +++++++++++++++++++++
2003.09.02  1.081 +++++++++++++++++++
2003.09.04  1.094 ++++++++++++++++++++
2003.09.05  1.110 ++++++++++++++++++++++
2003.09.07  1.110 ++++++++++++++++++++++
2003.09.08  1.107 ++++++++++++++++++++++
2003.09.09  1.123 +++++++++++++++++++++++
2003.09.10  1.121 +++++++++++++++++++++++
2003.09.11  1.120 +++++++++++++++++++++++
2003.09.12  1.129 ++++++++++++++++++++++++
2003.09.14  1.127 ++++++++++++++++++++++++
2003.09.15  1.128 ++++++++++++++++++++++++
2003.09.16  1.117 +++++++++++++++++++++++
2003.09.17  1.129 ++++++++++++++++++++++++
2003.09.18  1.124 +++++++++++++++++++++++
2003.09.19  1.138 +++++++++++++++++++++++++
Okay, sed experts, have at it.
Copyright (c) 2003, Phil Hughes. Originally published in Linux Gazette issue 95. Copyright (c) 2003, Specialized Systems Consultants, Inc.
Phil Hughes is Group Publisher for SSC Publishing, Ltd.