Quick and Dirty Data Extraction in AWK

October 25th, 2005 by Phil Hughes in

Need to pull some data from text--maybe e-mail messages--and sort it, graph it or otherwise manipulate it? Here's an AWK script to use as a starting point.
Your rating: None Average: 4 (1 vote)

Many years ago, probably close to 20, a regular point was made on the comp Usenet newsgroups about using the minimum tool to get the job done. That is, someone would ask for a quick and dirty way to do something. The followups often included a C solution followed by an AWK solution followed by a sed solution and so on.

Today, I still try to use this philosophy when addressing a problem. In this particular case, explained below, I picked AWK. If any of you old-timers are reading this article, though, I expect you will come up with a sed-based solution.

The Problem: Extracting Data from E-mail Messages

I signed up for a daily summary of currency exchange rates. It's a free service, and you can subscribe too--just go here. Most days, I take a quick look at how the US dollar is doing against the Euro and then save the e-mail. Some days I simply save the message. I save them because I always have thought that, someday, I would write a program to show the rate exchange trend. But doing so has been a low priority.

A while ago, as I was looking at a few of the saved e-mail messages, I realized that although writing a fancy graphing program to show trends was a low-priority task, writing a quick-and-dirty hack was not. Writing this kind of hack would require less time than what I was spending on doing a random sampling.

What I wanted to do was track dates and numbers and then produce a minimalist graphical display of the trend. The first step was to look at the data. Here is an extract of part of a message:


>From list@en.ucc.xe.net  Wed Sep 10 12:22:53 2003
...

XE.com's Currency Update Service writes:

Here is today's Currency Update, a service of XE.com. Please read the
copyright, terms of use agreement, and information sections at the
end of this message.  CUS5D0B3D5C16D9
____________________________________________________________________________

If you find our free currency e-mail updates useful, please forward this
message to a friend! Subscribe for free at: http://www.xe.com/cus/
____________________________________________________________________________
<PRE>

Rates as of 2003.09.09 20:46:35 UTC (GMT). Base currency is EUR.

Currency Unit                          EUR per Unit         Units per EUR
================================   ===================   ===================
USD United States Dollars                 0.890585              1.12286     
EUR Euro                                  1.00000               1.00000     
GBP United Kingdom Pounds                 1.41659               0.705920    
CAD Canada Dollars                        0.651411              1.53513     
...

</PRE>

For help reading this mailout, refer to: http://www.xe.com/cus/sample.htm

...

The ... lines indicate that I tossed out a lot of uninteresting lines.

I need three things from these e-mail messages to produce my report:

  • The "Rates as of" line to get the date

  • The "USD" line to get the actual conversion rate

  • The </PRE> line to tell me to print the info and clear my variables. I don't really have to clear them if the data is good, but it seemed like a good way to detect bad data. This is a quick hack, yes, but not a disgustingly quick hack.

The Solution

The numeric part of the solution is really easy. Simply grab the date information and the rate information. When I get to the </PRE> line, print it out.

The graphical portion is accomplished by printing a number of plus signs that corresponds to the rate. To get decent resolution, I would need either a wide printout or some sort of offset. I went for the offset, assuming the Euro would not drop below $.90, which was pretty safe considering the direction it had been going.

Finally, I wanted a heading. Using AWK's BEGIN block, I put in a couple of print statements. I don't like to count characters, so I defined the variable over to be the number of spaces that needs to be placed before the title information in order to align everything. Doing so simply meant I had to run the program, see how far off I was and adjust the variable. Here is the code:


BEGIN		{
		over = "                 "
		print over, " Cost of Euros in $ by date"
		print over, ".9       1.0       1.1       1.2       1.3"
		print over, "|         |         |         |         |"
		}
/Rates as of/	{ date = $4 }
/^USD/		{ rate = $6 }
/^<\/PRE>/	{
		printf "%s %6.3f ", date, rate
		rc = (rate - .895) * 100
		for (i=0; i < rc; i++) printf "+"
		printf "\n"
		date = "xxx"
		rate = 0
		}

Running the program with the mail file as input prints all the result lines, but the order is that of the data in the mail file. So it was the sort program to the rescue. The first field in the output is the date, and some careful choosing of the first character of the title lines means everything sorts correctly, with no options. Thus, to run the AWK program, use:


    awk -f cc.as messages | sort 

and you get your fancy report. Pipe the result thru more if you have a lot of lines to look at.

Here is a sample of the output from the AWK script:



                   Cost of Euros in $ by date
                  .9       1.0       1.1       1.2       1.3
                  |         |         |         |         |
2003.01.02  1.036 +++++++++++++++
...
2003.08.28  1.087 ++++++++++++++++++++
2003.08.29  1.098 +++++++++++++++++++++
2003.08.31  1.099 +++++++++++++++++++++
2003.09.01  1.097 +++++++++++++++++++++
2003.09.02  1.081 +++++++++++++++++++
2003.09.04  1.094 ++++++++++++++++++++
2003.09.05  1.110 ++++++++++++++++++++++
2003.09.07  1.110 ++++++++++++++++++++++
2003.09.08  1.107 ++++++++++++++++++++++
2003.09.09  1.123 +++++++++++++++++++++++
2003.09.10  1.121 +++++++++++++++++++++++
2003.09.11  1.120 +++++++++++++++++++++++
2003.09.12  1.129 ++++++++++++++++++++++++
2003.09.14  1.127 ++++++++++++++++++++++++
2003.09.15  1.128 ++++++++++++++++++++++++
2003.09.16  1.117 +++++++++++++++++++++++
2003.09.17  1.129 ++++++++++++++++++++++++
2003.09.18  1.124 +++++++++++++++++++++++
2003.09.19  1.138 +++++++++++++++++++++++++

Okay, sed experts, have at it.

Copyright (c) 2003, Phil Hughes. Originally published in Linux Gazette issue 95. Copyright (c) 2003, Specialized Systems Consultants, Inc.

Phil Hughes is Group Publisher for SSC Publishing, Ltd.

__________________________

Phil Hughes


Special Magazine Offer -- 2 Free Trial Issues!
Receive 2 free trial issues of Linux Journal as well as instant online access to current and past issues. There's NO RISK and NO OBLIGATION to buy. CLICK HERE for offer

Linux Journal: delivering readers the advice and inspiration they need to get the most out of their Linux systems since 1994.

Sorry, offer available in the US only. International orders, click here.

Post new comment

Please note that comments may not appear immediately, so there is no need to repost your comment.
The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <pre> <ul> <ol> <li> <dl> <dt> <dd> <i> <b>
  • Lines and paragraphs break automatically.

More information about formatting options

Featured Videos

The October 9, 2008 edition of Linux Journal Live! Associate Editor, Shawn Powers, and Kyle Rankin, "Hack and /" columnist and author of Knoppix Hacks, Linux Multimedia Hacks, Knoppix Pocket Reference and others, discuss Linux distributions.

The October 2, 2008 edition of Linux Journal Live! Associate Editor, Shawn Powers, and Steven Evatt, Online Development manager for The Houston Chronicle discuss surviving disaster with Linux.

From the Magazine

November 2008, #175

There aren't many numbers that put the US national debt to shame, but here's one: 1,100,000,000,000,000. What's that? That's how many floating-point operations per second the Roadrunner supercomputer at Las Alamos can perform. That's about 100 FLOPS per dollar of US debt (unfortunately, the debt is winning the second derivative race). Read the article about Roadrunner in this month's High Performance Computing issue of LJ.

Along with that, find out how to program the Cell processor and how to use CUDA with your NVIDIA GPU. Also in this issue: Mr HandS (aka Kyle Rankin) gives us a few tips on using Compiz, Chef Marcel shows you how to get blogging off your plate quicker, Mick Bauer talks about Samba security, Dan Sawyer interviews Cory Doctrow and Doc talks about how information technology can affect democracy and fix the national debt (just kidding about that last part). That and more for your reading pleasure in this month's Linux Journal.

Read this issue