Work the Shell - Analyzing Log Files Redux

If you want an easy way to calculate the amount of data transferred from a log file, you can always look awk-ward.

Last month, we spent a lot of time digging around in the Apache log files to see how you can use basic Linux commands to ascertain some basic statistics about your Web site.

You'll recall that even simple commands, such as head, tail and wc, can help you figure out things like hits per hour and, coupled with some judicious uses of grep, can show you how many graphics you sent, which HTML files were most popular and so on.

More important, utilizing awk at its most rudimentary made it easy to cut out a specific column of information and see that different fields of a standard Apache log file entry have different values. This month, I dig further into the log files and explore how you can utilize more sophisticated scripting to ascertain total bytes transferred for a given time unit.

How Much Data Have You Transferred?

Many ISPs have a maximum allocation for your monthly bandwidth, so it's important to be able to figure out how much data you've sent. Let's examine a single log file entry to see where the bytes-sent field is found:

72.82.44.66 - - [11/Jul/2006:22:15:14 -0600] "GET
↪/individual-entry-javascript.js HTTP/1.1" 200 2374
↪"http://www.askdavetaylor.com/
↪sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html"
↪"Mozilla/4.0 (compatible; MSIE 6.0;
↪Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR
↪2.0.50727)"

There are a lot of different fields of data here, but the one we want is field #10 (counting whitespace-separated fields, as awk does), which in this instance is 2374. Double-check on the filesystem, and you'll find that this is the size, in bytes, of the file sent, whether it be a graphic, HTML file or, as in this case, a JavaScript include file.

Therefore, if you can extract all the field #10 values in the log file and add them up, you can figure out total bytes transferred. Extracting the field is easy; adding it all up is trickier, however:

$ awk '{ print $10 }' access_log

That gets us all the transfer sizes, and we can use awk's capabilities to make adding them up a single-line command too:

$ awk '{ sum += $10 } END { print sum }' access_log

As I have said before, awk has lots of power for those people willing to spend a little time learning its ins and outs. Notice a lazy shortcut here: I'm not initializing the variable sum, just exploiting the fact that an uninitialized variable in awk evaluates to zero when it is used as a number. Not all scripting languages offer this shortcut!
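To see that shortcut in action without touching a real log, here is a minimal sketch that feeds three made-up log lines to the same one-liner. The sample_log function and its contents are hypothetical test data, not entries from the log discussed here. The third line has a "-" in the bytes field, which Apache writes when no body was sent; awk counts it as zero.

```shell
# Three made-up log lines in Apache's combined format (hypothetical data).
sample_log() {
cat <<'EOF'
10.0.0.1 - - [08/Aug/2006:10:00:00 -0600] "GET /index.html HTTP/1.1" 200 1000 "-" "test"
10.0.0.2 - - [08/Aug/2006:10:00:01 -0600] "GET /logo.png HTTP/1.1" 200 2374 "-" "test"
10.0.0.3 - - [08/Aug/2006:10:00:02 -0600] "GET /index.html HTTP/1.1" 304 - "-" "test"
EOF
}

# sum is never initialized; awk treats it as 0 the first time it is added to.
# "sum + 0" forces a numeric 0 to be printed even if no lines are read.
total="$(sample_log | awk '{ sum += $10 } END { print sum + 0 }')"
echo "$total"
```

The "sum + 0" in the END block is a small addition of mine: without it, an empty input prints an empty line instead of 0.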

Anyway, run this little one-liner on an access log, and you can see the total number of bytes transferred: 354406825. I can divide that out by 1024 to figure out kilobytes, megabytes and so on, but that's not useful information until we can figure out one more thing: what length of time is this covering?
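The division itself can be handed to awk too. This is only a sketch: the variable names kb and mb are mine, and the byte count is simply the total quoted above.

```shell
bytes=354406825   # the total reported by the one-liner above

# Divide by 1024 once for kilobytes, twice for megabytes.
kb="$(awk -v b="$bytes" 'BEGIN { printf "%.1f", b / 1024 }')"
mb="$(awk -v b="$bytes" 'BEGIN { printf "%.1f", b / 1024 / 1024 }')"

echo "$bytes bytes = $kb KB = $mb MB"
```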

We can calculate elapsed time by looking at the first and last lines of the log file and calculating the difference, or we simply can use grep to pull one day's worth of data out of the log file and then multiply the result by 30 to get a running average monthly transfer rate.
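For the first approach, a rough sketch of how to see what period a log covers is to pull the timestamp out of the first and last entries. The function name log_span is hypothetical, and it only prints the two timestamps; working out the difference between them is left to the reader.

```shell
# log_span LOGFILE
# Print the timestamps of the first and last entries in a log file.
# Field 4 looks like "[11/Jul/2006:22:15:14"; substr() drops the "[".
log_span() {
  first="$(head -n 1 "$1" | awk '{ print substr($4, 2) }')"
  last="$(tail -n 1 "$1" | awk '{ print substr($4, 2) }')"
  echo "$first to $last"
}
```

Run it as log_span access_log to see the range the file spans.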

Look back at the log file entry; the date is formatted like so: [11/Jul/2006:22:15:14 -0600]. Ignore everything other than the fact that the date format is DD/MMM/YYYY.

I'll test it with 08/Aug/2006 to pull out just that one day's worth of log entries and then feed it into the awk script:

$ grep "08/Aug/2006" access_log | awk '{ sum += $10 }
↪END { print sum }'
78233022

That's a very rough estimate: about 78MB for the day. Multiply that by 30, and we get roughly 2.3GB as that Web site's estimated monthly data transfer.
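One caveat with grep is that it matches the date string anywhere on the line, including inside a referrer URL. A possible refinement, sketched here with a hypothetical function name day_bytes, is to let awk do the matching and look only at the timestamp field:

```shell
# day_bytes DATE LOGFILE
# Sum the bytes field for entries whose timestamp (field 4, which looks
# like "[08/Aug/2006:22:15:14") has DATE immediately after the "[".
day_bytes() {
  awk -v day="$1" 'index($4, day) == 2 { sum += $10 } END { print sum + 0 }' "$2"
}
```

With that in place, day_bytes 08/Aug/2006 access_log prints the day's total, and multiplying the result by 30 gives the same kind of monthly estimate.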

Turning This into a Shell Script

Now, let's turn this into an actual shell script. What I'd like to do is pull out the previous day's data from the log file and automatically multiply it by 30, so any time the command is run, we can get a rough idea of the monthly data transfer rate.

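The first step is to do some date math. I am going to make the rash assumption that you have a date command on your system that allows date math. The examples here use the -v option found in BSD-derived versions of date (FreeBSD and Mac OS X, for example). GNU date, the version on most Linux systems, does not have -v, but does the same job with -d, as in date -d yesterday. If you have neither, well, that's beyond the scope of this piece, though I do talk about it in my book Wicked Cool Shell Scripts (www.intuitive.com/wicked).

BSD date lets you back up arbitrary time units by using the -v option, with modifiers. To back up a day, use -v-1d. For example: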

$ date
Wed Aug  9 01:00:00 GMT 2006
$ date -v-1d
Tue Aug  8 01:00:47 GMT 2006

The other neat trick the date command can do is to print its output in whatever format you need, using the many, many options detailed in the strftime(3) man page. To get DD/MMM/YYYY, we add a format string:

$ date -v-1d +%d/%b/%Y
08/Aug/2006
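Because -v is specific to BSD-style date and GNU date spells the same thing -d yesterday, a script that has to run on both can try one and fall back to the other. This is a sketch rather than a complete portability fix; a date command that supports neither option is not handled. LC_ALL=C is set so that %b produces the English month abbreviations that Apache writes to its logs.

```shell
# Try the BSD syntax first; if this date rejects -v, fall back to GNU's -d.
yesterday="$(LC_ALL=C date -v-1d +%d/%b/%Y 2>/dev/null ||
             LC_ALL=C date -d yesterday +%d/%b/%Y)"

echo "$yesterday"
```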

Now, let's start pulling the script together. The first step in the script is to create this date string so we can use it for the grep call, then go ahead and extract and add up the bytes transferred that day. Next, we can use those values to calculate other values with the expr command, saving everything in variables so we can have some readable output at the end.

Here's my script, with just a little bit of fancy footwork:

#!/bin/sh

LOGFILE="/home/limbo1/logs/intuitive/access_log"

# yesterday's date as DD/MMM/YYYY (BSD date syntax; with
# GNU date, use: date -d yesterday +%d/%b/%Y)

yesterday="$(date -v-1d +%d/%b/%Y)"

# total number of "hits" and "bytes" yesterday:

hits="$(grep -c "$yesterday" "$LOGFILE")"

bytes="$(grep "$yesterday" "$LOGFILE" | awk '{ sum += $10 }
END { print sum + 0 }')"

# no matching entries? stop here rather than divide by zero

if [ "${hits:-0}" -eq 0 ] ; then
  echo "No log entries found for $yesterday" >&2
  exit 1
fi

# now let's play with the data just a bit

avgbytes="$(expr $bytes / $hits )"
monthbytes="$(expr $bytes \* 30 )"

# calculated, let's now display the results:

echo "Calculating transfer data for $yesterday"
echo "Sent $bytes bytes of data across $hits hits"
echo "For an average of $avgbytes bytes/hit"
echo "Estimated monthly transfer rate: $monthbytes"

exit 0

Run the script, and here's the kind of data you'll get (once you point the LOGFILE variable to your own log):

$ ./transferred.sh
Calculating transfer data for 08/Aug/2006
Sent 78233022 bytes of data across 15093 hits
For an average of 5183 bytes/hit
Estimated monthly transfer rate: 2346990660
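The arithmetic behind that output can be checked by hand with expr, using the byte and hit counts from the sample run. Keep in mind that expr does integer math only, so the average is truncated rather than rounded, and that the monthly figure is larger than a signed 32-bit integer can hold, so this sketch assumes an expr that works with 64-bit values.

```shell
bytes=78233022   # bytes sent, from the sample run
hits=15093       # hits, from the sample run

# Integer division: the remainder is discarded.
avgbytes="$(expr "$bytes" / "$hits")"

# The * must be escaped so the shell does not expand it as a wildcard.
monthbytes="$(expr "$bytes" \* 30)"

echo "average: $avgbytes bytes/hit, monthly estimate: $monthbytes bytes"
```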

We've run out of space this month, but next month, we'll go back to this script and add some code to have the transfer rates displayed in megabytes or, if that's still too big, gigabytes. After all, an estimated monthly transfer rate of 2346990660 is a value that only a true geek could love!

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.

