Work the Shell - Analyzing Log Files Redux

If you want an easy way to calculate the amount of data transferred from a log file, you can always look awk-ward.

Last month, we spent a lot of time digging around in the Apache log files to see how you can use basic Linux commands to ascertain some basic statistics about your Web site.

You'll recall that even simple commands, such as head, tail and wc, can help you figure out things like hits per hour and, coupled with some judicious uses of grep, can show you how many graphics you sent, which HTML files were most popular and so on.

More important, utilizing awk at its most rudimentary made it easy to cut out a specific column of information and see that different fields of a standard Apache log file entry have different values. This month, I dig further into the log files and explore how you can utilize more sophisticated scripting to ascertain total bytes transferred for a given time unit.

How Much Data Have You Transferred?

Many ISPs have a maximum allocation for your monthly bandwidth, so it's important to be able to figure out how much data you've sent. Let's examine a single log file entry to see where the bytes-sent field is found:

72.82.44.66 - - [11/Jul/2006:22:15:14 -0600] "GET
↪/individual-entry-javascript.js HTTP/1.1" 200 2374
↪"http://www.askdavetaylor.com/
↪sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html"
↪"Mozilla/4.0 (compatible; MSIE 6.0;
↪Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR
↪2.0.50727)"

There are a lot of different fields of data here, but the one we want is field #10, which in this instance is 2374. Double-check on the filesystem, and you'll find out that this is the size, in bytes, of the file sent, whether it be a graphic, HTML file or, as in this case, a JavaScript include file.
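
A quick way to convince yourself of the field numbering is to hand that one entry to awk and print a couple of fields. This is only a sketch: the sample line is the entry above joined back onto a single line, and it assumes awk's default whitespace splitting.

```shell
# The sample log entry from above, joined back onto one line.
line='72.82.44.66 - - [11/Jul/2006:22:15:14 -0600] "GET /individual-entry-javascript.js HTTP/1.1" 200 2374 "http://www.askdavetaylor.com/sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"'

# awk splits on whitespace by default: $9 is the status code, $10 the bytes sent.
fields="$(echo "$line" | awk '{ print $9, $10 }')"
echo "$fields"
```

Because the split is on whitespace, the bracketed date and the quoted request each span more than one field, which is why the byte count lands at field 10.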

Therefore, if you can extract all the field #10 values in the log file and add them up, you can figure out total bytes transferred. Extracting the field is easy; adding it all up is trickier, however:

$ awk '{ print $10 }' access_log

That gets us all the transfer sizes, and we can use awk's capabilities to make adding them up a single-line command too:

$ awk '{ sum += $10 } END { print sum }' access_log

As I have said before, awk has lots of power for those people willing to spend a little time learning its ins and outs. Notice a lazy shortcut here: I'm not initializing the variable sum, just exploiting the fact that an awk variable that hasn't been assigned yet acts as zero the first time it's used in arithmetic. Not all scripting languages offer this shortcut!
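
One wrinkle worth knowing about: depending on the log format, Apache can write a dash instead of a number when a response carries no body (a 304 Not Modified, for instance). awk treats that dash as zero in arithmetic, so the sum still works. Here's a small sketch with made-up log lines; adding 0 in the END block is my own tweak so that an empty log prints 0 rather than a blank line.

```shell
# Three made-up entries: two with a byte count, one 304 with "-" in field 10.
sample='10.0.0.1 - - [08/Aug/2006:10:00:00 -0600] "GET /a.html HTTP/1.1" 200 1000
10.0.0.2 - - [08/Aug/2006:10:00:05 -0600] "GET /b.png HTTP/1.1" 304 -
10.0.0.3 - - [08/Aug/2006:10:00:09 -0600] "GET /c.js HTTP/1.1" 200 500'

# "-" counts as 0 in arithmetic, so only the real byte counts are added.
total="$(printf '%s\n' "$sample" | awk '{ sum += $10 } END { print sum + 0 }')"
echo "$total"

# With no input at all, sum is never touched; "+ 0" still yields a number.
empty="$(printf '' | awk '{ sum += $10 } END { print sum + 0 }')"
echo "$empty"
```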

Anyway, run this little one-liner on an access log, and you can see the total number of bytes transferred: 354406825. I can divide that by 1024 to get kilobytes, by 1024 again to get megabytes and so on, but that's not useful information until we figure out one more thing: what length of time does this cover?

We can calculate elapsed time by looking at the first and last lines of the log file and calculating the difference, or we simply can use grep to pull one day's worth of data out of the log file and then multiply the result by 30 to get a running average monthly transfer rate.

Look back at the log file entry; the date is formatted like so: [11/Jul/2006:22:15:14 -0600]. Ignore everything other than the fact that the date itself is in DD/MMM/YYYY format, with a three-letter month abbreviation.

I'll test it with 08/Aug/2006 to pull out just that one day's worth of log entries and then feed it into the awk script:

$ grep "08/Aug/2006" access_log | awk '{ sum += $10 }
↪END { print sum }'
78233022

Just a very rough estimate: 78MB. Multiply that by 30 and we'll get 2.3GB for that Web site's monthly data transfer rate.
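
If you'd rather see every day in the log at once instead of grepping for one date, awk can keep a separate running total per day. The following is a rough sketch rather than part of the script we're building; the sample log lines and the sample_log variable are invented for illustration.

```shell
# Made-up entries covering two different days.
sample_log='10.0.0.1 - - [07/Aug/2006:23:59:01 -0600] "GET /a.html HTTP/1.1" 200 1000
10.0.0.2 - - [08/Aug/2006:00:00:05 -0600] "GET /b.png HTTP/1.1" 200 250
10.0.0.3 - - [08/Aug/2006:10:00:09 -0600] "GET /c.js HTTP/1.1" 200 500'

# Field 4 looks like "[08/Aug/2006:10:00:09": skip the leading "[" and keep
# the next 11 characters (DD/MMM/YYYY) as the key into the bytes[] array.
per_day="$(printf '%s\n' "$sample_log" |
  awk '{ bytes[substr($4, 2, 11)] += $10 }
       END { for (day in bytes) print day, bytes[day] }' | sort)"
echo "$per_day"
```

The sort at the end is there only because awk makes no promise about the order of a for-in loop; it sorts as text, so it won't put days from different months in calendar order.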

Turning This into a Shell Script

Now, let's turn this into an actual shell script. What I'd like to do is pull out the previous day's data from the log file and automatically multiply it by 30, so any time the command is run, we can get a rough idea of the monthly data transfer rate.

The first step is to do some date math. I am going to make the rash assumption that your date command can do date math, which in practice means either BSD date (FreeBSD, Mac OS X) or GNU date (Linux). If not, well, that's beyond the scope of this piece, though I do talk about it in my book Wicked Cool Shell Scripts (www.intuitive.com/wicked).

BSD date lets you back up arbitrary time units by using the -v option, with modifiers. To back up a day, use -v-1d. GNU date has no -v option; there, the equivalent is -d yesterday (or --date="1 day ago"). The examples here use the BSD form. For example:

$ date
Wed Aug  9 01:00:00 GMT 2006
$ date -v-1d
Tue Aug  8 01:00:47 GMT 2006

The other neat trick the date command can do is to print its output in whatever format you need, using the many, many options detailed in the strftime(3) man page. To get DD/MMM/YYYY, we add a format string:

$ date -v-1d +%d/%b/%Y
08/Aug/2006
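
Since the -v flag is specific to BSD-style date and GNU date spells the same request as -d yesterday, a script can try one and fall back to the other. This is a sketch under that assumption; the function name yesterday_stamp is made up, and LC_ALL=C is there because Apache writes English month abbreviations regardless of your locale.

```shell
# Print yesterday's date as DD/Mon/YYYY, e.g. 08/Aug/2006.
# Tries BSD date (-v-1d) first, then falls back to GNU date (-d yesterday).
yesterday_stamp() {
  LC_ALL=C date -v-1d +%d/%b/%Y 2>/dev/null ||
    LC_ALL=C date -d yesterday +%d/%b/%Y
}

stamp="$(yesterday_stamp)"
echo "$stamp"
```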

Now, let's start pulling the script together. The first step in the script is to create this date string so we can use it for the grep call, then go ahead and extract and total the bytes transferred that day. Next, we can use those values to calculate other values with the expr command, saving everything in variables so we can have some readable output at the end.

Here's my script, with just a little bit of fancy footwork:

#!/bin/sh

LOGFILE="/home/limbo1/logs/intuitive/access_log"

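# Yesterday's date in Apache log format (DD/Mon/YYYY). This is BSD date
# syntax; with GNU date, use: date -d yesterday +%d/%b/%Y
yesterday="$(date -v-1d +%d/%b/%Y)"

# total number of "hits" and "bytes" yesterday:

hits="$(grep -c "$yesterday" "$LOGFILE")"

bytes="$(grep "$yesterday" "$LOGFILE" | awk '{ sum += $10 }
END { print sum + 0 }')"

# stop early if nothing matched, so expr never divides by zero

if [ "${hits:-0}" -eq 0 ] ; then
  echo "No hits found for $yesterday in $LOGFILE" >&2
  exit 1
fi

# now let's play with the data just a bit

avgbytes="$(expr $bytes / $hits )"
monthbytes="$(expr $bytes \* 30 )"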

# calculated, let's now display the results:

echo "Calculating transfer data for $yesterday"
echo "Sent $bytes bytes of data across $hits hits"
echo "For an average of $avgbytes bytes/hit"
echo "Estimated monthly transfer rate: $monthbytes"

exit 0

Run the script, and here's the kind of data you'll get (once you point the LOGFILE variable to your own log):

$ ./transferred.sh
Calculating transfer data for 08/Aug/2006
Sent 78233022 bytes of data across 15093 hits
For an average of 5183 bytes/hit
Estimated monthly transfer rate: 2346990660
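
As a variation, the counting and the arithmetic can all be done in a single awk pass, which also makes it easy to guard against a day with no hits (where expr would otherwise try to divide by zero). This is only a sketch: the function name summarize_day and the sample log are invented, and it uses %.0f for the byte totals because, as far as I know, some older awk implementations mishandle %d once a value passes roughly two billion.

```shell
# summarize_day STAMP LOGFILE
# STAMP is a date such as 08/Aug/2006; matching "[STAMP:" keeps grep from
# picking up the same text elsewhere in a line (a referrer URL, say).
summarize_day() {
  grep -F "[$1:" "$2" | awk '
    { hits++; bytes += $10 }
    END {
      if (hits == 0) { print "no hits"; exit }
      printf "hits=%d bytes=%.0f avg=%.0f month=%.0f\n",
             hits, bytes, int(bytes / hits), bytes * 30
    }'
}

# Try it on a small made-up log written to a temporary file.
tmplog="$(mktemp)"
cat > "$tmplog" <<'EOF'
10.0.0.1 - - [07/Aug/2006:23:59:01 -0600] "GET /a.html HTTP/1.1" 200 1000
10.0.0.2 - - [08/Aug/2006:00:00:05 -0600] "GET /b.png HTTP/1.1" 200 250
10.0.0.3 - - [08/Aug/2006:10:00:09 -0600] "GET /c.js HTTP/1.1" 200 500
EOF

result="$(summarize_day 08/Aug/2006 "$tmplog")"
none="$(summarize_day 09/Aug/2006 "$tmplog")"
rm -f "$tmplog"
echo "$result"
echo "$none"
```

Reading the log once instead of twice matters little on a small site, but on a large log it halves the work.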

We've run out of space this month, but next month, we'll go back to this script and add some code to have the transfer rates displayed in megabytes or, if that's still too big, gigabytes. After all, an estimated monthly transfer rate of 2346990660 is a value that only a true geek could love!

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
