Work the Shell - Analyzing Log Files Redux
Last month, we spent a lot of time digging around in the Apache log files to see how you can use basic Linux commands to ascertain some basic statistics about your Web site.
You'll recall that even simple commands, such as head, tail and wc can help you figure out things like hits per hour and, coupled with some judicious uses of grep, can show you how many graphics you sent, which HTML files were most popular and so on.
More important, utilizing awk at its most rudimentary made it easy to cut out a specific column of information and see that different fields of a standard Apache log file entry have different values. This month, I dig further into the log files and explore how you can utilize more sophisticated scripting to ascertain total bytes transferred for a given time unit.
Many ISPs have a maximum allocation for your monthly bandwidth, so it's important to be able to figure out how much data you've sent. Let's examine a single log file entry to see where the bytes-sent field is found:
72.82.44.66 - - [11/Jul/2006:22:15:14 -0600] "GET ↪/individual-entry-javascript.js HTTP/1.1" 200 2374 ↪"http://www.askdavetaylor.com/ ↪sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html" ↪"Mozilla/4.0 (compatible; MSIE 6.0; ↪Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR ↪2.0.50727)"
There are a lot of different fields of data here, but the one we want is field #10, which in this instance is 2374. Double-check on the filesystem, and you'll find out that this is the size, in bytes, of the file sent, whether it be a graphic, HTML file or, as in this case, a JavaScript include file.
Therefore, if you can extract all the field #10 values in the log file and summarize them, you can figure out total bytes transferred. Extracting the field is easy; adding it all up is trickier, however:
$ awk '{ print $10 }' access_log
That gets us all the transfer sizes, and we can use awk's capabilities to make summarizing a single-line command too:
$ awk '{ sum += $10 } END { print sum }' access_log
As I have said before, awk has lots of power for those people willing to spend a little time learning its ins and outs. Notice a lazy shortcut here: I'm not initializing the variable sum, just exploiting the fact that variables, when first allocated in awk, are set to zero. Not all scripting languages offer this shortcut!
Anyway, run this little one-liner on an access log, and you can see the total number of bytes transferred: 354406825. I can divide that out by 1024 to figure out kilobytes, megabytes and so on, but that's not useful information until we can figure out one more thing: what length of time is this covering?
We can calculate elapsed time by looking at the first and last lines of the log file and calculating the difference, or we simply can use grep to pull one day's worth of data out of the log file and then multiply the result by 30 to get a running average monthly transfer rate.
Look back at the log file entry; the date is formatted like so: - [11/Jul/2006:22:15:14 -0600]. Ignore everything other than the fact that the date format is DD/MMM/YYYY.
I'll test it with 08/Aug/2006 to pull out just that one day's worth of log entries and then feed it into the awk script:
$ grep "08/Aug/2006" access_log | awk '{ sum += $10 }
↪END { print sum }'
78233022
Just a very rough estimate: 78MB. Multiply that by 30 and we'll get 2.3GB for that Web site's monthly data transfer rate.
Now, let's turn this into an actual shell script. What I'd like to do is pull out the previous day's data from the log file and automatically multiply it by 30, so any time the command is run, we can get a rough idea of the monthly data transfer rate.
The first step is to do some date math. I am going to make the rash assumption that you have GNU date on your system, which allows date math. If not, well, that's beyond the scope of this piece, though I do talk about it in my book Wicked Cool Shell Scripts (www.intuitive.com/wicked).
GNU date lets you back up arbitrary time units by using the -v option, with modifiers. To back up a day, use -v-1d. For example:
$ date Wed Aug 9 01:00:00 GMT 2006 $ date -v-1d Tue Aug 8 01:00:47 GMT 2006
The other neat trick the date command can do is to print its output in whatever format you need, using the many, many options detailed in the strftime(3) man page. To get DD/MMM/YYYY, we add a format string:
$ date -v-1d +%d/%b/%Y 08/Aug/2006
Now, let's start pulling the script together. The first step in the script is to create this date string so we can use it for the grep call, then go ahead and extract and summarize the bytes transferred that day. Next, we can use those values to calculate other values with the expr command, saving everything in variables so we can have some readable output at the end.
Here's my script, with just a little bit of fancy footwork:
#!/bin/sh
LOGFILE="/home/limbo1/logs/intuitive/access_log"
yesterday="$(date -v-1d +%d/%b/%Y)"
# total number of "hits" and "bytes" yesterday:
hits="$(grep "$yesterday" $LOGFILE | wc -l)"
bytes="$(grep "$yesterday" $LOGFILE | awk '{ sum += $10 }
END { print sum }')"
# now let's play with the data just a bit
avgbytes="$(expr $bytes / $hits )"
monthbytes="$(expr $bytes \* 30 )"
# calculated, let's now display the results:
echo "Calculating transfer data for $yesterday"
echo "Sent $bytes bytes of data across $hits hits"
echo "For an average of $avgbytes bytes/hit"
echo "Estimated monthly transfer rate: $monthbytes"
exit 0
Run the script, and here's the kind of data you'll get (once you point the LOGFILE variable to your own log):
$ ./transferred.sh Calculating transfer data for 08/Aug/2006 Sent 78233022 bytes of data across 15093 hits For an average of 5183 bytes/hit Estimated monthly transfer rate: 2346990660
We've run out of space this month, but next month, we'll go back to this script and add some code to have the transfer rates displayed in megabytes or, if that's still too big, gigabytes. After all, an estimated monthly transfer rate of 2346990660 is a value that only a true geek could love!
Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.
Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
| Dart: a New Web Programming Experience | May 07, 2013 |
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- RSS Feeds
- What's the tweeting protocol?
- Trying to Tame the Tablet
- Validate an E-Mail Address with PHP, the Right Way
- New Products
- Drupal is an Awesome CMS and a Crappy development framework
41 min 27 sec ago - IT industry leaders
3 hours 3 min ago - Reply to comment | Linux Journal
19 hours 52 min ago - Reply to comment | Linux Journal
22 hours 24 min ago - Reply to comment | Linux Journal
23 hours 42 min ago - great post
1 day 16 min ago - Google Docs
1 day 39 min ago - Reply to comment | Linux Journal
1 day 5 hours ago - Reply to comment | Linux Journal
1 day 6 hours ago - Web Hosting IQ
1 day 7 hours ago
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.




Comments
Fantastic
Fantastic thank you.