Work the Shell - Analyzing Log Files Redux
Last month, we spent a lot of time digging around in the Apache log files to see how you can use basic Linux commands to ascertain some basic statistics about your Web site.
You'll recall that even simple commands, such as head, tail and wc can help you figure out things like hits per hour and, coupled with some judicious uses of grep, can show you how many graphics you sent, which HTML files were most popular and so on.
More important, utilizing awk at its most rudimentary made it easy to cut out a specific column of information and see that different fields of a standard Apache log file entry have different values. This month, I dig further into the log files and explore how you can utilize more sophisticated scripting to ascertain total bytes transferred for a given time unit.
Many ISPs have a maximum allocation for your monthly bandwidth, so it's important to be able to figure out how much data you've sent. Let's examine a single log file entry to see where the bytes-sent field is found:
72.82.44.66 - - [11/Jul/2006:22:15:14 -0600] "GET ↪/individual-entry-javascript.js HTTP/1.1" 200 2374 ↪"http://www.askdavetaylor.com/ ↪sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html" ↪"Mozilla/4.0 (compatible; MSIE 6.0; ↪Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR ↪2.0.50727)"
There are a lot of different fields of data here, but the one we want is field #10, which in this instance is 2374. Double-check on the filesystem, and you'll find out that this is the size, in bytes, of the file sent, whether it be a graphic, HTML file or, as in this case, a JavaScript include file.
Therefore, if you can extract all the field #10 values in the log file and summarize them, you can figure out total bytes transferred. Extracting the field is easy; adding it all up is trickier, however:
$ awk '{ print $10 }' access_log
That gets us all the transfer sizes, and we can use awk's capabilities to make summarizing a single-line command too:
$ awk '{ sum += $10 } END { print sum }' access_log
As I have said before, awk has lots of power for those people willing to spend a little time learning its ins and outs. Notice a lazy shortcut here: I'm not initializing the variable sum, just exploiting the fact that variables, when first allocated in awk, are set to zero. Not all scripting languages offer this shortcut!
Anyway, run this little one-liner on an access log, and you can see the total number of bytes transferred: 354406825. I can divide that out by 1024 to figure out kilobytes, megabytes and so on, but that's not useful information until we can figure out one more thing: what length of time is this covering?
We can calculate elapsed time by looking at the first and last lines of the log file and calculating the difference, or we simply can use grep to pull one day's worth of data out of the log file and then multiply the result by 30 to get a running average monthly transfer rate.
Look back at the log file entry; the date is formatted like so: - [11/Jul/2006:22:15:14 -0600]. Ignore everything other than the fact that the date format is DD/MMM/YYYY.
I'll test it with 08/Aug/2006 to pull out just that one day's worth of log entries and then feed it into the awk script:
$ grep "08/Aug/2006" access_log | awk '{ sum += $10 }
↪END { print sum }'
78233022
Just a very rough estimate: 78MB. Multiply that by 30 and we'll get 2.3GB for that Web site's monthly data transfer rate.
Now, let's turn this into an actual shell script. What I'd like to do is pull out the previous day's data from the log file and automatically multiply it by 30, so any time the command is run, we can get a rough idea of the monthly data transfer rate.
The first step is to do some date math. I am going to make the rash assumption that you have GNU date on your system, which allows date math. If not, well, that's beyond the scope of this piece, though I do talk about it in my book Wicked Cool Shell Scripts (www.intuitive.com/wicked).
GNU date lets you back up arbitrary time units by using the -v option, with modifiers. To back up a day, use -v-1d. For example:
$ date Wed Aug 9 01:00:00 GMT 2006 $ date -v-1d Tue Aug 8 01:00:47 GMT 2006
The other neat trick the date command can do is to print its output in whatever format you need, using the many, many options detailed in the strftime(3) man page. To get DD/MMM/YYYY, we add a format string:
$ date -v-1d +%d/%b/%Y 08/Aug/2006
Now, let's start pulling the script together. The first step in the script is to create this date string so we can use it for the grep call, then go ahead and extract and summarize the bytes transferred that day. Next, we can use those values to calculate other values with the expr command, saving everything in variables so we can have some readable output at the end.
Here's my script, with just a little bit of fancy footwork:
#!/bin/sh
LOGFILE="/home/limbo1/logs/intuitive/access_log"
yesterday="$(date -v-1d +%d/%b/%Y)"
# total number of "hits" and "bytes" yesterday:
hits="$(grep "$yesterday" $LOGFILE | wc -l)"
bytes="$(grep "$yesterday" $LOGFILE | awk '{ sum += $10 }
END { print sum }')"
# now let's play with the data just a bit
avgbytes="$(expr $bytes / $hits )"
monthbytes="$(expr $bytes \* 30 )"
# calculated, let's now display the results:
echo "Calculating transfer data for $yesterday"
echo "Sent $bytes bytes of data across $hits hits"
echo "For an average of $avgbytes bytes/hit"
echo "Estimated monthly transfer rate: $monthbytes"
exit 0
Run the script, and here's the kind of data you'll get (once you point the LOGFILE variable to your own log):
$ ./transferred.sh Calculating transfer data for 08/Aug/2006 Sent 78233022 bytes of data across 15093 hits For an average of 5183 bytes/hit Estimated monthly transfer rate: 2346990660
We've run out of space this month, but next month, we'll go back to this script and add some code to have the transfer rates displayed in megabytes or, if that's still too big, gigabytes. After all, an estimated monthly transfer rate of 2346990660 is a value that only a true geek could love!
Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.
Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Speed Up Your Web Site with Varnish | Jun 19, 2013 |
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
- Speed Up Your Web Site with Varnish
- Containers—Not Virtual Machines—Are the Future Cloud
- Linux Systems Administrator
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- RSS Feeds
- Senior Perl Developer
- Technical Support Rep
- Non-Linux FOSS: libnotify, OS X Style
- UX Designer
- Web & UI Developer (JavaScript & j Query)
- yea
2 min 34 sec ago - Reply to comment | Linux Journal
24 min 53 sec ago - Android has been dominating
29 min 25 sec ago - It is quiet helping
3 hours 15 min ago - Technology
3 hours 32 min ago - Reachli - Amplifying your
4 hours 48 min ago - excellent
5 hours 37 min ago - good point!
5 hours 40 min ago - Varnish works!
5 hours 49 min ago - Reply to comment | Linux Journal
6 hours 19 min ago
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




Comments
Fantastic
Fantastic thank you.