Hack and / - Chopping Logs
If you are a sysadmin, logs can be both a bane and a boon to your existence. On a bad day, a misbehaved program could dump gigabytes of errors into its log file, fill up the disk and light up your pager like a Christmas tree. On a good day, logs show you every clue you need to track down any of a hundred strange system problems. Now, if you manage any Web servers, logs provide even more valuable information in terms of statistics. How many visitors did you get to your main index page today? What spider is hammering your site right now?
Many excellent log-analysis tools exist. Some provide really nifty real-time visualizations of Web traffic, and others run every night and generate manager-friendly reports for you to browse. All of these programs are great, and I suggest you use them, but sometimes you need specific statistics and you need them now. For these on-the-fly statistics, I've developed a common template for a shell one-liner that chops through logs like Paul Bunyan.
What I've found is that although the specific information I need might change a little, the algorithm stays mostly the same. In any log file, each line contains some bit of information I need. I run through the log file, identify that information and keep a running tally that increments each time I see a particular pattern. Finally, I output each piece of information along with its final tally and sort based on the tally.
There are many ways you can do this type of log parsing. Old-school command-line junkies might prefer a nice sed and awk approach. The whipper-snappers out there might pick a nicely formatted Python script. There's nothing at all wrong with those approaches, but I suppose I fall into the middle-child scripting category—I prefer Perl for this kind of text hacking. Maybe it's the power of Perl regular expressions, or maybe it's how easy it is to use Perl hashes, or maybe it's just what I'm most comfortable with, but I just seem to be able to hack out this kind of script much faster in Perl.
Before I give a sample script though, here's a more specific algorithm. The script parses through each line of input and uses a regular expression to match a particular column or other pattern of data on the line. It then uses that pattern as a key in a hash table and increments the value of that key. When it's done accepting input, the script iterates through each key in the hash and outputs the tally for that key and the key itself.
For the test case, I use a general-purpose problem you can try yourself, as long as you have an Apache Web server. I want to find out how many unique IP addresses visited one of my sites on November 1, 2008, and the top ten IPs in terms of hits.
Here's a sample entry from the log (the IP has been changed to protect the innocent):
123.123.12.34 - - [01/Nov/2008:19:34:02 -0700] "GET /talks/pxe/ui/default/iepngfix.htc HTTP/1.1" 404 308 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; InfoPath.2)"
And, here's the one-liner that can parse the file and provide sorted output:
perl -e 'while(<>){ if( m|(^\d+\.\d+\.\d+\.\d+).*?01/Nov/2008| ){ $v{$1}++; } } foreach( keys %v ){ print "$v{$_}\t$_\n"; }' /var/log/apache/access.log | sort -n
When you run this command, you should see output something like the following only with more lines and IPs that aren't fake:
33	99.99.99.99
94	111.111.111.111
138	15.15.15.15
For those of you who know and love both Perl and regular expressions, that one-liner probably isn't too difficult to parse, but for the rest of you, let's go step by step. Sometimes it's easier to go through a one-liner if you see it in a formatted way, so here's the Perl part of the one-liner translated as though it were in a regular file:
#!/usr/bin/perl
while(<>){
    if(m|(^\d+\.\d+\.\d+\.\d+).*?01/Nov/2008|){
        $v{$1}++;
    }
}
foreach( keys %v ){
    print "$v{$_}\t$_\n";
}
First, let's discuss the while loop. Basically, while(<>) iterates over every line of input it receives either through a pipe or as a file argument on the command line. Inside this loop, I set up a regular expression to match and pull out an IP address. The regular expression is probably worth looking at in more detail:
(^\d+\.\d+\.\d+\.\d+)
This section of the regular expression matches the beginning of the line (^), then one or more digits (\d+), and then a dot, another series of digits, another dot, another series of digits, another dot and finally a fourth series of digits. This pattern will match, for instance, 123.123.12.34 at the beginning of a line. I surrounded this part of the regular expression in parentheses. Because this is the first set of parentheses, when Perl matches it, it puts the resultant match into the $1 variable so I can pull it out later.
Now, those of you who know regular expressions know that I cheated here. This regular expression is not very explicit at all. For one, it would match completely invalid IP addresses, such as 999.999.999.999. For another, it even would match any series of four numbers with dots in between, such as 12345.6.7.8910. I chose an overly generic regular expression on purpose to make a point. There are explicit regular expressions that match only valid IP addresses, but those expressions are very long, very complex and, in this case, completely unnecessary.
Because I'm dealing with Apache logs, I am pretty confident, first, that the set of numbers at the beginning of each line is an IP address and not something else, and second, that the IP address Apache logged is reasonably valid. In taking the shortcut, I not only saved on typing, but the resulting regular expression also is easier to read and understand even if you aren't a regex wizard.
After I match the IP, I want to match only log entries from November 01, 2008:
.*?01/Nov/2008
This section performs matches on any number of characters (.*), and with the question mark at the end, it matches only as much as it needs to and no more. Then, it matches the datestamp for November 01, 2008. If I wanted a tally of every day in the log file, I could omit this entire section of the regular expression. Alternatively, if I wanted to match on some other keyword (for instance, when the user performed a GET on a particular file), I could replace or augment the above section with that keyword.
Once I have matched the IP address in a line and have assigned it to $1, I then use it as a key in a hash I call %v here and increment it ($v{$1}++). The power of a hash is that it forces each key to be unique. That means each time I come across a new IP, it will get its own key in the hash and have its value incremented. So, if it's the first time I see the IP, its value will be one. The second time I see the IP, it will increment it to two and so on.
Once I'm done iterating through each line in the file, I then drop to a foreach loop:
foreach( keys %v ){
    print "$v{$_}\t$_\n";
}
Basically, all this does is iterate through every key in the hash and output its value (the number of times I matched that IP in the file) and the IP itself. Note that I didn't sort the values here. I very well could have, since Perl has powerful methods to sort output, but to make the code simpler and more flexible, I opted to pipe the output to the command-line sort command. That way, even if you don't know Perl too well but know the command line, you could tweak arguments in sort to reverse the output or even pipe it further to tail, so you could see only the top ten IPs.
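Here is a rough sketch of those sort tweaks on a handful of invented tally lines. It keeps two lines instead of ten only because the sample is so small; with real output you would use 10. The head variant is my own addition and is not mentioned above.

```shell
# Invented tally lines in the "count<TAB>IP" format the script prints.
tally=$(printf '94\t111.111.111.111\n33\t99.99.99.99\n138\t15.15.15.15\n')

# Ascending numeric sort, then keep the last lines: the busiest IPs end up at the bottom.
bottom=$(printf '%s\n' "$tally" | sort -n | tail -n 2)

# Reverse numeric sort, then keep the first lines: the busiest IPs come first.
top=$(printf '%s\n' "$tally" | sort -rn | head -n 2)

printf 'sort -n | tail:\n%s\n' "$bottom"
printf 'sort -rn | head:\n%s\n' "$top"
```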
If I want to know only the overall number of unique visitors, I just need to count the lines of output, because the script prints exactly one line per unique IP. To do this, I simply pipe the output to wc -l.
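As a quick sketch, counting the lines of some invented tally output looks like this. The three tally lines stand in for what the Perl part of the one-liner would print.

```shell
# Each line of tally output corresponds to one unique IP, so the line count
# is the number of unique visitors.
unique=$(printf '33\t99.99.99.99\n94\t111.111.111.111\n138\t15.15.15.15\n' | wc -l)

echo "unique IPs: $unique"
```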
And, there you have it, a quick-and-dirty one-liner to chop through your logs and tally results. The beauty of using Perl hashes for this is that you can tweak the regular expression to match all sorts of values in the file—not just IP addresses—and tally all sorts of useful information. I've used modified versions of the script to count how many times a particular file was downloaded by unique IPs, and I've even used it to perform statistics on mailq output.
Kyle Rankin is a Senior Systems Administrator in the San Francisco Bay Area and the author of a number of books, including Knoppix Hacks and Ubuntu Hacks for O'Reilly Media. He is currently the president of the North Bay Linux Users' Group.
Comments
Hi,
I think if you replace that regex with a split, it'll be faster.
agn
Typo error in the explanation
"Once I have matched the IP address in a line and have assigned it to $1, I then use it as a key in a hash I call %v here and increment it ($h{$1}++)."
I think that $h should have been $v (as in the script).