Counting with uniq
One of the truly great qualities of UNIX-like operating systems is their ability to combine multiple commands. By combining commands, you can perform a wide array of tasks, limited only by your cleverness and imagination.
Although the number of potential command combinations is huge, my experience has shown that certain combinations come in handy more often than others. One I turn to frequently is combining the sort and uniq commands to count occurrences of arbitrary strings in a file. This is a great trick for new Linux users and one you never will regret adding to your skill set.
Let's look at a simple example first to highlight the fundamental concepts. Given a file called fruit with the following contents:
apples
oranges
apples
you can discover how many times each word appears, as follows:
% sort fruit | uniq -c
      2 apples
      1 oranges
What's happening here? First, sort fruit sorts the file. The result ordinarily would go to the standard output (in this case, your terminal), but note the | (pipe) that follows. That pipe directs the output of sort fruit to the input of the next command, uniq -c, which prints each distinct line preceded by the number of times it occurred. The sort step is essential: uniq compares only adjacent lines, so identical lines must sit next to each other before they can be counted together.
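To see why the sort step matters, the example can be reproduced in a scratch directory. This is a minimal sketch; the variable names (workdir, unsorted_counts, sorted_counts) are chosen here for illustration only.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the three-line fruit file from the example.
printf '%s\n' apples oranges apples > fruit

# Without sort, the two "apples" lines are not adjacent,
# so uniq reports three separate lines, each with a count of 1.
unsorted_counts=$(uniq -c fruit)

# With sort, identical lines become neighbours and are counted together.
sorted_counts=$(sort fruit | uniq -c)

printf 'uniq -c alone:\n%s\n\n' "$unsorted_counts"
printf 'sort | uniq -c:\n%s\n' "$sorted_counts"
```

The amount of padding in front of each count differs between uniq implementations, so scripts that parse this output should split on whitespace rather than rely on fixed columns.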
It's not obvious from the simple example why this is so powerful. However, it becomes clearer when the file at hand is, for instance, an Apache Web server access log with hundreds of thousands of lines. The access log contains a wealth of valuable information. By using sort and uniq, you can do a surprising amount of simple data analysis on the fly from the command line. Imagine a coworker desperately needs to know the ten IP addresses that requested a PHP script called foo.php most often in January. Moments later, you have the information she needs. How did you derive this information so fast? Let's look at the solution step by step.
For the sake of this exercise, assume your server is logging in the following format:
192.168.1.100 - - [31/Jan/2004:23:25:54 -0800] "GET /index.php HTTP/1.1" 200 7741
The log contains data from many months, not only January 2004, so the first order of business is to use grep to limit our data set:
% grep Jan/2004 access.log
We then look for foo.php in the output:
% grep Jan/2004 access.log | grep foo.php
If we are to count occurrences of IP addresses, we had better limit our output to only that one field, like so:
% grep Jan/2004 access.log | grep foo.php | awk '{ print $1 }'
A discussion of awk is beyond the scope of this article. For now, you need to understand only that awk '{ print $1 }' prints the first whitespace-delimited field of each line. In this case, it's the IP address.
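As a quick check of how awk splits a line into fields, the sample log line from above can be fed through it directly. This is only a sketch; the variable names are illustrative, and the field number used for the request path ($7) holds for this particular log format and may differ for others.

```shell
# The sample log line from the article, stored in a variable for illustration.
log_line='192.168.1.100 - - [31/Jan/2004:23:25:54 -0800] "GET /index.php HTTP/1.1" 200 7741'

# $1 is the first whitespace-delimited field: the client IP address.
first_field=$(printf '%s\n' "$log_line" | awk '{ print $1 }')

# $7 is the requested path in this particular log format.
request_path=$(printf '%s\n' "$log_line" | awk '{ print $7 }')

printf 'ip=%s path=%s\n' "$first_field" "$request_path"
# prints: ip=192.168.1.100 path=/index.php
```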
Now, at last, we can apply sort and uniq. Here's the final command pipeline:
% grep Jan/2004 access.log | grep foo.php | \
awk '{ print $1 }' | sort -n | uniq -c | \
sort -rn | head
The backslash (\) indicates the command is continued on the next line. You can type the command as one long line without the backslashes or use them to break up a long pipeline into multiple lines on the screen.
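You may have noticed that, unlike in our simple example, the first sort is a numeric sort (sort -n). Strictly speaking, uniq -c needs only identical lines to be adjacent, so a plain sort would group the addresses just as well. Keep in mind that sort -n compares only the leading numeric portion of each line, so it does not order IP addresses octet by octet.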
The other difference is the inclusion of | sort -rn | head. The sort -rn command sorts the output of uniq -c in reverse numeric order. The head command prints only the first ten lines of output. The first ten lines are perfect for the task at hand because we want only the top ten:
     43 12.175.0.35
     16 216.88.158.142
     12 66.77.73.85
      9 66.127.251.42
      7 66.196.72.78
      7 66.196.72.28
      7 66.196.72.10
      7 66.147.154.3
      7 192.168.1.1
      6 66.196.72.64
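If you have no access log at hand, the pipeline can be tried against a tiny sample file. The sketch below builds one in a scratch directory; every log entry in it is invented for illustration, and the variable names (workdir, log, top) are assumptions of this example.

```shell
# Build a small, made-up access log in a scratch directory.
workdir=$(mktemp -d)
log="$workdir/access.log"
{
  printf '%s\n' '10.0.0.1 - - [05/Jan/2004:10:00:00 -0800] "GET /foo.php HTTP/1.1" 200 512'
  printf '%s\n' '10.0.0.2 - - [06/Jan/2004:11:00:00 -0800] "GET /foo.php HTTP/1.1" 200 512'
  printf '%s\n' '10.0.0.1 - - [07/Jan/2004:12:00:00 -0800] "GET /foo.php HTTP/1.1" 200 512'
  printf '%s\n' '10.0.0.1 - - [08/Jan/2004:13:00:00 -0800] "GET /index.php HTTP/1.1" 200 7741'
  printf '%s\n' '10.0.0.3 - - [02/Feb/2004:09:00:00 -0800] "GET /foo.php HTTP/1.1" 200 512'
} > "$log"

# Same pipeline as in the article, run against the sample file.
top=$(grep Jan/2004 "$log" | grep foo.php | \
  awk '{ print $1 }' | sort -n | uniq -c | \
  sort -rn | head)

printf '%s\n' "$top"
```

Only the January requests for foo.php are counted, so 10.0.0.1 is reported with a count of 2 and 10.0.0.2 with a count of 1; the index.php request and the February entry are filtered out by the two grep commands.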
You can change the behavior of this pipeline by changing any of the component commands. For instance, if you want to print the bottom ten instead of the top ten, you need only change head to tail.
Brian Tanaka has been a UNIX system administrator since 1994 and has worked for companies such as The Well, SGI, Intuit and RealNetworks. He can be reached at btanaka@well.com.
Comments
sort|uniq can be replaced by awk
How about we just do this
grep Jan/2004 access.log | grep foo.php |
awk '{a[$1]++}END{for(i in a)print i, a[i]}'
Re: sort|uniq can be replaced by awk
He's right. It's trickier to remember, but way faster.
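Note that the awk one-liner above prints each address followed by its count, in no particular order, so it still needs a final sort to reproduce the top-ten list. A minimal sketch of that variation follows, run against invented log entries; the file, the addresses and the variable names are assumptions of this example, and printing the count first is a choice made here so that sort -rn | head can be reused unchanged.

```shell
# Build a small, made-up access log in a scratch directory.
workdir=$(mktemp -d)
log="$workdir/access.log"
{
  printf '%s\n' '10.0.0.1 - - [05/Jan/2004:10:00:00 -0800] "GET /foo.php HTTP/1.1" 200 512'
  printf '%s\n' '10.0.0.2 - - [06/Jan/2004:11:00:00 -0800] "GET /foo.php HTTP/1.1" 200 512'
  printf '%s\n' '10.0.0.1 - - [07/Jan/2004:12:00:00 -0800] "GET /foo.php HTTP/1.1" 200 512'
} > "$log"

# awk counts each address in an associative array, replacing sort | uniq -c.
# Printing the count first lets sort -rn | head pick the top ten as before.
top_awk=$(grep Jan/2004 "$log" | grep foo.php |
  awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' |
  sort -rn | head)

printf '%s\n' "$top_awk"
```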