Counting with uniq

Shell experts get the most out of simple combinations of standard utilities. Learn one of the most useful examples of using two everyday commands together.

One of the truly great qualities of UNIX-like operating systems is their ability to combine multiple commands. By combining commands, you can perform a wide array of tasks, limited only by your cleverness and imagination.

Although the number of potential command combinations is huge, my experience has shown that certain combinations come in handy more often than others. One I turn to frequently is combining the sort and uniq commands to count occurrences of arbitrary strings in a file. This is a great trick for new Linux users and one you never will regret adding to your skill set.

A Simple Example

Let's look at a simple example first to highlight the fundamental concepts. Given a file called fruit with the following contents:

apples
oranges
apples

you can discover how many times each word appears, as follows:

% sort fruit | uniq -c
  2 apples
  1 oranges

What's happening here? First, sort fruit sorts the file. The result ordinarily would go to the standard output (in this case, your terminal), but note the | (pipe) that follows. That pipe directs the output of sort fruit to the input of the next command, uniq -c, which prints each line preceded by the number of times it occurred.
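The sort step matters because uniq collapses only adjacent duplicate lines. A quick experiment with the same fruit file makes this visible:

```shell
# Recreate the sample file, then compare uniq with and without sort.
printf 'apples\noranges\napples\n' > fruit

# uniq -c alone merges only *adjacent* duplicate lines, so the two
# non-adjacent "apples" lines are counted separately:
uniq -c fruit
#   1 apples
#   1 oranges
#   1 apples

# Sorting first groups identical lines together, giving true totals:
sort fruit | uniq -c
#   2 apples
#   1 oranges
```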

A More-Advanced Example

It's not obvious from the simple example why this is so powerful. However, it becomes clearer when the file at hand is, for instance, an Apache Web server access log with hundreds of thousands of lines. The access log contains a wealth of valuable information. By using sort and uniq, you can do a surprising amount of simple data analysis on the fly from the command line. Imagine a coworker desperately needs to know the ten IP addresses that requested a PHP script called foo.php most often in January. Moments later, you have the information she needs. How did you derive this information so fast? Let's look at the solution step by step.

For the sake of this exercise, assume your server logs in the following format:

192.168.1.100 - - [31/Jan/2004:23:25:54 -0800] "GET /index.php HTTP/1.1" 200 7741

The log contains data from many months, not only January 2004, so the first order of business is to use grep to limit our data set:

% grep Jan/2004 access.log

We then look for foo.php in the output:

% grep Jan/2004 access.log | grep foo.php

Because we want to count occurrences of IP addresses, we should limit the output to that field alone, like so:

% grep Jan/2004 access.log | grep foo.php | awk '{ print $1 }'

A full discussion of awk is beyond the scope of this article. For now, you need to understand only that awk '{ print $1 }' prints the first whitespace-separated field of each line. In this case, that field is the IP address.
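To see what that awk step does in isolation, you can feed it a single log line; here it is the sample line from above, stored in a shell variable:

```shell
# The sample log line from the article, stored in a shell variable:
line='192.168.1.100 - - [31/Jan/2004:23:25:54 -0800] "GET /index.php HTTP/1.1" 200 7741'

# awk splits each line on whitespace by default; $1 is the first
# field, which in this log format is the client IP address:
printf '%s\n' "$line" | awk '{ print $1 }'
# 192.168.1.100
```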

Now, at last, we can apply sort and uniq. Here's the final command pipeline:


% grep Jan/2004 access.log | grep foo.php | \
awk '{ print $1 }' | sort -n | uniq -c | \
sort -rn | head

The backslash (\) indicates the command is continued on the next line. You can type the command as one long line without the backslashes or use them to break up a long pipeline into multiple lines on the screen.

You may have noticed that, unlike in our simple example, the first sort is a numeric sort (sort -n). A plain sort would group the lines just as well, because uniq -c requires only that identical lines be adjacent; the numeric sort simply orders the addresses by their leading digits.

The other difference is the inclusion of | sort -rn | head. The sort -rn command sorts the output of uniq -c in reverse numeric order, so the most-requested addresses come first. The head command then prints only the first ten lines, which is exactly the top-ten list we want:

43 12.175.0.35
16 216.88.158.142
12 66.77.73.85
 9 66.127.251.42
 7 66.196.72.78
 7 66.196.72.28
 7 66.196.72.10
 7 66.147.154.3
 7 192.168.1.1
 6 66.196.72.64

You can change the functionality of this pipeline by changing any of its component commands. For instance, to print the bottom ten instead of the top ten, you need only change head to tail.
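If you'd like to experiment without a real access log, you can fabricate a few lines in the same format and run the pipeline against them. The addresses and timestamps below are invented purely for illustration:

```shell
# Build a tiny synthetic access log; the IPs and dates are made up.
printf '%s\n' \
  '10.0.0.1 - - [05/Jan/2004:10:00:00 -0800] "GET /foo.php HTTP/1.1" 200 100' \
  '10.0.0.2 - - [05/Jan/2004:10:00:01 -0800] "GET /foo.php HTTP/1.1" 200 100' \
  '10.0.0.1 - - [05/Jan/2004:10:00:02 -0800] "GET /foo.php HTTP/1.1" 200 100' \
  '10.0.0.3 - - [05/Feb/2004:10:00:00 -0800] "GET /foo.php HTTP/1.1" 200 100' \
  > access.log

# The same pipeline as in the article; only the January requests count:
grep Jan/2004 access.log | grep foo.php | \
awk '{ print $1 }' | sort | uniq -c | \
sort -rn | head
#   2 10.0.0.1
#   1 10.0.0.2
```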

Conclusion

Piping data through sort and uniq is exceedingly handy, and I hope reading about it whets your appetite for learning more about pipelines. For more information about any of the commands used in these examples, refer to the corresponding man pages.

Brian Tanaka has been a UNIX system administrator since 1994 and has worked for companies such as The Well, SGI, Intuit and RealNetworks. He can be reached at btanaka@well.com.

Comments

sort|uniq can be replaced by awk

wgshi writes:

How about we just do this

grep Jan/2004 access.log | grep foo.php |
awk '{a[$1]++}END{for(i in a)print i, a[i]}'

Re: sort|uniq can be replaced by awk

Kris replies:

He's right. It's trickier to remember, but way faster.
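One caveat worth noting: awk's for (i in a) loop makes no ordering guarantee, so the associative-array approach prints its counts in arbitrary order. For a ranked top-ten list, you still want to pipe it through sort -rn | head. A small sketch on arbitrary sample words:

```shell
# Count occurrences with an awk associative array, then rank them.
# The input words here are arbitrary sample data.
printf 'a\nb\na\na\nb\nc\n' | \
awk '{ count[$0]++ } END { for (k in count) print count[k], k }' | \
sort -rn | head
# 3 a
# 2 b
# 1 c
```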
