Hack and / - Linux Troubleshooting, Part I: High Load

What do you do when you get an alert that your system load is high? Tracking down the cause of high load just takes some time, some experience and a few Linux tools.

This column is the first in a series of columns dedicated to one of my favorite subjects: troubleshooting. I'm a systems administrator during the day, and although I enjoy many aspects of my job, it's hard to beat the adrenaline rush of tracking down a complex server problem when downtime is being measured in dollars. Although it's true that there are about as many different reasons for downtime as there are Linux text editors, and just as many approaches to troubleshooting, over the years, I've found I perform the same sorts of steps to isolate a problem. Because my column is generally aimed more at tips and tricks and less at philosophy and design, I'm not going to talk much about overall approaches to problem solving. Instead, in this series I describe some general classes of problems you might find on a Linux system, and then I discuss how to use common tools, most of which probably are already on your system, to isolate and resolve each class of problem.

For this first column, I start with one of the most common problems you will run into on a Linux system. No, it's not getting printing to work. I'm talking about a sluggish server that might have high load. Before I explain how to diagnose and fix high load though, let's take a step back and discuss what load means on a Linux machine and how to know when it's high.

Uptime and Load

When administrators mention high load, generally they are talking about the load average. When I diagnose why a server is slow, the first command I run when I log in to the system is uptime:

$ uptime
 18:30:35 up 365 days, 5:29, 2 users, load average: 1.37, 10.15, 8.10

As you can see, it's my server's uptime birthday today. You also can see that my load average is 1.37, 10.15, 8.10. These numbers represent my average system load during the last 1, 5 and 15 minutes, respectively. Technically speaking, the load average represents the average number of processes that have to wait for CPU time during the last 1, 5 or 15 minutes. For instance, if I have a current load of 0, the system is completely idle. If I have a load of 1, the CPU is busy enough that one process is having to wait for CPU time. If I do have a load of 1 and then spawn another process that normally would tie up a CPU, my load should go to 2. With a load average, the system will give you a good idea of how consistently busy it has been over the past 1, 5 and 15 minutes.
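If you want to work with these three numbers in a script, one option is to pull them out of the uptime line. What follows is only a minimal sketch that assumes the output format shown above; the function name load_averages is made up for illustration. On Linux, the same three values also appear as the first three fields of /proc/loadavg.

```shell
#!/bin/sh
# load_averages: hypothetical helper that prints the 1-, 5- and 15-minute
# load averages found in one line of uptime output, separated by spaces.
# Assumes the "load average: a, b, c" format shown above.
load_averages() {
    # Keep only what follows "load average:" and drop the commas.
    printf '%s\n' "$1" | sed -e 's/.*load average: *//' -e 's/,//g'
}

# Sample line copied from the uptime output above.
sample=' 18:30:35 up 365 days, 5:29, 2 users, load average: 1.37, 10.15, 8.10'
load_averages "$sample"    # prints: 1.37 10.15 8.10
```

On a live system, the same function could be called as load_averages "$(uptime)".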

Another important thing to keep in mind when you look at a load average is that it isn't normalized according to the number of CPUs on your system. Generally speaking, a consistent load of 1 means one CPU on the system is tied up. In simplified terms, this means that a single-CPU system with a load of 1 is roughly as busy as a four-CPU system with a load of 4. So in my above example, let's assume that I have a single-CPU system. If I were to log in and see the above load average, I'd probably assume that the server had pretty high load (8.10) during the last 15 minutes that spiked around 5 minutes ago (10.15), but recently, at least during the last 1 minute, the load has dropped significantly. If I saw this, I might even assume that the real cause of the load has subsided. On the other hand, if the load averages were 20.68, 5.01, 1.03, I would conclude that the high load had likely started in the last 5 minutes and was getting worse.
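The two rules of thumb in that paragraph, dividing load by the number of CPUs and comparing the 1-minute figure against the 15-minute figure, can be sketched in a few lines of shell. This is an illustration only: the helper names per_cpu_load and load_trend are made up, and the simple greater-than/less-than comparison is my assumption about how to express "getting worse" or "subsiding", not a standard definition.

```shell
#!/bin/sh
# per_cpu_load: hypothetical helper; divides a load value by a CPU count
# so that systems with different numbers of CPUs can be compared.
per_cpu_load() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# load_trend: hypothetical helper; compares the 1-minute average with the
# 15-minute average to say whether load is climbing or dropping.
load_trend() {
    awk -v one="$1" -v fifteen="$2" 'BEGIN {
        if (one > fifteen) print "rising"
        else if (one < fifteen) print "falling"
        else print "steady"
    }'
}

per_cpu_load 4 4         # four-CPU system with a load of 4; prints: 1.00
load_trend 1.37 8.10     # first example above; prints: falling
load_trend 20.68 1.03    # second example above; prints: rising
```

On a system with GNU coreutils, the CPU count could come from nproc, for example per_cpu_load 8.10 "$(nproc)".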

How High Is High?

After you understand what load average means, the next logical question is “What load average is good and what is bad?” The answer to that is “It depends.” You see, a lot of different things can cause load to be high, each of which affects performance differently. One server might have a load of 50 and still be pretty responsive, while another server might have a load of 10 and take forever to log in to. I've had servers with load averages in the hundreds that were certainly slow, but didn't crash, and I had one server that consistently had a load of 50 that was still pretty responsive and stayed up for years.

What really matters when you troubleshoot a system with high load is why the load is high. When you start to diagnose high load, you find that most load seems to fall into three categories: CPU-bound load, load caused by out-of-memory issues and I/O-bound load. I explain each of these categories in detail below and how to use tools like top and iostat to isolate the root cause.

top

If the first tool I use when I log in to a sluggish system is uptime, the second tool I use is top. The great thing about top is that it's available for all major Linux systems, and it provides a lot of useful information in a single screen. top is a quite complex tool with many options that could warrant its own article. For this column, I stick to how to interpret its output to diagnose high load.

To use top, simply type top on the command line. By default, top will run in interactive mode and update its output every few seconds. Listing 1 shows sample top output from a terminal.
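If you would rather capture a single snapshot in a script or a log than watch the interactive display, the procps version of top found on most Linux distributions also has a batch mode. Here is a small sketch that assumes that version of top; the function name top_snapshot and its default of 15 lines are made up for illustration.

```shell
#!/bin/sh
# top_snapshot: hypothetical helper; prints the first N lines (default 15)
# of a single non-interactive top run.
#   -b    batch mode: plain text output, suitable for pipes and files
#   -n 1  exit after one update instead of refreshing forever
top_snapshot() {
    top -b -n 1 | head -n "${1:-15}"
}
```

Calling top_snapshot 12, for instance, would print the summary area at the top of the screen plus the first few processes.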

______________________

Kyle Rankin is a director of engineering operations in the San Francisco Bay Area, the author of a number of books including DevOps Troubleshooting and The Official Ubuntu Server Book, and is a columnist for Linux Journal.

Comments


Load averages

Chris Sardius

I manage some quad-core Linux systems and get a spike in load averages every now and then. I'm looking forward to deploying a permanent fix for this.

uptime

slashbob

Well, my Ubuntu 10.04 uptime only lasted about 35 days and then there was a kernel security update so I had to reboot.

From what I've seen Debian/Ubuntu and CentOS have kernel security updates about every three or four months.

Maybe it's time for FreeBSD...

in reply to uptime question

ayam666@hotmail.com

"...what distro are you using to get 365 days of uptime..."

Just to comment on the question, I used to manage a few FreeBSD boxes (as internet gateway, mail server, DNS server and Samba server) for a company that virtually had no budget for their IT department.

We did not have brand-name hardware. One of the servers was even built using parts I scavenged from old boxes.

Most of them were very stable with uptime of more than 365 days.

The only time we had to switch off the server was because of a planned electricity upgrade by the electricity department - a possibility of outage of more than 5 hours just after midnight. Other than that, there was one case of a bad sector on one of the mirrored hard disks.

In my opinion, it might not be Linux or the OS that causes low uptime. Rather, it's the applications that we run on it that bring down the system most of the time.

Like Kyle, I only upgraded the OS for security reasons.

Regards,

Yance

iotop on centos

Anonymous

# rpm -i iotop-0.4-1.noarch.rpm
error: Failed dependencies:
python(abi) = 2.6 is needed by iotop-0.4-1.noarch

----------------
Setting up Install Process
Package python-2.4.3-27.el5.i386 already installed and latest version
-----------------

1. Where can I get an iotop compatible with python-2.4?
2. If not, how can I upgrade my python on CentOS 5?

Bob, maybe it's not the distro

Anonymous

Bob, maybe it's not the distro itself that's causing your problem.

uptime

slashbob

Jerry,

Thanks for the info. I guess Debian stable has, from my two years of experience, had kernel security updates at least every 3 months.

I was waiting to see which would come out first, Ubuntu 10.04 or CentOS 5.5, and had decided that whichever was released first is what I would use (for now), knowing that Ubuntu would be supported for 5 years and CentOS for roughly the same time frame for 5.5 or 7 years from release.

Anyway, I've installed Ubuntu and it's been running nicely for 14 days now. When I ssh into the server it has an interesting summary of system information like: System Load, Swap usage, CPU temperature, Users logged in, Usage of /home, and then it tells me how many packages can be updated and how many are security updates which I thought was kind of nice.

During install it asked if I wanted to set up unattended-upgrades for security updates. I know it can be a little scary, but this is just a home server so I agreed. So every day it checks for security updates and if there are any it installs them and sends me an email of what was done (Thanks must go to Kyle for his great tutorials on setting up a mail server with postfix. Thanks Kyle!).

We'll see what kinds of uptimes I get now.

Uptime...

Jerry McBride

Say Bob,

I've been running Gentoo these last few years and the only thing that shortens the uptime on my servers is when a kernel comes out with new features or security updates. That aside, I can/could break the 365 day uptime with ease on these boxes.

The only real time I had problems getting a box to run for any length of time was when I finally figured out that it had some badly manufactured memory in it. Swapped the junk out for some name brand sticks and it ran as I expected it to.

With Linux, for any problems with operation, I would suspect hardware issues before digging around the OS.

Just my two cents...

Jerry

---- Jerry McBride

I use a mix of distributions

Kyle Rankin

I use a mix of distributions and have gotten 1-2 year uptimes on most of them. In production I'm also resistant to update something just so the version number is higher, so as long as I have reasonably stable applications and reliable power, it's not too difficult to maintain a high uptime. The main enemy of it these days is kernel upgrades, which again, I resist unless there is a good reason (security) to do so.


uptime

slashbob

So, the age old question ...

Kyle, what distro are you using to get 365 days of uptime?

I've been running Debian for a couple of years and never seen anything greater than 77 days. Thinking of switching my server to Slackware or something else.

/bob
