Hack and / - Linux Troubleshooting, Part I: High Load

What do you do when you get an alert that your system load is high? Tracking down the cause of high load just takes some time, some experience and a few Linux tools.

As you can see, there's a lot of information in only a few lines. The first line mirrors the information you would get from the uptime command and will update every few seconds with the latest load averages. In this case, you can see my system is busy, but not what I would call heavily loaded. All the same, this output breaks down well into our different load categories. When I troubleshoot a sluggish system, I generally rule out CPU-bound load first, then RAM issues and finally I/O issues, so let's start with CPU-bound load.
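If all you want are the load averages themselves, the uptime command prints the same three numbers without the rest of the top display. The output below is just an illustration, not taken from the system in the listings:

$ uptime
 18:32:04 up 12 days,  3:45,  2 users,  load average: 1.73, 1.52, 1.10

The three values are the average number of processes that were either runnable or stuck in uninterruptible (usually disk) waits over the last 1, 5 and 15 minutes. As a rough rule of thumb, compare them against the number of CPUs in the machine (on most systems, grep -c ^processor /proc/cpuinfo will tell you); a load average that stays well above that count is worth a closer look.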

CPU-Bound Load

CPU-bound load is load caused when you have too many CPU-intensive processes running at once. Because each process needs CPU resources, they all must wait their turn. To check whether load is CPU-bound, check the CPU line in the top output:

Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st

Each of these values is the percentage of CPU time tied up doing a particular task. Again, you could spend an entire column on all of the output from top, so here are a few of these values and how to read them (a quick way to pull this same line out of top from a script follows the list):

  • us: user CPU time. More often than not, when you have CPU-bound load, it's due to a process run by a user on the system, such as Apache, MySQL or maybe a shell script. If this percentage is high, a user process such as those is a likely cause of the load.

  • sy: system CPU time. The system CPU time is the percentage of the CPU tied up by kernel and other system processes. CPU-bound load should manifest as a high percentage of either user or system CPU time.

  • id: CPU idle time. This is the percentage of the time that the CPU spends idle. The higher the number here the better! In fact, if you see really high CPU idle time, it's a good indication that any high load is not CPU-bound.

  • wa: I/O wait. The I/O wait value tells you the percentage of time the CPU spends waiting on I/O (typically disk I/O). If you have high load and this value is high, it's likely the load is not CPU-bound but is due to either RAM issues or high disk I/O.
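As promised above, here's a quick way to grab that CPU line outside of an interactive top session, say from a script that logs it whenever load spikes. Treat this as a sketch; the exact label top prints ("Cpu(s)" here) varies a bit between top versions, so adjust the grep pattern to match your own output:

$ top -bn1 | grep 'Cpu(s)'
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st

vmstat 5 is another easy way to watch the same us, sy, id and wa percentages, printing one line every five seconds so you can see a trend.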

Track Down CPU-Bound Load

If you do see a high percentage in the user or system columns, there's a good chance your load is CPU-bound. To track down the root cause, skip down a few lines to where top displays a list of the processes currently running on the system. By default, top sorts these based on the percentage of CPU used, with the processes using the most at the top (Listing 2).

The %CPU column tells you just how much CPU each process is taking up. In this case, you can see that MySQL is taking up 53% of my CPU. As you look at this output during CPU-bound load, you probably will see one of two things: either you will have a single process tying up 99% of your CPU, or you will see a number of smaller processes all fighting for a percentage of CPU time. In either case, it's relatively simple to see the processes that are causing the problem.

There's one final note I want to add on CPU-bound load: I've seen systems get incredibly high load simply because a multithreaded program spawned a huge number of threads on a system without many CPUs. If you spawn 20 threads on a single-CPU system, you might see a high load average, even though there are no particular processes that seem to tie up CPU time.
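If you'd rather pull the same information outside of top, ps can sort processes by CPU usage, and its nlwp column shows each process's thread count, which helps spot the many-threads case I just described. This is only a sketch built from standard ps format specifiers, and the PID in the second command is a placeholder, not a real process from the listings:

$ ps -eo pid,user,pcpu,nlwp,args --sort=-pcpu | head
$ ps -o nlwp,args -p 1234

The first command lists processes with their CPU percentage and thread count, heaviest CPU consumers first; the second checks the thread count of a single process you suspect.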

______________________

Kyle Rankin is a director of engineering operations in the San Francisco Bay Area, the author of a number of books including DevOps Troubleshooting and The Official Ubuntu Server Book, and is a columnist for Linux Journal.

Comments

Load averages

Chris Sardius

I manage some quad-core Linux systems and basically get a spike in load averages every now and then. I'm looking forward to deploying a permanent fix for this.

uptime

slashbob

Well, my Ubuntu 10.04 uptime only lasted about 35 days and then there was a kernel security update so I had to reboot.

From what I've seen, Debian/Ubuntu and CentOS have kernel security updates about every three or four months.

Maybe it's time for FreeBSD...

in reply to uptime question

ayam666@hotmail.com

"...what distro are you using to get 365 days of uptime..."

Just to comment on the question, I used to manage a few FreeBSD boxes (as internet gateway, mail server, DNS server and Samba server) for a company that had virtually no budget for its IT department.

We did not have brand-name hardware. One of the servers was even built using parts I scavenged from old boxes.

Most of them were very stable, with uptimes of more than 365 days.

The only time we had to switch off the server was because of a planned electricity upgrade by the electricity department - a possible outage of more than 5 hours just after midnight. Other than that, there was one case of a bad sector on one of the mirrored hard disks.

In my opinion, it might not be Linux or the OS that causes low uptime. Rather, it's the applications that we run on it that bring down the system most of the time.

Like Kyle, I only upgraded the OS for security reasons.

Regards,

Yance

iotop on centos

Anonymous

# rpm -i iotop-0.4-1.noarch.rpm
error: Failed dependencies:
python(abi) = 2.6 is needed by iotop-0.4-1.noarch

----------------
Setting up Install Process
Package python-2.4.3-27.el5.i386 already installed and latest version
-----------------

1. Where can I get an iotop compatible with python-2.4?
2. If not, how can I upgrade my Python on CentOS 5?

Bob, maybe it's not the distro

Anonymous

Bob, maybe it's not the distro itself that's causing your problem.

uptime

slashbob

Jerry,

Thanks for the info. I guess, from my two years of experience, Debian stable has had kernel security updates at least every three months.

I was waiting to see which would come out first, Ubuntu 10.04 or CentOS 5.5, and had decided that whichever was released first is what I would use (for now). I knew that Ubuntu would be supported for 5 years and CentOS for roughly the same time frame, 5.5 or 7 years from release.

Anyway, I've installed Ubuntu and it's been running nicely for 14 days now. When I ssh into the server, it has an interesting summary of system information like: System Load, Swap usage, CPU temperature, Users logged in, Usage of /home, and then it tells me how many packages can be updated and how many are security updates, which I thought was kind of nice.

During install, it asked if I wanted to set up unattended-upgrades for security updates. I know it can be a little scary, but this is just a home server, so I agreed. So every day it checks for security updates, and if there are any, it installs them and sends me an email of what was done. (Thanks must go to Kyle for his great tutorials on setting up a mail server with Postfix. Thanks Kyle!)

We'll see what kinds of uptimes I get now.

Uptime...

Jerry McBride

Say Bob,

I've been running Gentoo these last few years, and the only thing that shortens the uptime on my servers is when a kernel comes out with new features or security updates. That aside, I can/could break the 365-day uptime with ease on these boxes.

The only time I really had problems getting a box to run for any length of time was when I finally figured out that it had some badly manufactured memory in it. Swapped the junk out for some name-brand sticks and it ran as I expected it to.

With Linux, when there are any problems with operation, I would suspect hardware issues before digging around in the OS.

Just my two cents...

Jerry

---- Jerry McBride

I use a mix of distributions

Kyle Rankin

I use a mix of distributions and have gotten 1-2 year uptimes on most of them. In production, I'm also resistant to updating something just so the version number is higher, so as long as I have reasonably stable applications and reliable power, it's not too difficult to maintain a high uptime. The main enemy of uptime these days is kernel upgrades, which, again, I resist unless there is a good reason (security) to do so.

uptime

slashbob

So, the age old question ...

Kyle, what distro are you using to get 365 days of uptime?

I've been running Debian for a couple of years and never seen anything greater than 77 days. Thinking of switching my server to Slackware or something else.

/bob
