Hack and / - Linux Troubleshooting, Part I: High Load

What do you do when you get an alert that your system load is high? Tracking down the cause of high load just takes some time, some experience and a few Linux tools.

Out of RAM Issues

The next cause for high load is a system that has run out of available RAM and has started to go into swap. Because swap space is usually on a hard drive that is much slower than RAM, when you use up available RAM and go into swap, each process slows down dramatically as the disk gets used. Usually this causes a downward spiral as processes that have been swapped run slower, take longer to respond and cause more processes to stack up until the system either runs out of RAM or slows down to an absolute crawl. What's tricky about swap issues is that because they hit the disk so hard, it's easy to misdiagnose them as I/O-bound load. After all, if your disk is being used as RAM, any processes that actually want to access files on the disk are going to have to wait in line. So, if I see high I/O wait in the CPU row in top, I check RAM next and rule it out before I troubleshoot any other I/O issues.

When I want to diagnose out of memory issues, the first place I look is the next couple of lines in the top output:

Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4360k used, 999692k free, 286040k cached

These lines tell you the total amount of RAM and swap along with how much is used and free; however, look carefully, as these numbers can be misleading. I've seen many new and even experienced administrators who would look at the above output and conclude the system was almost out of RAM because there was only 26768k free. Although that does show how much RAM is currently unused, it doesn't tell the full story.

The Linux File Cache

When you access a file and the Linux kernel loads it into RAM, the kernel doesn't necessarily unload the file when you no longer need it. If there is enough free RAM available, the kernel tries to cache as many files as it can into RAM. That way, if you access the file a second time, the kernel can retrieve it from RAM instead of the disk and give much better performance. As a system stays running, you will find the free RAM actually will appear to get rather small. If a process needs more RAM though, the kernel simply uses some of its file cache. In fact, I see a lot of the overclocking crowd who want to improve performance and create a ramdisk to store their files. What they don't realize is that more often than not, if they just let the kernel do the work for them, they'd probably see much better results and make more efficient use of their RAM.

To get a more accurate amount of free RAM, you need to combine the values from the free column with the cached column. In my example, I would have 26768k + 286040k = 312808k, or just over 300MB of free RAM. In this case, I could safely assume my system was not experiencing an out of RAM issue. Of course, even a system that has very little free RAM may not have gone into swap. That's why you also must check the Swap: line and see if a high proportion of your swap is being used.
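The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming the older top layout shown earlier (values suffixed with k, and the cached figure printed on the Swap: line); the function names are my own, not part of any tool.

```python
import re

# Sample lines copied from the top output shown earlier in this article.
TOP_OUTPUT = """\
Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4360k used, 999692k free, 286040k cached
"""


def parse_top_memory(text):
    """Return {"Mem": {...}, "Swap": {...}} with every value in kilobytes."""
    stats = {}
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        # Each field looks like "26768k free".
        stats[label.strip()] = {
            name: int(value) for value, name in re.findall(r"(\d+)k (\w+)", rest)
        }
    return stats


def effective_free_kb(stats):
    """Free RAM plus file cache, the combination described above."""
    # top prints "cached" on the Swap: line, although it describes RAM.
    return stats["Mem"]["free"] + stats["Swap"]["cached"]


def swap_used_fraction(stats):
    """Proportion of swap space in use, between 0.0 and 1.0."""
    return stats["Swap"]["used"] / stats["Swap"]["total"]


stats = parse_top_memory(TOP_OUTPUT)
print(f"effective free RAM: {effective_free_kb(stats)}k")
print(f"swap in use: {swap_used_fraction(stats):.1%}")
```

For the sample lines, the first figure is the same 312808k worked out by hand, and the swap fraction is well under 1%, which matches the conclusion that this system is not short of RAM.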

Track Down High RAM Usage

If you do find you are low on free RAM, go back to the same process output from top, only this time, look in the %MEM column. By default, top will sort by the %CPU column, so simply type M and it will re-sort to show you which processes are using the highest percentage of RAM. In the output in Listing 3, I sorted the same processes by RAM, and you can see that the nagios2db_status process is using the most at 6.6%.
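If you would rather script the same check than watch top interactively, something like the following Python sketch could work. It assumes input in the layout produced by `ps -eo pid,pmem,comm`; the function name is hypothetical, and apart from the nagios2db_status name and its 6.6% figure, the sample rows and PIDs are invented for illustration.

```python
# Hypothetical sample in the layout of `ps -eo pid,pmem,comm`. Only the
# nagios2db_status name and its 6.6% figure come from the article; the PIDs
# and the other rows are invented for illustration.
PS_OUTPUT = """\
  PID %MEM COMMAND
 1201  2.1 mysqld
 3417  6.6 nagios2db_status
  845  0.3 sshd
"""


def top_memory_users(ps_output, count=3):
    """Return up to `count` (pmem, pid, command) tuples, highest %MEM first."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        pid, pmem, command = fields
        rows.append((float(pmem), int(pid), command.strip()))
    # Same ordering top shows after you type M: largest share of RAM first.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:count]


for pmem, pid, command in top_memory_users(PS_OUTPUT):
    print(f"{pmem:5.1f}%  {pid:>6}  {command}")
```

On a live system, the same function could be fed the text returned by `subprocess.run(["ps", "-eo", "pid,pmem,comm"], capture_output=True, text=True).stdout` instead of the sample string.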

I/O-Bound Load

I/O-bound load can be tricky to track down sometimes. As I mentioned earlier, if your system is swapping, it can make the load appear to be I/O-bound. Once you rule out swapping though, if you do have a high I/O wait, the next step is to attempt to track down which disk and partition is getting the bulk of the I/O traffic. To do this, you need a tool like iostat.

The iostat tool, like top, is a complicated and full-featured tool that could fill up its own article. Unlike top, although it should be available for your distribution, it may not be installed on your system by default, so you need to track down which package provides it. Under Red Hat and Debian-based systems, you can get it in the sysstat package. Once it's installed, simply run iostat with no arguments to get a good overall view of your disk I/O statistics:

Linux 2.6.24-19-server (hostname) 	01/31/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.73    0.07    2.03    0.53    0.00   91.64

Device:        tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda           9.82       417.96        27.53   30227262    1990625
sda1          6.55       219.10         7.12   15845129     515216
sda2          0.04         0.74         3.31      53506     239328
sda3          3.24       198.12        17.09   14328323    1236081

As with top, iostat gives you the CPU percentage output. Below that, it provides a breakdown of each drive and partition on your system and statistics for each:

  • tps: transactions per second.

  • Blk_read/s: blocks read per second.

  • Blk_wrtn/s: blocks written per second.

  • Blk_read: total blocks read.

  • Blk_wrtn: total blocks written.

By looking at these different values and comparing them to each other, ideally you will be able to find out first, which partition (or partitions) is getting the bulk of the I/O traffic, and second, whether the majority of that traffic is reads (Blk_read/s) or writes (Blk_wrtn/s). As I said, tracking down the cause of I/O issues can be tricky, but hopefully, those values will help you isolate what processes might be causing the load.
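To make that comparison concrete, here is a rough Python sketch that reads the device table from the iostat output above, picks the busiest partition and reports whether reads or writes dominate. The function names are my own, and the rule that a device name ending in a digit is a partition is an assumption that holds for sdaN-style names but not for every naming scheme.

```python
# Device table copied from the iostat output shown earlier in this article.
IOSTAT_DEVICES = """\
Device:    tps  Blk_read/s  Blk_wrtn/s   Blk_read   Blk_wrtn
sda       9.82       417.96        27.53   30227262    1990625
sda1      6.55       219.10         7.12   15845129     515216
sda2      0.04         0.74         3.31      53506     239328
sda3      3.24       198.12        17.09   14328323    1236081
"""


def parse_iostat(text):
    """Return {device: {"read_s": float, "wrtn_s": float}} from the table."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0].startswith("Device"):
            continue  # skip the header and anything that is not a data row
        devices[fields[0]] = {
            "read_s": float(fields[2]),  # Blk_read/s
            "wrtn_s": float(fields[3]),  # Blk_wrtn/s
        }
    return devices


def busiest_partition(devices):
    """Name of the partition with the highest combined read and write rate."""
    # Assumption: a name ending in a digit (sda1, sda2, ...) is a partition.
    partitions = {name: d for name, d in devices.items() if name[-1].isdigit()}
    return max(
        partitions,
        key=lambda name: partitions[name]["read_s"] + partitions[name]["wrtn_s"],
    )


def dominant_direction(stats):
    """Report whether a device is mostly being read from or written to."""
    return "reads" if stats["read_s"] >= stats["wrtn_s"] else "writes"


devices = parse_iostat(IOSTAT_DEVICES)
name = busiest_partition(devices)
print(name, "is busiest, mostly", dominant_direction(devices[name]))
```

With the sample table, sda1 comes out ahead of sda3 (about 226 versus 215 blocks per second combined), and its traffic is overwhelmingly reads. Keep in mind that iostat run with no arguments reports averages since boot, so these figures describe long-term behavior, not necessarily the current spike.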

For instance, if you have an I/O-bound load and you suspect that your remote backup job might be the culprit, compare the read and write statistics. Because you know that a remote backup job is primarily going to read from your disk, if you see that the majority of the disk I/O is writes, you reasonably can assume it's not from the backup job. If, on the other hand, you do see a heavy amount of read I/O on a particular partition, you might run the lsof command and grep for that backup process and see whether it does in fact have some open file handles on that partition.

As you can see, tracking down I/O issues with iostat is not straightforward. Even with no arguments, it can take some time and experience to make sense of the output. That said, iostat does have a number of arguments you can use to get more information about different types of I/O, including modes to find details about NFS shares. Check out the man page for iostat if you want to know more.

Up until recently, tools like iostat were about the limit systems administrators had in their toolboxes for tracking down I/O issues, but due to recent developments in the kernel, it has become easier to find the causes of I/O on a per-process level. If you have a relatively new system, check out the iotop tool. Like with iostat, it may not be installed by default, but as the name implies, it essentially acts like top, only for disk I/O. In Listing 4, you can see that an rsync process on this machine is using the most I/O (in this case, read I/O).

______________________

Kyle Rankin is a systems architect and the author of DevOps Troubleshooting, The Official Ubuntu Server Book, Knoppix Hacks, Knoppix Pocket Reference, Linux Multimedia Hacks, and Ubuntu Hacks.

Comments

Load averages

Chris Sardius

I manage some quad-core Linux systems and get a spike in load averages every now and then. I am looking forward to deploying a permanent fix for this.

uptime

slashbob

Well, my Ubuntu 10.04 uptime only lasted about 35 days and then there was a kernel security update so I had to reboot.

From what I've seen Debian/Ubuntu and CentOS have kernel security updates about every three or four months.

Maybe it's time for FreeBSD...

in reply to uptime question

ayam666@hotmail.com

"...what distro are you using to get 365 days of uptime..."

Just to comment on the question, I used to manage a few FreeBSD boxes (as internet gateway, mail server, DNS server and Samba server) for a company that virtually had no budget for their IT department.

We did not have brand-name hardware. One of the servers was even built using parts I scavenged from old boxes.

Most of them were very stable with uptime of more than 365 days.

The only time we had to switch off the server was because of a planned electricity upgrade by the electricity department - a possibility of outage of more than 5 hours just after midnight. Other than that, it was one case of a bad sector on one of the mirrored hard disks.

In my opinion, it might not be Linux or the OS that causes low uptime. Rather, it's the applications that we run on it that bring down the system most of the time.

Like Kyle, I only upgraded the OS for security reasons.

Regards,

Yance

iotop on centos

Anonymous

# rpm -i iotop-0.4-1.noarch.rpm
error: Failed dependencies:
python(abi) = 2.6 is needed by iotop-0.4-1.noarch

----------------
Setting up Install Process
Package python-2.4.3-27.el5.i386 already installed and latest version
-----------------

1. Where can I get an iotop compatible with python-2.4?
2. If not, how can I upgrade my python on CentOS 5?

Bob, maybe it's not the distro

Anonymous

Bob, maybe it's not the distro itself that's causing your problem.

uptime

slashbob

Jerry,

Thanks for the info. I guess Debian stable has, from my two years of experience, had kernel security updates at least every 3 months.

I was waiting to see which would come out first, Ubuntu 10.04 or CentOS 5.5, and had decided that whichever was released first is what I would use (for now), knowing that Ubuntu would be supported for 5 years and CentOS for roughly the same time frame for 5.5, or 7 years from release.

Anyway, I've installed Ubuntu and it's been running nicely for 14 days now. When I ssh into the server it has an interesting summary of system information like: System Load, Swap usage, CPU temperature, Users logged in, Usage of /home, and then it tells me how many packages can be updated and how many are security updates which I thought was kind of nice.

During install it asked if I wanted to set up unattended-upgrades for security updates. I know it can be a little scary, but this is just a home server, so I agreed. So every day it checks for security updates, and if there are any, it installs them and sends me an email of what was done (Thanks must go to Kyle for his great tutorials on setting up a mail server with postfix. Thanks Kyle!).

We'll see what kinds of uptimes I get now.

Uptime...

Jerry McBride

Say Bob,

I've been running Gentoo these last few years and the only thing that shortens the uptime on my servers is when a kernel comes out with new features or security updates. That aside, I can/could break the 365 day uptime with ease on these boxes.

The only real time I had problems getting a box to run for any length of time was when I finally figured out that it had some badly manufactured memory in it. Swapped the junk out for some name brand sticks and it ran as I expected it to.

With Linux, for any problems with operation, I would suspect hardware issues before digging around the OS.

Just my two cents...

Jerry

---- Jerry McBride

I use a mix of distributions

Kyle Rankin

I use a mix of distributions and have gotten 1-2 year uptimes on most of them. In production I'm also resistant to update something just so the version number is higher, so as long as I have reasonably stable applications and reliable power, it's not too difficult to maintain a high uptime. The main enemy of it these days is kernel upgrades, which again, I resist unless there is a good reason (security) to do so.

uptime

slashbob

So, the age old question ...

Kyle, what distro are you using to get 365 days of uptime?

I've been running Debian for a couple of years and never seen anything greater than 77 days. Thinking of switching my server to Slackware or something else.

/bob