Hack and / - Linux Troubleshooting, Part I: High Load
This column is the first in a series of columns dedicated to one of my favorite subjects: troubleshooting. I'm a systems administrator during the day, and although I enjoy many aspects of my job, it's hard to beat the adrenaline rush of tracking down a complex server problem when downtime is being measured in dollars. Although it's true that there are about as many different reasons for downtime as there are Linux text editors, and just as many approaches to troubleshooting, over the years, I've found I perform the same sorts of steps to isolate a problem. Because my column is generally aimed more at tips and tricks and less on philosophy and design, I'm not going to talk much about overall approaches to problem solving. Instead, in this series I describe some general classes of problems you might find on a Linux system, and then I discuss how to use common tools, most of which probably are already on your system, to isolate and resolve each class of problem.
For this first column, I start with one of the most common problems you will run into on a Linux system. No, it's not getting printing to work. I'm talking about a sluggish server that might have high load. Before I explain how to diagnose and fix high load though, let's take a step back and discuss what load means on a Linux machine and how to know when it's high.
When administrators mention high load, generally they are talking about the load average. When I diagnose why a server is slow, the first command I run when I log in to the system is uptime:
$ uptime 18:30:35 up 365 days, 5:29, 2 users, load average: 1.37, 10.15, 8.10
As you can see, it's my server's uptime birthday today. You also can see that my load average is 1.37, 10.15, 8.10. These numbers represent my average system load during the last 1, 5 and 15 minutes, respectively. Technically speaking, the load average represents the average number of processes that have to wait for CPU time during the last 1, 5 or 15 minutes. For instance, if I have a current load of 0, the system is completely idle. If I have a load of 1, the CPU is busy enough that one process is having to wait for CPU time. If I do have a load of 1 and then spawn another process that normally would tie up a CPU, my load should go to 2. With a load average, the system will give you a good idea of how consistently busy it has been over the past 1, 5 and 10 minutes.
Another important thing to keep in mind when you look at a load average is that it isn't normalized according to the number of CPUs on your system. Generally speaking, a consistent load of 1 means one CPU on the system is tied up. In simplified terms, this means that a single-CPU system with a load of 1 is roughly as busy as a four-CPU system with a load of 4. So in my above example, let's assume that I have a single-CPU system. If I were to log in and see the above load average, I'd probably assume that the server had pretty high load (8.10) during the last 15 minutes that spiked around 5 minutes ago (10.15), but recently, at least during the last 1 minute, the load has dropped significantly. If I saw this, I might even assume that the real cause of the load has subsided. On the other hand, if the load averages were 20.68, 5.01, 1.03, I would conclude that the high load had likely started in the last 5 minutes and was getting worse.
After you understand what load average means, the next logical question is “What load average is good and what is bad?” The answer to that is “It depends.” You see, a lot of different things can cause load to be high, each of which affects performance differently. One server might have a load of 50 and still be pretty responsive, while another server might have a load of 10 and take forever to log in to. I've had servers with load averages in the hundreds that were certainly slow, but didn't crash, and I had one server that consistently had a load of 50 that was still pretty responsive and stayed up for years.
What really matters when you troubleshoot a system with high load is why the load is high. When you start to diagnose high load, you find that most load seems to fall into three categories: CPU-bound load, load caused by out of memory issues and I/O-bound load. I explain each of these categories in detail below and how to use tools like top and iostat to isolate the root cause.
If the first tool I use when I log in to a sluggish system is uptime, the second tool I use is top. The great thing about top is that it's available for all major Linux systems, and it provides a lot of useful information in a single screen. top is a quite complex tool with many options that could warrant its own article. For this column, I stick to how to interpret its output to diagnose high load.
To use top, simply type top on the command line. By default, top will run in interactive mode and update its output every few seconds. Listing 1 shows sample top output from a terminal.
Kyle Rankin is a systems architect; and the author of DevOps Troubleshooting, The Official Ubuntu Server Book, Knoppix Hacks, Knoppix Pocket Reference, Linux Multimedia Hacks, and Ubuntu Hacks.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- A Topic for Discussion - Open Source Feature-Richness?
- Drupal Is a Framework: Why Everyone Needs to Understand This
- Validate an E-Mail Address with PHP, the Right Way
- RSS Feeds
- Readers' Choice Awards
- Tech Tip: Really Simple HTTP Server with Python
- BASH script to log IPs on public web server
12 min 11 sec ago - DynDNS
3 hours 47 min ago - Reply to comment | Linux Journal
4 hours 20 min ago - All the articles you talked
6 hours 43 min ago - All the articles you talked
6 hours 47 min ago - All the articles you talked
6 hours 48 min ago - myip
11 hours 13 min ago - Keeping track of IP address
13 hours 4 min ago - Roll your own dynamic dns
18 hours 17 min ago - Please correct the URL for Salt Stack's web site
21 hours 29 min ago




Comments
Load averages
I mange some quadcore linux systems and basically get a spike in load averages every now and then. I looking forward to deploying a permanent fix for this . I
uptime
Well, my Ubuntu 10.04 uptime only lasted about 35 days and then there was a kernel security update so I had to reboot.
From what I've seen Debian/Ubuntu and CentOS have kernel security updates about every three or four months.
Maybe it's time for FreeBSD...
in reply to uptime question
"...what distro are you using to get 365 days of uptime..."
Just to comment on the question, I used to manage a few FreeBSD boxes (as internet gateway, mail server, DNS server and Samba server) for a company that virtually had no budget for their IT department.
We did not have brand named hardware. One of the servers was even built using part I scavenged from old boxes.
Most of them were very stable with uptime of more than 365 days.
The only time we had to switch off the server was because of a planned electricity upgrade by the electricity department - a possibility of outage of more than 5 hours just after midnight. Other than that, it was one case of bad sector on one of the mirrored hard disk.
In my opinion, it might not be Linux or the OS that causes low uptime. Rather, its the applications that we run on it bring down the system most of the time.
Like Kyle, I only upgraded the OS for security reasons.
Regards,
Yance
iotop on centos
# rpm -i iotop-0.4-1.noarch.rpm
error: Failed dependencies:
python(abi) = 2.6 is needed by iotop-0.4-1.noarch
----------------
Setting up Install Process
Package python-2.4.3-27.el5.i386 already installed and latest version
-----------------
1. Where can I get a iotop compatible with python-2.4
2. If not, how can I upgrade my python on Centos 5?
Bob, maybe its not the distro
Bob, maybe its not the distro itself that's causing your problem.
uptime
Jerry,
Thanks for the info. I guess Debian stable has, from my two years of experience, had kernel security updates at least every 3 months.
I was waiting to see which would come out first, Ubuntu 10.04 or CentOS 5.5, and had decided that which ever was released first is what I would use (for now). Knowing that Ubuntu would be supported for 5 years and CentOS for roughly the same time frame for 5.5 or 7 years from release.
Anyway, I've installed Ubuntu and it's been running nicely for 14 days now. When I ssh into the server it has an interesting summary of system information like: System Load, Swap usage, CPU temperature, Users logged in, Usage of /home, and then it tells me how many packages can be updated and how many are security updates which I thought was kind of nice.
During install it asked if I wanted to setup unattended-upgrades for security updates, I know it can be a little scary, but this is just a home server so I agreed. So everyday it checks for security updates and if there are any it installs them and sends me an email of what was done (Thanks must go to Kyle for his great tutorials of setting up a mail server with postfix. Thanks Kyle!).
We'll see what kinds of uptimes I get now.
Uptime...
Say Bob,
I've been running Gentoo these last few years and the only thing that shortens the uptime on my servers is when a kernel comes out with new features or security updates. That aside, I can/could break the 365 day uptime with ease on these boxes.
The only real time I had problems getting a box to run for any lenght of time, was when I finally figured out that it had some badly manufactured memory in it. Swapped the junk out for some name brand sticks and it ran as I expected it to.
With linux, any problems with operation, I would suspect hardware issues before digging aroung the OS.
Just my two cents...
Jerry
---- Jerry McBride
I use a mix of distributions
I use a mix of distributions and have gotten 1-2 year uptimes on most of them. In production I'm also resistant to update something just so the version number is higher, so as long as I have reasonably stable applications and reliable power, it's not too difficult to maintain a high uptime. The main enemy of it these days is kernel upgrades, which again, I resist unless there is a good reason (security) to do so.
Kyle Rankin is a systems architect; and the author of DevOps Troubleshooting, The Official Ubuntu Server Book, Knoppix Hacks, Knoppix Pocket Reference, Linux Multimedia Hacks, and Ubuntu Hacks.
uptime
So, the age old question ...
Kyle, what distro are you using to get 365 days of uptime?
I've been running Debian for a couple of years and never seen anything greater than 77 days. Thinking of switching my server to Slackware or something else.
/bob