Hack and / - Linux Troubleshooting, Part I: High Load
Listing 1. Sample top Output
top - 14:08:25 up 38 days, 8:02, 1 user, load average: 1.70, 1.77, 1.68 Tasks: 107 total, 3 running, 104 sleeping, 0 stopped, 0 zombie Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, .7%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers Swap: 1004052k total, 4360k used, 999692k free, 286040k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 9463 mysql 16 0 686m 111m 3328 S 53 5.5 569:17.64 mysqld 18749 nagios 16 0 140m 134m 1868 S 12 6.6 1345:01 nagios2db_status 24636 nagios 17 0 34660 10m 712 S 8 0.5 1195:15 nagios 22442 nagios 24 0 6048 2024 1452 S 8 0.1 0:00.04 check_time.pl
As you can see, there's a lot of information in only a few lines. The first line mirrors the information you would get from the uptime command and will update every few seconds with the latest load averages. In this case, you can see my system is busy, but not what I would call heavily loaded. All the same, this output breaks down well into our different load categories. When I troubleshoot a sluggish system, I generally will rule out CPU-bound load, then RAM issues, then finally I/O issues in that order, so let's start with CPU-bound load.
CPU-bound load is load caused when you have too many CPU-intensive processes running at once. Because each process needs CPU resources, they all must wait their turn. To check whether load is CPU-bound, check the CPU line in the top output:
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, .7%wa, 0.0%hi, 0.0%si, 0.0%st
Each of these percentages are a percentage of the CPU time tied up doing a particular task. Again, you could spend an entire column on all of the output from top, so here's a few of these values and how to read them:
us: user CPU time. More often than not, when you have CPU-bound load, it's due to a process run by a user on the system, such as Apache, MySQL or maybe a shell script. If this percentage is high, a user process such as those is a likely cause of the load.
sy: system CPU time. The system CPU time is the percentage of the CPU tied up by kernel and other system processes. CPU-bound load should manifest either as a high percentage of user or high system CPU time.
id: CPU idle time. This is the percentage of the time that the CPU spends idle. The higher the number here the better! In fact, if you see really high CPU idle time, it's a good indication that any high load is not CPU-bound.
wa: I/O wait. The I/O wait value tells the percentage of time the CPU is spending waiting on I/O (typically disk I/O). If you have high load and this value is high, it's likely the load is not CPU-bound but is due to either RAM issues or high disk I/O.
If you do see a high percentage in the user or system columns, there's a good chance your load is CPU-bound. To track down the root cause, skip down a few lines to where top displays a list of current processes running on the system. By default, top will sort these based on the percentage of CPU used with the processes using the most on top (Listing 2).
Listing 2. Current Processes Example
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 9463 mysql 16 0 686m 111m 3328 S 53 5.5 569:17.64 mysqld 18749 nagios 16 0 140m 134m 1868 S 12 6.6 1345:01 nagios2db_status 24636 nagios 17 0 34660 10m 712 S 8 0.5 1195:15 nagios 22442 nagios 24 0 6048 2024 1452 S 8 0.1 0:00.04 check_time.pl
The %CPU column tells you just how much CPU each process is taking up. In this case, you can see that MySQL is taking up 53% of my CPU. As you look at this output during CPU-bound load, you probably will see one of two things: either you will have a single process tying up 99% of your CPU, or you will see a number of smaller processes all fighting for a percentage of CPU time. In either case, it's relatively simple to see the processes that are causing the problem. There's one final note I want to add on CPU-bound load: I've seen systems get incredibly high load simply because a multithreaded program spawned a huge number of threads on a system without many CPUs. If you spawn 20 threads on a single-CPU system, you might see a high load average, even though there are no particular processes that seem to tie up CPU time.
Kyle Rankin is a systems architect; and the author of DevOps Troubleshooting, The Official Ubuntu Server Book, Knoppix Hacks, Knoppix Pocket Reference, Linux Multimedia Hacks, and Ubuntu Hacks.
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?
|Designing Electronics with Linux||May 22, 2013|
|Dynamic DNS—an Object Lesson in Problem Solving||May 21, 2013|
|Using Salt Stack and Vagrant for Drupal Development||May 20, 2013|
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
- RSS Feeds
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Designing Electronics with Linux
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- A Topic for Discussion - Open Source Feature-Richness?
- Drupal Is a Framework: Why Everyone Needs to Understand This
- Validate an E-Mail Address with PHP, the Right Way
- What's the tweeting protocol?
- Kernel Problem
6 hours 12 min ago
- BASH script to log IPs on public web server
10 hours 39 min ago
14 hours 15 min ago
- Reply to comment | Linux Journal
14 hours 47 min ago
- All the articles you talked
17 hours 11 min ago
- All the articles you talked
17 hours 14 min ago
- All the articles you talked
17 hours 15 min ago
21 hours 40 min ago
- Keeping track of IP address
23 hours 31 min ago
- Roll your own dynamic dns
1 day 4 hours ago