The Linux Scheduler
Last month, we started a new series on Linux kernel internals. In that first part, we looked at how Linux manages processes and why in many ways Linux is better at creating and maintaining processes than many commercial UNIX systems.
This time, we dwell a bit on the subject of scheduling. Much to my surprise, here again, Linux goes the unorthodox way and disregards conventional wisdom in kernel theory. The results are excellent. Let's see how.
In Linux 2.2.x there are three classes of processes, as can be seen from the data definition for the scheduler (from linux/include/linux/sched.h):
/* Scheduling Policies */ #define SCHED_OTHER 0 #define SCHED_FIFO 1 #define SCHED_RR 2
SCHED_OTHER tasks are the normal user tasks (default).
Tasks running in SCHED_FIFO will never be preempted. They will leave the CPU only for waiting sync kernel events or if an explicit sleep or reschedule has been requested from user space.
Tasks running in SCHED_RR are real time (RT), but they will leave the CPU if there is another real-time task in the run queue. So the CPU power will be distributed between all SCHED_RR tasks. If at least one RT task is running, no other SCHED_OTHER task will be allowed to run in any CPU. Each RT task has an rt_priority so the SCHED_RR class will be allowed to distribute the CPU power between all the SCHED_RR tasks at will. The rt_priority of the SCHED_RR class works exactly as the normal priority field for of the SCHED_OTHER (default) class.
Only the root user can change the class of the current task via the sched_setscheduler syscall.
One of the tasks of a kernel is to make sure the system remains firmly under its control even in situations of misbehaving programs. One such misbehaving program might fork too many processes too quickly. Thus, the kernel becomes so busy with itself that it cannot cater to its other responsibilities. I found out that Linux has no limit to how fast user-land programs can spawn children. HP-UX, Solaris and AIX have a limit of one fork per processor tick (called a jiffie under Linux). The patch in Listing 1 (see Resources) will allow a maximum of one fork per jiffie (one jiffie is usually 1/100 second, except on the Alpha architecture where it is 1/1024).
Threads are necessary to allow your process to make use of multiple processors. Linux doesn't really make any distinction between a process and a thread from a memory management and scheduling point of view. Some operating systems, like Solaris, manage threads within the user process by means of a thread scheduling library. The kernel sees only the process and doesn't know which thread, if any, is actually executing inside the user process. This saves the kernel from managing lists with thousands of entries for each thread for each process.
Obviously, the threads emulated on the top of one single user process won't be allowed to run concurrently on SMP, so the user-space approach won't scale very well on an SMP machine. Threading is strictly necessary only when all threads will be CPU-bound and not mainly I/O-oriented. If all the threads are CPU-bound, you definitely want to be able to scale for SMP.
Using threads only to wait for events is overkill. On the other hand, having threads sleeping is a waste of resources and performance. Almost all subsystems in Linux (such as TCP/IP) offer async event registration. Using async event via the SIGIO signal is similar to IRQ-driven handling.
With the user-space approach, you will at least avoid the TLB (translation lookaside buffer) flushing, as all the threads will share the same address space.
The advantage of having threads managed in user space through the threads library is the kernel will spend the scheduling CPU cost in user space. It is true that in user space, you may choose to implement a very fast round-robin scheduler that may cut down the scheduling cost, compared to the clever (but more expensive, in terms of execution path) Linux scheduler.
Speaking of SMP, as of Linux 2.4 I found there is no way to declare the processor affinity of any given user-space process.
The scheduler could keep track of the CPU affinity declaration of a process, or it could just determine a preferred CPU for a process by itself. The other day, together with Andrea Arcangeli of Italy, I designed the simple kernel patch in Listing 2 (see Resources) that implements processor affinity. Notice that processor affinity makes the most sense when other processes are excluded from running on this CPU. A better way to implement this patch would be to have the system administrator set affinity with an external call like nice.
The 2.2.x SMP kernel scheduler has some bugs that sometimes make it less effective than the UP (UniProcessor) scheduler. Nevertheless, Andrea fixed all such bugs and rewrote the heuristics from scratch and the SMP scheduler gives an impressive SMP improvement under load. The SMP changes are just in the 2.3.x kernels, and I plan to integrate it in 2.2.x also. You can get Andrea's patch at ftp.suse.com/pub/people/andrea/kernel-patches/my-2.2.12/SMP-scheduler-2_2_11-E. That patch can be used against 2.2.11 and 2.2.12 and speeds up both kernels on SMP systems. The patch was merged into 2.3.15, but not yet in 2.2.x because it's a performance issue only and not a true bug fix.
The SMP scheduler heuristic mechanism works as a function of (not in any particular order):
the idle CPUs
the last CPU where the wakenup task was running
the memory management of the task (for optimising kernel-threads reschedule)
the “goodness” of the tasks running on the busy CPUs
the time necessary to invalidate the L2 cache on the running CPU (cacheflush_time)
the average time slice (avg_slice) of the wakenup task (how much time the task runs before returning to sleep)
The algorithm collects the above data and chooses the best CPU on which to reschedule the wakenup task.
There are two paths involved in the Linux scheduler behavior:
schedule: the running/current task is a SCHED_OTHER task that expired its time slice (so the kernel runs a schedule while returning from the timer IRQ for switching to the next running task).
reschedule_idle: a task got a wakeup (usually from an IRQ), and so we try to reschedule such wakenup task in the best CPU by invoking a schedule on it (it's a kind of controlled schedule).
Both paths share the goodness function. The goodness function can be considered the core of the SMP scheduler. It calculates the “goodness” of a task as a function of the following:
the task currently running
the task that wants to run
the current CPU
A plain schedule only works based on goodness. As you can see in Listing 3, a plain schedule is SMP-aware. The goodness of the potential next task increases if its last CPU is the current CPU.
Nevertheless, reschedule_idle is far more critical for CPU affinity and scheduler latencies; for example, if you comment out reschedule_idle, the scheduler latency will become infinite. Also, reschedule_idle takes care of the cache-flush time and the task average time slice, and this is the truly interesting part of the SMP scheduler. In UP, reschedule_idle is not as interesting as the SMP version. Listing 4 (see Resources) is the reschedule_idle implementation taken from 2.3.26 (soon 2.4).
The final objective of reschedule_idle is to call a schedule on a CPU in order to reschedule the wakenup task in it. We use goodness in reschedule_idle because we want to predict the effect of the future schedule that we'll send to that CPU. By predicting the effect of the future schedule, we can choose the best CPU to reschedule at wakeup time. This, of course, saves us the trouble of executing on a CPU without the proper TLB settings. If the CPU to reschedule is not the current one, we send a reschedule event via inter-CPU message passing (SMP-IPI interrupt on i386).
To make it very clear: the goodness function is the core of the Linux scheduler and is SMP aware, while reschedule_idle is the core of the clever SMP heuristics.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- New Products
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Web & UI Developer (JavaScript & j Query)
- Designing Electronics with Linux
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




4 hours 58 min ago
15 hours 38 min ago
21 hours 24 min ago
21 hours 42 min ago
23 hours 35 min ago
1 day 1 hour ago
1 day 8 hours ago
1 day 8 hours ago
1 day 10 hours ago
1 day 16 hours ago