The Linux Scheduler
Last month, we started a new series on Linux kernel internals. In that first part, we looked at how Linux manages processes and why in many ways Linux is better at creating and maintaining processes than many commercial UNIX systems.
This time, we dwell a bit on the subject of scheduling. Much to my surprise, here again, Linux goes the unorthodox way and disregards conventional wisdom in kernel theory. The results are excellent. Let's see how.
In Linux 2.2.x there are three classes of processes, as can be seen from the data definition for the scheduler (from linux/include/linux/sched.h):
/* Scheduling Policies */ #define SCHED_OTHER 0 #define SCHED_FIFO 1 #define SCHED_RR 2
SCHED_OTHER tasks are the normal user tasks (default).
Tasks running in SCHED_FIFO will never be preempted. They will leave the CPU only for waiting sync kernel events or if an explicit sleep or reschedule has been requested from user space.
Tasks running in SCHED_RR are real time (RT), but they will leave the CPU if there is another real-time task in the run queue. So the CPU power will be distributed between all SCHED_RR tasks. If at least one RT task is running, no other SCHED_OTHER task will be allowed to run in any CPU. Each RT task has an rt_priority so the SCHED_RR class will be allowed to distribute the CPU power between all the SCHED_RR tasks at will. The rt_priority of the SCHED_RR class works exactly as the normal priority field for of the SCHED_OTHER (default) class.
Only the root user can change the class of the current task via the sched_setscheduler syscall.
One of the tasks of a kernel is to make sure the system remains firmly under its control even in situations of misbehaving programs. One such misbehaving program might fork too many processes too quickly. Thus, the kernel becomes so busy with itself that it cannot cater to its other responsibilities. I found out that Linux has no limit to how fast user-land programs can spawn children. HP-UX, Solaris and AIX have a limit of one fork per processor tick (called a jiffie under Linux). The patch in Listing 1 (see Resources) will allow a maximum of one fork per jiffie (one jiffie is usually 1/100 second, except on the Alpha architecture where it is 1/1024).
Threads are necessary to allow your process to make use of multiple processors. Linux doesn't really make any distinction between a process and a thread from a memory management and scheduling point of view. Some operating systems, like Solaris, manage threads within the user process by means of a thread scheduling library. The kernel sees only the process and doesn't know which thread, if any, is actually executing inside the user process. This saves the kernel from managing lists with thousands of entries for each thread for each process.
Obviously, the threads emulated on the top of one single user process won't be allowed to run concurrently on SMP, so the user-space approach won't scale very well on an SMP machine. Threading is strictly necessary only when all threads will be CPU-bound and not mainly I/O-oriented. If all the threads are CPU-bound, you definitely want to be able to scale for SMP.
Using threads only to wait for events is overkill. On the other hand, having threads sleeping is a waste of resources and performance. Almost all subsystems in Linux (such as TCP/IP) offer async event registration. Using async event via the SIGIO signal is similar to IRQ-driven handling.
With the user-space approach, you will at least avoid the TLB (translation lookaside buffer) flushing, as all the threads will share the same address space.
The advantage of having threads managed in user space through the threads library is the kernel will spend the scheduling CPU cost in user space. It is true that in user space, you may choose to implement a very fast round-robin scheduler that may cut down the scheduling cost, compared to the clever (but more expensive, in terms of execution path) Linux scheduler.
Speaking of SMP, as of Linux 2.4 I found there is no way to declare the processor affinity of any given user-space process.
The scheduler could keep track of the CPU affinity declaration of a process, or it could just determine a preferred CPU for a process by itself. The other day, together with Andrea Arcangeli of Italy, I designed the simple kernel patch in Listing 2 (see Resources) that implements processor affinity. Notice that processor affinity makes the most sense when other processes are excluded from running on this CPU. A better way to implement this patch would be to have the system administrator set affinity with an external call like nice.
The 2.2.x SMP kernel scheduler has some bugs that sometimes make it less effective than the UP (UniProcessor) scheduler. Nevertheless, Andrea fixed all such bugs and rewrote the heuristics from scratch and the SMP scheduler gives an impressive SMP improvement under load. The SMP changes are just in the 2.3.x kernels, and I plan to integrate it in 2.2.x also. You can get Andrea's patch at ftp.suse.com/pub/people/andrea/kernel-patches/my-2.2.12/SMP-scheduler-2_2_11-E. That patch can be used against 2.2.11 and 2.2.12 and speeds up both kernels on SMP systems. The patch was merged into 2.3.15, but not yet in 2.2.x because it's a performance issue only and not a true bug fix.
The SMP scheduler heuristic mechanism works as a function of (not in any particular order):
the idle CPUs
the last CPU where the wakenup task was running
the memory management of the task (for optimising kernel-threads reschedule)
the “goodness” of the tasks running on the busy CPUs
the time necessary to invalidate the L2 cache on the running CPU (cacheflush_time)
the average time slice (avg_slice) of the wakenup task (how much time the task runs before returning to sleep)
The algorithm collects the above data and chooses the best CPU on which to reschedule the wakenup task.
There are two paths involved in the Linux scheduler behavior:
schedule: the running/current task is a SCHED_OTHER task that expired its time slice (so the kernel runs a schedule while returning from the timer IRQ for switching to the next running task).
reschedule_idle: a task got a wakeup (usually from an IRQ), and so we try to reschedule such wakenup task in the best CPU by invoking a schedule on it (it's a kind of controlled schedule).
Both paths share the goodness function. The goodness function can be considered the core of the SMP scheduler. It calculates the “goodness” of a task as a function of the following:
the task currently running
the task that wants to run
the current CPU
A plain schedule only works based on goodness. As you can see in Listing 3, a plain schedule is SMP-aware. The goodness of the potential next task increases if its last CPU is the current CPU.
Nevertheless, reschedule_idle is far more critical for CPU affinity and scheduler latencies; for example, if you comment out reschedule_idle, the scheduler latency will become infinite. Also, reschedule_idle takes care of the cache-flush time and the task average time slice, and this is the truly interesting part of the SMP scheduler. In UP, reschedule_idle is not as interesting as the SMP version. Listing 4 (see Resources) is the reschedule_idle implementation taken from 2.3.26 (soon 2.4).
The final objective of reschedule_idle is to call a schedule on a CPU in order to reschedule the wakenup task in it. We use goodness in reschedule_idle because we want to predict the effect of the future schedule that we'll send to that CPU. By predicting the effect of the future schedule, we can choose the best CPU to reschedule at wakeup time. This, of course, saves us the trouble of executing on a CPU without the proper TLB settings. If the CPU to reschedule is not the current one, we send a reschedule event via inter-CPU message passing (SMP-IPI interrupt on i386).
To make it very clear: the goodness function is the core of the Linux scheduler and is SMP aware, while reschedule_idle is the core of the clever SMP heuristics.