Completely Fair Scheduler
In order for the CFS to emulate an “ideal, precise, multitasking CPU” by giving each runnable process an equal slice of execution time, CFS needs to have the following:
A mechanism to calculate what the fair CPU share is per process. This is achieved by using a system-wide runqueue fair_clock variable (cfs_rq->fair_clock). This fair clock runs at a fraction of real time, so that it runs at the ideal pace for a single task when there are N runnable tasks in the system. For example, if you have four runnable tasks, fair_clock increases at one-fourth of the speed of wall time (which means 25% fair CPU power).
A mechanism to keep track of the time for which each process was waiting while the CPU was assigned to the currently running task. This wait time is accumulated in the per-process variable wait_runtime (process->wait_runtime).
Red-Black Tree (RBTree)
A red-black tree is a type of self-balancing binary search tree—a data structure typically used to implement associative arrays. It is complex, but it has good worst-case running time for its operations and is efficient in practice. It can search, insert and delete in O(log n) time, where n is the number of elements in the tree. In red-black trees, the leaf nodes are not relevant and do not contain data. These leaves need not be explicit in computer memory—a null child pointer can encode the fact that this child is a leaf—but it simplifies some algorithms for operating on red-black trees if the leaves really are explicit nodes. To save memory, sometimes a single sentinel node performs the role of all leaf nodes; all references from internal nodes to leaf nodes then point to the sentinel node. (Source: Wikipedia.)
CFS uses the fair clock and wait runtime to keep all the runnable tasks sorted by the rq->fair_clock - p->wait_runtime key in the rbtree (see the Red-Black Tree sidebar). So, the leftmost task in the tree is the one with the “gravest CPU need”, and CFS picks the leftmost task and sticks to it. As the system progresses forward, newly awakened tasks are put into the tree farther and farther to the right—slowly but surely giving every task a chance to become the leftmost task and, thus, get on the CPU within a deterministic amount of time.
Because of this simple design, CFS no longer uses active and expired arrays and dispensed with sophisticated heuristics to mark tasks as interactive versus non-interactive.
CFS implements priorities by using weighted tasks—each task is assigned a weight based on its static priority. So, while running, the task with lower weight (lower-priority) will see time elapse at a faster rate than that of a higher-priority task. This means its wait_runtime will exhaust more quickly than that of a higher-priority task, so lower-priority tasks will get less CPU time compared to higher-priority tasks.
CFS has been modified a bit further in 2.6.24. Although the basic concept of fairness remains, a few implementation details have changed. Now, instead of chasing the global fair clock (rq->fair_clock), tasks chase each other. A clock per task, vruntime, is introduced, and an approximated average is used to initialize this clock for new tasks. Each task tracks its runtime and is queued in the RBTree using this parameter. So, the task that has run least (the one that has the gravest CPU need) is the leftmost node of the RBTree and will be picked up by the scheduler. See Resources for more details about this implementation.
In kernel 2.6.24, another major addition to CFS is group scheduling. Plain CFS tries to be fair to all the tasks running in the system. For example, let's say there is a total of 25 runnable processes in the system. CFS tries to be fair by allocating 4% of the CPU to all of them. However, let's say that out of these 25 processes, 20 belong to user A while 5 belong to user B. User B is at an inherent disadvantage, as A is getting more CPU power than B. Group scheduling tries to eliminate this problem. It first tries to be fair to a group and then to individual tasks within that group. So CFS, with group scheduling enabled, will allocate 50% of the CPU to each user A and B. The allocated 50% share of A will be divided fairly among A's 20 tasks, while the other 50% of the CPU time will be distributed fairly among B's 5 tasks.
With kernel 2.6.23, the Linux scheduler also has been made modular. Each scheduling policy (SCHED_FIFO, SCHED_RR, SCHED_OTHER and so on) can be implemented independently of the base scheduler code. This modularization is similar to object-oriented class hierarchies (Figure 3).
The core scheduler does not need to be aware of the implementation details of the individual scheduling policies. In kernel 2.6.23, sched.c (the “scheduler” from older kernels) is divided into the following files to make the scheduler modular:
kernel/sched.c: contains the code of a generic scheduler, thereby exposing functions like sched(). The specific scheduling policy is implemented in a different file.
kernel/sched_fair.c: this is the main file that implements the CFS scheduler and provides the SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE scheduling policies.
kernel/sched_rt.c: provides the SCHED_RR and SCHED_FIFO policies used by real-time (RT) threads.
Each of these scheduling policies (fair and RT) registers its function pointers with the core scheduler. The core scheduler calls the appropriate scheduler (fair or RT), based on the scheduling policy of the particular process. As with the O(1) scheduler, real-time processes will have higher priority than normal processes. CFS mainly addresses non-real-time processes, and the RT scheduler remains more or less the same as before (except for a few changes as to how non-active/expired arrays are maintained).
With this new modular scheduler in place, people who want to write new schedulers for a particular policy can do so by simply registering these new policy functions with the core scheduler.