Completely Fair Scheduler
In order for the CFS to emulate an “ideal, precise, multitasking CPU” by giving each runnable process an equal slice of execution time, CFS needs to have the following:
A mechanism to calculate what the fair CPU share is per process. This is achieved by using a system-wide runqueue fair_clock variable (cfs_rq->fair_clock). This fair clock runs at a fraction of real time, so that it runs at the ideal pace for a single task when there are N runnable tasks in the system. For example, if you have four runnable tasks, fair_clock increases at one-fourth of the speed of wall time (which means 25% fair CPU power).
A mechanism to keep track of the time for which each process was waiting while the CPU was assigned to the currently running task. This wait time is accumulated in the per-process variable wait_runtime (process->wait_runtime).
Red-Black Tree (RBTree)
A red-black tree is a type of self-balancing binary search tree—a data structure typically used to implement associative arrays. It is complex, but it has good worst-case running time for its operations and is efficient in practice. It can search, insert and delete in O(log n) time, where n is the number of elements in the tree. In red-black trees, the leaf nodes are not relevant and do not contain data. These leaves need not be explicit in computer memory—a null child pointer can encode the fact that this child is a leaf—but it simplifies some algorithms for operating on red-black trees if the leaves really are explicit nodes. To save memory, sometimes a single sentinel node performs the role of all leaf nodes; all references from internal nodes to leaf nodes then point to the sentinel node. (Source: Wikipedia.)
CFS uses the fair clock and wait runtime to keep all the runnable tasks sorted by the rq->fair_clock - p->wait_runtime key in the rbtree (see the Red-Black Tree sidebar). So, the leftmost task in the tree is the one with the “gravest CPU need”, and CFS picks the leftmost task and sticks to it. As the system progresses forward, newly awakened tasks are put into the tree farther and farther to the right—slowly but surely giving every task a chance to become the leftmost task and, thus, get on the CPU within a deterministic amount of time.
Because of this simple design, CFS no longer uses active and expired arrays and dispensed with sophisticated heuristics to mark tasks as interactive versus non-interactive.
CFS implements priorities by using weighted tasks—each task is assigned a weight based on its static priority. So, while running, the task with lower weight (lower-priority) will see time elapse at a faster rate than that of a higher-priority task. This means its wait_runtime will exhaust more quickly than that of a higher-priority task, so lower-priority tasks will get less CPU time compared to higher-priority tasks.
CFS has been modified a bit further in 2.6.24. Although the basic concept of fairness remains, a few implementation details have changed. Now, instead of chasing the global fair clock (rq->fair_clock), tasks chase each other. A clock per task, vruntime, is introduced, and an approximated average is used to initialize this clock for new tasks. Each task tracks its runtime and is queued in the RBTree using this parameter. So, the task that has run least (the one that has the gravest CPU need) is the leftmost node of the RBTree and will be picked up by the scheduler. See Resources for more details about this implementation.
In kernel 2.6.24, another major addition to CFS is group scheduling. Plain CFS tries to be fair to all the tasks running in the system. For example, let's say there is a total of 25 runnable processes in the system. CFS tries to be fair by allocating 4% of the CPU to all of them. However, let's say that out of these 25 processes, 20 belong to user A while 5 belong to user B. User B is at an inherent disadvantage, as A is getting more CPU power than B. Group scheduling tries to eliminate this problem. It first tries to be fair to a group and then to individual tasks within that group. So CFS, with group scheduling enabled, will allocate 50% of the CPU to each user A and B. The allocated 50% share of A will be divided fairly among A's 20 tasks, while the other 50% of the CPU time will be distributed fairly among B's 5 tasks.
With kernel 2.6.23, the Linux scheduler also has been made modular. Each scheduling policy (SCHED_FIFO, SCHED_RR, SCHED_OTHER and so on) can be implemented independently of the base scheduler code. This modularization is similar to object-oriented class hierarchies (Figure 3).
The core scheduler does not need to be aware of the implementation details of the individual scheduling policies. In kernel 2.6.23, sched.c (the “scheduler” from older kernels) is divided into the following files to make the scheduler modular:
kernel/sched.c: contains the code of a generic scheduler, thereby exposing functions like sched(). The specific scheduling policy is implemented in a different file.
kernel/sched_fair.c: this is the main file that implements the CFS scheduler and provides the SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE scheduling policies.
kernel/sched_rt.c: provides the SCHED_RR and SCHED_FIFO policies used by real-time (RT) threads.
Each of these scheduling policies (fair and RT) registers its function pointers with the core scheduler. The core scheduler calls the appropriate scheduler (fair or RT), based on the scheduling policy of the particular process. As with the O(1) scheduler, real-time processes will have higher priority than normal processes. CFS mainly addresses non-real-time processes, and the RT scheduler remains more or less the same as before (except for a few changes as to how non-active/expired arrays are maintained).
With this new modular scheduler in place, people who want to write new schedulers for a particular policy can do so by simply registering these new policy functions with the core scheduler.
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- ServersCheck's Thermal Imaging Camera Sensor
- The Italian Army Switches to LibreOffice
- Linux Mint 18
- Petros Koutoupis' RapidDisk
- Oracle vs. Google: Round 2
- The FBI and the Mozilla Foundation Lock Horns over Known Security Hole
- Privacy and the New Math
- Ben Rady's Serverless Single Page Apps (The Pragmatic Programmers)
Until recently, IBM’s Power Platform was looked upon as being the system that hosted IBM’s flavor of UNIX and proprietary operating system called IBM i. These servers often are found in medium-size businesses running ERP, CRM and financials for on-premise customers. By enabling the Power platform to run the Linux OS, IBM now has positioned Power to be the platform of choice for those already running Linux that are facing scalability issues, especially customers looking at analytics, big data or cloud computing.
￼Running Linux on IBM’s Power hardware offers some obvious benefits, including improved processing speed and memory bandwidth, inherent security, and simpler deployment and management. But if you look beyond the impressive architecture, you’ll also find an open ecosystem that has given rise to a strong, innovative community, as well as an inventory of system and network management applications that really help leverage the benefits offered by running Linux on Power.Get the Guide