The Linux Process Model

A look at the fundamental building blocks of the Linux kernel.
What about Linux?

In Linux 2.3.x (and the upcoming 2.4.0), it is a run-time tunable parameter as well. On 2.2.x, it is a compile-time tunable parameter. To change it in 2.2.x, you need to change the NR_TASKS preprocessor define in linux/include/linux/tasks.h:

#define NR_TASKS 512 /* On x86 Max 4092, or 4090 with APM configured. */

Raise this number (up to 4090) to raise the upper limit on concurrent tasks.

In 2.3.x, it is a tunable parameter which defaults to size-of-memory-in-the-system / kernel-stack-size / 2. Suppose you have 512MB of RAM; then the default upper limit of available processes will be 512*1024*1024 / 8192 / 2 = 32768. Now, 32768 processes might sound like a lot, but for an enterprise-wide Linux server with a database and many connections from a LAN or the Internet, it is a very reasonable number. I have personally seen UNIX boxes with a higher number of active processes. It might make sense to adjust this parameter in your installation. In 2.3.x, you can also increase the maximum number of tasks via a sysctl at run time. Suppose the administrator wants to increase the number of concurrent tasks to 40,000. All he has to do is this (as root):

echo 40000 > /proc/sys/kernel/threads-max
Processes and Threads

In the last 10 years or so, there has been a general move from heavyweight processes to a threaded model. The reason is clear: creating and maintaining a full process with its own address space costs time on the order of milliseconds. Threads run within the same address space as the parent process and therefore take far less time to create.

What's the difference between a process and a thread under Linux? And, more importantly, what is the difference from the scheduler's point of view? In short: nothing.

The only worthwhile difference between a thread and a process is that threads share the same address space completely. Because they run in the same address space, a context switch between threads is basically just a jump from one code location to another.

A simple check avoids the TLB (translation lookaside buffer, the mechanism within the CPU that translates virtual memory addresses to physical RAM addresses) flush and the memory-manager context switch:

/* cut from linux/arch/i386/kernel/process.c */
/* Re-load page tables */
        unsigned long new_cr3 = next->tss.cr3;
        if (new_cr3 != prev->tss.cr3)
                asm volatile("movl %0,%%cr3": :"r" (new_cr3));

The above check is in the core of the Linux kernel context switch. It simply compares the page-directory address of the current process with that of the to-be-scheduled process. If they are the same, the two tasks share the same address space (i.e., they are two threads), and nothing will be written to the %cr3 register. Writing any value to %cr3 automatically invalidates the TLB, discarding the cached user-space translations; in fact, this is how you force a TLB flush. Since two tasks in the same address space never switch the address space, the TLB is never invalidated.

With the above two-line check, Linux makes a distinction between a kernel-process switch and a kernel-thread switch. This is the only noteworthy difference.

Since the scheduler sees no difference at all between threads and processes, the Linux scheduler is very clean code. Only a few places related to signal handling distinguish between threads and processes.

In Solaris, the process is greatly disadvantaged compared to the thread and the lightweight process (LWP). Here is a measurement I did on my Solaris machine, an Ultra 2 desktop with a 167MHz processor, running Solaris 2.6:

hirame> ftime
Completed 100 forks
Avg Fork Time: 1.137 milliseconds
hirame> ttime
Completed 100 Thread Creates
Avg Thread Time: 0.017 milliseconds

I executed 100 forks and measured the elapsed time. As you can see, the average fork took 1.137 milliseconds, while the average thread create took 0.017 milliseconds (17 microseconds). In this example, thread creates were about 67 times faster. Also, my test case for threads did not include flags in the thread create call to tell the kernel to create a new LWP with the thread and bind the thread to the LWP. That would have added additional weight to the call, bringing it closer to the fork time.

Even if LWP creation closes the gap in creation times between processes (forks) and threads, user threads still offer advantages in resource utilization and scheduling.

Of course, the Linux SMP (and even uniprocessor) scheduler is clever enough to optimize the scheduling of threads on the same CPU. Rescheduling a thread on the same CPU causes no TLB flush and basically no context switch at all, because the virtual memory addressing does not change. A thread switch is very lightweight compared to a process switch, and the scheduler is aware of that. The only things Linux does while switching between two threads (not in strict order) are:

  • Enter schedule().

  • Restore all registers of the new thread (stack pointer and floating point included).

  • Update the task structure with the data of the new thread.

  • Jump to the old entry point of the new thread.

Nothing more is done. The TLB is not touched, and the address space and all the page tables remain the same. Here, the big advantage of Linux is that it does the above very fast.

Other UNIX systems are bloated by SMP locks, so the kernel loses time getting to the task-switch point. If that weren't true, Solaris kernel threads wouldn't be slower than Solaris user-space threads. Of course, kernel-based threads will scale the load across multiple CPUs, but operating systems like Solaris pay a big fixed cost on systems with few CPUs for the benefit of scaling well with many CPUs. Basically, there is no technical reason why Solaris kernel threads should be lighter than Linux kernel threads: Linux is just doing the minimum possible operations in the context-switch path, and it's doing them fast.