Linux System Calls
This article aims to give the reader, either a kernel novice or a seasoned programmer, a better understanding of the dynamics of system calls in Linux. Wherever code sections are mentioned, I refer to the 2.3.52 (soon to be 2.4) series of kernels unless otherwise noted.
The most widespread CPU architecture is the IA32, a.k.a. x86, which is the architecture of the 386, 486, the Pentiums I, Pro, II and III, AMD's competing K6 and Athlon lines, plus CPUs from others such as VIA/Cyrix and Integrated Device Technologies. Because it is the most widespread, it will be taken as the illustrative example here. First, I will cover the mechanisms provided by the IA32 type of CPU for handling system calls, and then show how Linux uses those mechanisms. To review a few broad terms:
A kernel is the operating system software running in protected mode and having access to the hardware's privileged registers. The kernel is not a separate process running on the system. It is the guts of the operating system, which controls the scheduling of processes to achieve multitasking, and provides a set of routines, constantly in memory, to which every user-space process has access.
Some operating systems employ a microkernel architecture, wherein device drivers and other code are loaded and executed on demand and are not necessarily always in memory.
A monolithic architecture is more common among UNIX implementations; it is the design employed by classic designs such as BSD.
The Linux kernel is mostly a monolithic kernel: i.e., all device drivers are part of the kernel proper. Unlike BSD, a Linux kernel's device drivers can be “loadable”, i.e., they can be loaded and unloaded from memory through user commands.
Basically, multitasking is accomplished in this way: the kernel switches control between processes rapidly, using the clock interrupt (and other means) to trigger a switch from one process to another. When a hardware device issues an interrupt, the interrupt handler is found within the kernel. When a process takes an action that requires it to wait for results, the kernel steps in and puts the process into an appropriate sleeping or waiting state and schedules another process in its place.
Besides multitasking, the kernel also contains the routines which implement the interface between user programs and hardware devices, virtual memory, file management and many other aspects of the system.
Kernel routines to achieve all of the above can be called from user-space code in a number of ways. One direct method to utilize the kernel is for a process to execute a system call. There are 116 system calls; documentation for these can be found in the man pages.
A system call is a request by a running task to the kernel to provide some sort of service on its behalf. In general, the kernel services invoked by system calls comprise an abstraction layer between hardware and user-space programs, allowing a programmer to implement an operating environment without having to tailor his program(s) too specifically to one single brand or precise specific combination of system hardware components. System calls also serve this generalization function across programming languages; e.g., the read system call will read data from a file descriptor. To the programmer, this looks like another C function, but in actuality, the code for read is contained within the kernel.
The IA32 CPU recognizes two classes of events needing special processor attention: interrupts and exceptions. Both cause a forced context switch to a new procedure or task.
Interrupts can occur at unexpected times during the execution of a program and are used to respond to signals; they are signals that processor attention is needed from hardware. When a hardware device issues an interrupt, the interrupt handler is found within the kernel. Next month, we will discuss interrupts in more detail.
Two sources of interrupts are recognized by the IA32: maskable interrupts, for which vectors are determined by the hardware, and non-maskable interrupts (NMI Interrupts, or NMIs).
Exceptions are either processor-detected or issued (thrown) from software. When a procedure or method encounters an abnormal condition (an exception condition) it can't handle, it may throw an exception. Exceptions of either type are caught by handler routines (_exception handlers_) positioned along the thread's procedure or method invocation stack. This may be the calling procedure or method, or if that doesn't include code to handle the exception condition, its calling procedure or method and so on. If one of the threads of your program throws an exception that isn't caught by any procedure (or method), then that thread will expire.
An exception tells a calling procedure that an abnormal (though not necessarily rare) condition has occurred, e.g., a method was invoked with an invalid argument. When you throw an exception, you are performing a kind of structured “go to” from the place in your program where the abnormal condition was detected to a place where it can be handled. Exception handlers should be stationed at program-module levels in accordance with how general a range of errors each is capable of handling in such a way that as few exception handlers as possible will cover as wide a variety of exceptions as are going to be encountered in field application of your programs.
In Java, exceptions are objects. In addition to throwing objects whose class is declared in java.lang, you can throw objects of your own design. To create your own class of throwable objects, you need to declare it as a subclass of some member of the Throwable family. In general, however, the throwable classes you define should extend class Exception--they should be “exceptions”. Usually, the class of the exception object indicates the type of abnormal condition encountered. For example, if a thrown exception object has class illegalArgumentException, that indicates someone passed an illegal argument to a method.
When you throw an exception, you instantiate and throw an object whose class, declared in java.lang, descends from Throwable, which has two direct subclasses: Exception and Error. Errors (members of the Error family) are usually thrown for more serious problems, such as OutOfMemoryError, that may not be easy to handle. Errors are usually thrown by the methods of the Java API or the Java Virtual Machine. In general, code you write should throw only exceptions, not errors.
The Java Virtual Machine uses the class of the exception object to decide which catch clause, if any, should be allowed to handle the exception. The catch clause can also get information on the abnormal condition by querying the exception object directly for information you embedded in it during instantiation (before throwing it). The Exception class allows you to specify a detailed message as a string that can be retrieved by invoking getMessage on the exception object.
Each IA32 interrupt or exception has a number, which is referred to in the IA32 literature as its vector. The NMI interrupt and the processor-detected exceptions have been assigned vectors in the range 0 through 31, inclusive. The vectors for maskable interrupts are determined by the hardware. External interrupt controllers put the vector on the bus during the interrupt-acknowledge cycle. Any vector in the range 32 through 255, inclusive, can be used for maskable interrupts or programmed exceptions.
The startup_32 code found in /usr/src/linux/boot/head.S starts everything off at boot time by calling setup_idt. This routine sets up an IDT (Interrupt Descriptor Table) with 256 entries, each four bytes long, total 1024 bytes, offsets 0-255. It should be noted that the IDT contains vectors to both interrupt handlers and exception handlers, so “IDT” is something of a misnomer, but that's the way it is.
No interrupt entry points are actually loaded by startup_32, as that is done only after paging has been enabled and the kernel has been relocated to 0xC000000. At times, mostly during boot, the kernel must be loaded into certain addresses, because the underlying BIOS architecture demands it. After control is passed to the kernel exclusively, the Linux kernel can put itself wherever it wants. Usually this is very high up in memory, but below the 2GB limit.
When start_kernel (found in /usr/src/linux/init/main.c) is called, it invokes trap_init (found in /usr/src/linux/kernel/traps.c). trap_init sets up the IDT via the macro set_trap_gate (found in /usr/include/asm/system.h) and initializes the interrupt descriptor table as shown in the “Offset Descriptionis” table.
At this point, the interrupt vector for the system calls is not set up. It is initialized by sched_init (found in /usr/src/linux/kernel/sched.c). To set interrupt 0x80 to be a vector to the _system_call entry point, call:
set_system_gate (0x80, &system_call)
The priority of simultaneously seen interrupts and exceptions is shown in the sidebar “Runtime Priority of Interrupts”.
The Linux system call interface is vectored through a stub in libc (often glibc) and is exclusively “register-parametered”, i.e., the stack is not used for parameter passing. Each call within the libc library is generally a syscallX macro, where X is the number of parameters used by the actual routine. Under Linux, the execution of a system call is invoked by a maskable interrupt or exception class transfer (e.g., “throwing” an exception object), caused by the instruction in 0x80. Vector 0x80 is used to transfer control to the kernel. This interrupt vector is initialized during system startup, along with other important vectors such as the system clock vector. On the assembly level (in user space), it looks like Listing 1. Nowadays, this code is contained in the glibc2.1 library. 0x80 is hardcoded into both Linux and glibc, to be the system call number which transfers control to the kernel. At bootup, the kernel has set up the IDT vector 0x80 to be a “call gate” (see arch/i386/kernel/traps.c:trap_init):
The vector layout is defined in include/asm-i386/hw_irq.h.
Not until the int $0x80 is executed does the call transfer to the kernel entry point _system_call. This entry point is the same for all system calls. It is responsible for saving all registers, checking to make sure a valid system call was invoked, then ultimately transferring control to the actual system call code via the offsets in the _sys_call_table. It is also responsible for calling _ret_from_sys_call when the system call has been completed, but before returning to user space.
Actual code for the system_call entry point can be found in /usr/src/linux/kernel/sys_call.S and the code for many of the system calls can be found in /usr/src/linux/kernel/sys.c. Code for the rest is distributed throughout the source files. Some system calls, like fork, have their own source file (e.g., kernel/fork.c).
The next instruction the CPU executes after the int $0x80 is the pushl %eax in entry.S:system_call. There, we first save all user-space registers, then we range-check %eax and call sys_call_table[%eax], which is the actual system call.
Since the system call interface is exclusively register-parametered, six parameters at most can be used with a single system call. %eax is the syscall number; %ebx, %ecx, %edx, %esi, %edi and %ebp are the six generic registers used as param0-5; and %esp cannot be used because it's overwritten by the kernel when it enters ring 0 (i.e., kernel mode).
In case more parameters are needed, some structure can be placed wherever you want within your address space and pointed to from a register (not the instruction pointer, nor the stack pointer; the kernel-space functions use the stack for parameters and local variables). This case is extremely rare, though; most system calls have either no parameters or only one.
Once the system call returns, we check one or more status flags in the process structure; the exact number will depend on the system call. creat might leave a dozen flags (existing, created, locked, etc.), whereas a sync might return only one.
If no work is pending, we restore user-space registers and return to user space via iret. The next instruction after the iret is the user-space popl %ebx instruction shown in Listing 1.
Some system calls are more complex then others because of variable-length argument lists. Examples of a complex system call include open and ioctl. However, even complex system calls must use the same entry point; they just have more overhead for parameter setup. Each syscall macro expands to an assembly routine which sets up the calling stack frame and calls _system_call through an interrupt, via the instruction int $0x80. For example, the setuid system call is coded as
which expands to the assembly code shown in Listing 2.
The user-space call code library can be found in /usr/src/libc/syscall. The hard-coding of the parameter layout and actual system call numbers is not a problem, because system calls are never really changed; they are only “introduced” and “obsoleted”. An obsoleted system call is marked with the old_ prefix in the system call table for entry.S, and reference to it is removed from the next glibc. Once no application uses that system call anymore, its slot is marked “unused” and is potentially reusable for a newly introduced system call.
If a user wishes to trace a program, it is equally important to know what happens during system calls. Thus, the trace of a program usually includes a trace through the system calls as well. This is done through SIGSTOP and SIGCHLD ping-ponging between parent (tracing process) and child (traced process). When a traced process is executed, every system call is preceded by a sys_ptrace call. This makes the traced process send a SIGCHILD to the tracing process each time a system call is made. The traced process immediately enters the TASK_STOPPED state (a flag is set in the task_struct structure). The tracing process can then examine the entire address space of the traced process through the use of _ptrace, which is a multi-purpose system call. The tracing process sends a SIGSTOP to allow execution again.
Adding your own system calls is actually quite easy. Follow this list of steps to do so. Remember, if you do not make these system calls available on all the machines you want your program to run on, the result will be non-portable code.
Create a directory under the /usr/src/linux/ directory to hold your code.
Put any include files in /usr/include/sys/ and /usr/include/linux/.
Add the relocatable module produced by the link of your new kernel code to the ARCHIVES and the subdirectory to the SUBDIRS lines of the top-level Makefile. See fs/Makefile, target fs.o for an example.
Add a #define __NR_xx to unistd.h to assign a call number for your system call, where xx, the index, is something descriptive relating to your system call. It will be used to set up the vector through sys_call_table to invoke your code.
Add an entry point for your system call to the sys_call_table in sys.h. It should match the index (xx) you assigned in the previous step.
The NR_syscalls variable will be recalculated automatically.
Modify any kernel code in kernel/fs/mm/, etc. to take into account the environment needed to support your new code.
Run make from the top source code directory level to produce the new kernel incorporating your new code.
At this point, you must either add a syscall to your libraries, or use the proper _syscalln macro in your user program in order for your programs to access the new system call. The 386DX Microprocessor Programmer's Reference Manual is a helpful reference, as is James Turley's Advanced 80386 Programming Techniques.
A list of Linux/IA32 kernel system calls can be found, with the listings, in the archive file ftp.linuxjournal.com/pub/lj/listings/issue75/4048.tgz. Note: these are not libc “user-space system calls”, but real kernel system calls provided by the Linux kernel. Information source is GNU libc project, http://www.gnu.org/.
Moshe Bar (email@example.com) is an Israeli system administrator and OS researcher who started learning UNIX on a PDP-11 with AT&T UNIX Release 6 back in 1981. He holds an M.Sc. in computer science. His new book Linux Kernel Internals will be published this year. You may visit Moshe's web site at http://www.moelabs.com/.