Linux System Calls

How to use the mechanism provided by the IA32 architecture for handling system calls.
An Example of Exceptions as Objects from Java

In Java, exceptions are objects. In addition to throwing objects whose class is declared in java.lang, you can throw objects of your own design. To create your own class of throwable objects, you need to declare it as a subclass of some member of the Throwable family. In general, however, the throwable classes you define should extend class Exception; that is, they should be “exceptions”. Usually, the class of the exception object indicates the type of abnormal condition encountered. For example, if a thrown exception object has class IllegalArgumentException, that indicates someone passed an illegal argument to a method.

When you throw an exception, you instantiate and throw an object whose class descends from Throwable (declared in java.lang), which has two direct subclasses: Exception and Error. Errors (members of the Error family) are usually thrown for more serious problems, such as OutOfMemoryError, that may not be easy to handle. Errors are usually thrown by the methods of the Java API or the Java Virtual Machine. In general, code you write should throw only exceptions, not errors.

The Java Virtual Machine uses the class of the exception object to decide which catch clause, if any, should be allowed to handle the exception. The catch clause can also get information on the abnormal condition by querying the exception object directly for information you embedded in it during instantiation (before throwing it). The Exception class allows you to specify a detailed message as a string that can be retrieved by invoking getMessage on the exception object.


Each IA32 interrupt or exception has a number, which is referred to in the IA32 literature as its vector. The NMI interrupt and the processor-detected exceptions have been assigned vectors in the range 0 through 31, inclusive. The vectors for maskable interrupts are determined by the hardware. External interrupt controllers put the vector on the bus during the interrupt-acknowledge cycle. Any vector in the range 32 through 255, inclusive, can be used for maskable interrupts or programmed exceptions.
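The vector ranges just described can be captured in a few lines of C. This is only an illustrative sketch: the names vector_class and classify_vector are invented for this example and do not appear in the kernel sources.

```c
#include <assert.h>

/* Vector ranges as described in the text; all names here are hypothetical. */
enum vector_class {
    VECTOR_RESERVED,   /* 0..31: NMI and processor-detected exceptions */
    VECTOR_AVAILABLE,  /* 32..255: maskable interrupts or programmed exceptions */
    VECTOR_INVALID     /* anything outside 0..255 is not a vector at all */
};

/* Classify a vector number according to the ranges above. */
enum vector_class classify_vector(int vector)
{
    if (vector < 0 || vector > 255)
        return VECTOR_INVALID;
    return vector <= 31 ? VECTOR_RESERVED : VECTOR_AVAILABLE;
}

/* The Linux system call vector, 0x80, must lie in the freely usable range. */
void check_syscall_vector(void)
{
    assert(classify_vector(0x80) == VECTOR_AVAILABLE);
}
```

Vector 0x80 (decimal 128) falls in the second range, which is why it is free to be used for a programmed exception such as the system call entry.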

The startup_32 code found in /usr/src/linux/boot/head.S starts everything off at boot time by calling setup_idt. This routine sets up an IDT (Interrupt Descriptor Table) with 256 entries, one for each vector from 0 through 255. In IA32 protected mode each entry is an eight-byte gate descriptor, so the table occupies 2048 bytes. It should be noted that the IDT contains gates for both interrupt handlers and exception handlers, so “IDT” is something of a misnomer, but that's the way it is.

No interrupt entry points are actually loaded by startup_32, as that is done only after paging has been enabled and the kernel has been relocated to 0xC0000000. At times, mostly during boot, the kernel must be loaded at certain addresses, because the underlying BIOS architecture demands it. After control is passed to the kernel exclusively, the Linux kernel can put itself wherever it wants. Usually this is very high up in memory, but below the 2GB limit.

When start_kernel (found in /usr/src/linux/init/main.c) is called, it invokes trap_init (found in /usr/src/linux/kernel/traps.c). trap_init sets up the IDT via the macro set_trap_gate (found in /usr/include/asm/system.h) and initializes the interrupt descriptor table as shown in the “Offset Descriptions” table.

Offset Descriptions

Table 1

At this point, the interrupt vector for the system calls is not set up. It is initialized by sched_init (found in /usr/src/linux/kernel/sched.c). To set interrupt 0x80 to be a vector to the _system_call entry point, call:

set_system_gate(0x80, &system_call);
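To make the effect of such a call more concrete, here is a user-space sketch of how an IA32 gate descriptor is packed. It assumes the standard eight-byte protected-mode gate layout and assumes that a system gate is a 32-bit trap gate with descriptor privilege level 3, so that user-mode code may invoke it. The names gate_desc, make_gate and make_system_gate are hypothetical, and the handler address and selector used with them are made-up values; the kernel's real macro writes directly into the IDT.

```c
#include <assert.h>
#include <stdint.h>

/* One IA32 protected-mode gate descriptor (eight bytes, no padding). */
struct gate_desc {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* code segment selector of the handler */
    uint8_t  reserved;     /* must be zero for interrupt and trap gates */
    uint8_t  type_attr;    /* present bit | DPL | 0 | gate type */
    uint16_t offset_high;  /* handler address, bits 16..31 */
};

#define GATE_PRESENT     0x80u  /* bit 7: descriptor is valid */
#define GATE_TYPE_TRAP32 0x0Fu  /* 32-bit trap gate */

/* Pack a handler address, selector, gate type and privilege level. */
struct gate_desc make_gate(uint32_t handler, uint16_t selector,
                           unsigned type, unsigned dpl)
{
    struct gate_desc g;

    assert(type <= 0x0Fu && dpl <= 3u);
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = selector;
    g.reserved    = 0;
    g.type_attr   = (uint8_t)(GATE_PRESENT | (dpl << 5) | type);
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}

/* Hypothetical stand-in for set_system_gate: a trap gate reachable from ring 3. */
struct gate_desc make_system_gate(uint32_t handler, uint16_t kernel_cs)
{
    return make_gate(handler, kernel_cs, GATE_TYPE_TRAP32, 3);
}

/* Reassemble the 32-bit handler address from the two halves. */
uint32_t gate_handler(const struct gate_desc *g)
{
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}
```

The privilege level matters here: a gate with DPL 0 could not be reached by an int instruction executed in user mode, which is why the system call vector needs its own kind of gate.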

The priority of simultaneously seen interrupts and exceptions is shown in the sidebar “Runtime Priority of Interrupts”.

Runtime Priority of Interrupts

The System Call Interface

The Linux system call interface is vectored through a stub in libc (often glibc) and is exclusively “register-parametered”, i.e., the stack is not used for parameter passing. Each call within the libc library is generally a syscallX macro, where X is the number of parameters used by the actual routine.

Under Linux, a system call is invoked by a programmed exception (a software interrupt, loosely comparable to “throwing” an exception object), caused by the instruction int 0x80. Vector 0x80 is used to transfer control to the kernel. This interrupt vector is initialized during system startup, along with other important vectors such as the system clock vector. On the assembly level (in user space), it looks like Listing 1. Nowadays, this code is contained in the glibc2.1 library.

0x80 is hardcoded into both Linux and glibc as the vector number that transfers control to the kernel. At bootup, the kernel has set up IDT vector 0x80 to be a “call gate” (in kernel terms, a system gate; see arch/i386/kernel/traps.c:trap_init).

Listing 1


The vector layout is defined in include/asm-i386/hw_irq.h.

Not until the int $0x80 is executed does the call transfer to the kernel entry point _system_call. This entry point is the same for all system calls. It is responsible for saving all registers, checking to make sure a valid system call was invoked, then ultimately transferring control to the actual system call code via the offsets in the _sys_call_table. It is also responsible for calling _ret_from_sys_call when the system call has been completed, but before returning to user space.

Actual code for the system_call entry point can be found in /usr/src/linux/kernel/sys_call.S and the code for many of the system calls can be found in /usr/src/linux/kernel/sys.c. Code for the rest is distributed throughout the source files. Some system calls, like fork, have their own source file (e.g., kernel/fork.c).

The next instruction the CPU executes after the int $0x80 is the pushl %eax in entry.S:system_call. There, we first save all user-space registers, then we range-check %eax and call sys_call_table[%eax], which is the actual system call.
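The range check and table dispatch can be imitated in ordinary user-space C. The sketch below is a simulation, not kernel code: struct regs, the two demo handlers, and demo_system_call are all invented for illustration, and the only behavior borrowed from the kernel is that an out-of-range number yields -ENOSYS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Saved user-space registers; a simplified, hypothetical snapshot. */
struct regs {
    long eax, ebx, ecx, edx, esi, edi, ebp;
};

typedef long (*syscall_fn)(struct regs *);

/* Two made-up "system calls" for the demo table. */
long demo_getpid(struct regs *r) { (void)r; return 42; }
long demo_add(struct regs *r)    { return r->ebx + r->ecx; }

/* Stand-in for sys_call_table: indexed by the number passed in eax. */
syscall_fn demo_sys_call_table[] = { demo_getpid, demo_add };

#define DEMO_NR_SYSCALLS \
    (sizeof demo_sys_call_table / sizeof demo_sys_call_table[0])

/* Range-check eax, dispatch through the table, and store the result in eax. */
long demo_system_call(struct regs *r)
{
    assert(r != NULL);
    if (r->eax < 0 || (unsigned long)r->eax >= DEMO_NR_SYSCALLS)
        r->eax = -ENOSYS;                       /* no such system call */
    else
        r->eax = demo_sys_call_table[r->eax](r);
    return r->eax;
}
```

With eax set to 1, ebx to 2 and ecx to 3, demo_system_call returns 5; any number outside the table returns -ENOSYS without calling anything.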

Since the system call interface is exclusively register-parametered, six parameters at most can be used with a single system call. %eax is the syscall number; %ebx, %ecx, %edx, %esi, %edi and %ebp are the six generic registers used as param0-5; and %esp cannot be used because it's overwritten by the kernel when it enters ring 0 (i.e., kernel mode).
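As a rough illustration of this register convention, the following sketch issues write directly through int $0x80 when compiled for IA32, with the call number in %eax and the three arguments in %ebx, %ecx and %edx. The function name raw_write is made up for this example. On any other architecture the sketch falls back to glibc's generic syscall() wrapper, since the register assignment described here is specific to IA32; it assumes GCC-style inline assembly and the SYS_write constant from <sys/syscall.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: invoke write without the libc stub.
 * Returns the byte count on success or a negative errno value on failure,
 * which is the form in which the kernel hands results back in %eax.
 */
long raw_write(int fd, const void *buf, size_t count)
{
    long ret;

    assert(buf != NULL || count == 0);
#if defined(__i386__)
    /* eax = call number, ebx = fd, ecx = buf, edx = count */
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "0"((long)SYS_write), "b"((long)fd),
                        "c"(buf), "d"(count)
                      : "memory");
#else
    /* Not IA32: let glibc pick the right entry mechanism. */
    ret = syscall(SYS_write, fd, buf, count);
    if (ret < 0)
        ret = -errno;   /* mimic the kernel's negative-errno convention */
#endif
    return ret;
}
```

A call such as raw_write(1, "hello\n", 6) writes to standard output. The libc stub does the same register loading and then converts a negative result into -1 with errno set.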

In case more parameters are needed, some structure can be placed wherever you want within your address space and pointed to from a register (not the instruction pointer, nor the stack pointer; the kernel-space functions use the stack for parameters and local variables). This case is extremely rare, though; most system calls have either no parameters or only one.

Once the system call returns, we check one or more status flags in the process structure; the exact number will depend on the system call. creat might leave a dozen flags (existing, created, locked, etc.), whereas a sync might return only one.

If no work is pending, we restore user-space registers and return to user space via iret. The next instruction after the iret is the user-space popl %ebx instruction shown in Listing 1.




system call

Anonymous


Just wanted to know: if I want to printk the parameters used in a system call, how do I go about it? I am trying to access %eax, %ebx and the other registers used to store the parameters and printk them, but I am not sure how. I am writing a loadable kernel module to do so. Any idea how to do it?

Thanks in Advance

fork() implementation

Anonymous

Can anyone tell me where the assembly routines implementing system calls can be found?

Very good article overall.

Anonymous

Very good article overall. But I don't know what software exceptions have to do with interrupts and hardware exceptions; I am pretty sure they are totally unrelated. I think it would be better to focus on hardware exceptions, like arithmetic ones, whose handlers are placed in the vector table.

Much of the content in this a

Anonymous

Much of the content in this article was taken directly from "How System Calls Work on Linux/i86" by Michael K. Johnson and Stanley Scalsky which is located at this URL:

Copyright (C) 1993, 1996 Michael K. Johnson,
Copyright (C) 1993 Stanley Scalsky

No mention of credit was given by Moshe Bar to the original authors. At the very least this is plagiarism and a blatant copyright violation.

You must not have read the original article, or did you?

Carsten

The two articles are not in the least identical, and the article you refer to is not nearly as exhaustive as this one. I would say that Moshe did a good job here, both in drawing on information made available by the kernel hackers and in providing even more in-depth information on how it is actually done.

who cares....

Anonymous

who cares....