Anatomy of a Read and Write Call

We look at three different tactics for optimizing read and write performance under Linux.

A few years ago I was tasked with making the Spec96 benchmark suite produce the fastest numbers possible using the Solaris Intel operating system and Compaq Proliant servers. We were given all the resources that Sun Microsystems and Compaq Computer Corporation could muster to help take both companies to the next level in Unix computing on the Intel architecture. Sun had just announced its flagship operating system on the Intel platform and Compaq was in a heated race with Dell for the best departmental servers. Unixware and SCO were the primary challengers since Windows NT 3.5 was not very stable at the time and no one had ever heard of an upstart graduate student from overseas who thought that he could build a kernel that rivaled those of multi-billion dollar corporations.

Now, many years later, Linux has gained considerable market share and is the de facto Unix for all the major hardware manufacturers on the Intel architecture. In this article, I will attempt to take the lessons learned from this tuning exercise and show how they can be applied to the Linux operating system.

As it turned out, the gcc benchmark was the one that everyone seemed to be improving on the most. As we analyzed what the benchmark was doing, we found out that basically it opened a file, read its contents, created a new file, wrote new contents, then closed both files. It did this over and over and over. File operations proved to be the bottleneck in performance. We tried faster processors with insignificant improvement. We tried processors with huge (at the time) level 1 and level 2 cache and still found no significant improvement. We tried using a gigabyte of memory and found little or no improvement. By using the vmstat command, we found that the processor was relatively idle, little memory was being used, but we were getting a significant amount of reads and writes to the root disk. Using the same hardware and same test programs, Unixware was 25% faster than Solaris Intel. Initially, we decided that Solaris was just really slow. Unfortunately, I was working for Sun at the time and this was not the answer that we could take to my management. We had to figure out why it was slow and make recommendations on how to improve the performance. The target was 25% faster than Unixware, not slower.

The first thing that we did was to look at the configurations. It turns out that the two systems ran on identical hardware; we just booted a different disk to boot the other operating system. The Unixware system was configured with /tmp as a tmpfs whereas the Solaris system had /tmp on the root file system. We changed the Solaris configuration to use tmpfs but it did not significantly improve performance. Later, we found that this was due to a bug in the tmpfs implementation on Solaris Intel. By breaking down the file operation, we decided to focus on three areas: the libc interface, the inode/dentry layer, and the device drivers managing the disk. In this article, we will look at the three different layers and talk about how to improve performance and how they specifically apply to Linux.

Test Program

If we take a characteristic program and look at what it does, we can drill a little deeper into the operating system on each pass. The program that we will use is relatively simple:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
 int f_out, f_in;
 const char *buffer_out = "1234567890";
 char buffer_in[16];  /* large enough to hold buffer_out */
 /* create the file and write the test string */
 if ((f_out = creat("test_file", S_IWUSR | S_IRUSR)) < 0) {
  printf("error creating test_file\n");
  return 1;
 }
 if (write(f_out, buffer_out, strlen(buffer_out)) < 0)
  printf("problems writing to test_file\n");
 close(f_out);
 /* reopen the file and read the contents back */
 if ((f_in = open("test_file", O_RDONLY)) < 0) {
  printf("error opening test_file for read\n");
  return 1;
 }
 if (read(f_in, buffer_in, strlen(buffer_out)) < 0)
  printf("error reading from test_file\n");
 close(f_in);
 return 0;
}

The sequence of operations is simple. Each call passes from the program through libc into the kernel:

Create -> libc -> kernel
Write -> libc -> kernel
Close -> libc -> kernel
Open -> libc -> kernel
Read -> libc -> kernel
Close -> libc -> kernel
libc -> kernel (to exit)

Libc optimizations

When we compile this program and run the ldd command on it, we see that the routines creat, write, close, open, and read are all part of libc. On a RedHat 7.3 system, ldd reported that /lib/i686/ and the loader are the only libraries that were included when compiled. Further investigation with the nm command shows that the symbols are versioned GLIBC_2.0, which ties them to the GNU C library used by the gcc toolchain that compiled the program. Since libc is basically part of the operating system, there does not seem to be much we can do.

Fortunately, it turns out that there are a variety of options available. Initially, for our benchmark we tried statically linking our program, which gave a marginal improvement but nothing substantial. We then tried using the libc that came with the gcc compiler. It gave a noticeable improvement in performance but not as much as we wanted. By mistake, we tried the Unixware libc dynamically linked to the Solaris binary and got 30% better performance than with the Solaris libc. Basically, we had a substantial improvement in performance without doing anything and without knowing why. Since we didn't have the source to Unixware but did have the source to Solaris and the gcc libc, we did a comparison. It turns out that the Solaris implementation had substantially more test cases and imposed significant overhead between the user's program and the system call to get into the kernel. A substantial amount of code in the Solaris libraries was written to make sure that buffers did not overflow and pointers did not run off into the stack.

Libc -> system_call -> file system -> device drivers (read and write)
Libc -> system_call -> file system (open, close, create)

Basically, what is done at the libc layer is that the arbitrary input from the user program is copied onto the stack and tests are made to make sure that it is not malicious code or code that might attempt to gain root access. A software interrupt (int 0x80 on the i386 architecture) is then generated requesting that control be taken from the user process into the kernel. The interrupt takes the data that is on the stack and passes it into an interrupt handler. The code for this interrupt handler can be found in /usr/src/linux/arch/i386/kernel/entry.S. This interrupt handler either decides that this is a valid call into the operating system and transfers control to the kernel to process the request, or decides that the call is invalid and returns with an error.

If the kernel notices that it is a request to create a file, it goes into the routines that deal with file systems. This is done through the sys_call_table entry for sys_creat. This linkage takes you to the file /usr/src/linux/fs/open.c and the sys_creat routine (not to be confused with sys_create_module in /usr/src/linux/kernel/module.c, which loads kernel modules). This routine figures out if the file name already exists and either truncates the existing file or creates the name in the directory name space. If the kernel notices that it is a read from a file on a file system, through the sys_call_table entry for sys_read, it calls into the file system code and eventually the device driver that controls the hardware for the attached disk. The sys_read and sys_write routines are found in /usr/src/linux/fs/read_write.c. For the read command, sys_read looks up the file structure for the descriptor, which determines the file system on which the file resides, and calls that file system's read function through the file's f_op (file_operations) structure. The kernel_read helper located in /usr/src/linux/fs/exec.c makes the same call when the kernel itself needs to read a file. Similar entries exist for write and close.

It turns out that the libc on Solaris had substantial error checking, boundary checking, and stack controls that prohibited users from hijacking the operating system. Unixware and GNU did not meticulously check for these error conditions and were thus substantially faster. Since our intent was to produce the fastest benchmark numbers possible, we went with the Unixware libc and continued our optimizations. Once we figured out that we had optimized everything in user space with tricks like running the application as a real-time thread, running /tmp in tmpfs, dynamically linking with a fast libc, and running the test three times to make sure that all of the code fit into cache and remained memory resident for subsequent runs, we were ready to figure out how to optimize the kernel.

The decision that we made at this point was that performance was the most important objective. Security, stability, and reliability became secondary objectives. Stability mattered only to the extent that the system did not crash before or during our tests. If your intent is truly to produce the fastest linkage from a read command into the kernel, you might look at bypassing the read and going straight into the system_call. This is a bit risky and does reduce functionality, but for raw reads and writes it produces optimum code.



Re: Anatomy of a Read and Write Call

buffer_in is never initialized. It should be defined char buffer_in[sizeof(buffer_out)].

Small Spelling Mistake

> ... Three things effect performance at this layer. ...

I think you mean affect.

Re: Small Spelling Mistake

If they were done well, they would effect good performance

False Security

If the system depends on parameter checking for system calls to be in libc, then the system is not secure. The kernel must do these checks... after all, a program can bypass the C library and do a system call itself.

Also, checking for NULL buffers is silly. A program that reads into a NULL or writes from a NULL is broken, and trying to make it look like it is doing the right thing is probably a bad idea.