Anatomy of a Read and Write Call

by Pat Shuff

A few years ago I was tasked with making the Spec96 benchmark suite produce the fastest numbers possible using the Solaris Intel operating system and Compaq Proliant servers. We were given all the resources that Sun Microsystems and Compaq Computer Corporation could muster to help take both companies to the next level in Unix computing on the Intel architecture. Sun had just announced its flagship operating system on the Intel platform and Compaq was in a heated race with Dell for the best departmental servers. Unixware and SCO were the primary challengers since Windows NT 3.5 was not very stable at the time and no one had ever heard of an upstart graduate student from overseas who thought that he could build a kernel that rivaled those of multi-billion dollar corporations.

Now, many years later, Linux has gained considerable market share and is the de facto Unix for all the major hardware manufacturers on the Intel architecture. In this article, I will attempt to take the lessons learned from that tuning exercise and show how they can be applied to the Linux operating system.

As it turned out, the gcc benchmark was the one that everyone seemed to be improving on the most. As we analyzed what the benchmark was doing, we found out that basically it opened a file, read its contents, created a new file, wrote new contents, then closed both files. It did this over and over and over. File operations proved to be the bottleneck in performance. We tried faster processors with insignificant improvement. We tried processors with huge (at the time) level 1 and level 2 cache and still found no significant improvement. We tried using a gigabyte of memory and found little or no improvement. By using the vmstat command, we found that the processor was relatively idle, little memory was being used, but we were getting a significant amount of reads and writes to the root disk. Using the same hardware and same test programs, Unixware was 25% faster than Solaris Intel. Initially, we decided that Solaris was just really slow. Unfortunately, I was working for Sun at the time and this was not the answer that we could take to my management. We had to figure out why it was slow and make recommendations on how to improve the performance. The target was 25% faster than Unixware, not slower.

The first thing that we did was to look at the configurations. It turns out that the two systems were identical hardware; we simply booted from a different disk to run the other operating system. The Unixware system was configured with /tmp as a tmpfs whereas the Solaris system had /tmp on the root file system. We changed the Solaris configuration to use tmpfs, but it did not significantly improve performance. Later, we found that this was due to a bug in the tmpfs implementation on Solaris Intel. By breaking down the file operation, we decided to focus on three areas: the libc interface, the inode/dentry layer, and the device drivers managing the disk. In this article, we will look at these three layers and talk about how to improve performance and how they specifically apply to Linux.

Test Program

If we take a characteristic program and look at what it does, we can drill a little deeper into the operating system on each pass. The program that we will use is relatively simple:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(void) {
 int f_out, f_in;
 char *buffer_out = "1234567890";
 char buffer_in[16];  /* large enough for the ten bytes we read back */
 if ((f_out = creat("test_file",S_IWUSR | S_IRUSR)) < 0) {
  printf("error creating test_file\n");
  exit(1);
 }
 if (write(f_out,buffer_out,strlen(buffer_out)) < 0) {
  printf("problems writing to test_file\n");
  exit(1);
 }
 close(f_out);
 if ((f_in = open("test_file",O_RDONLY)) < 0) {
  printf("error opening test_file for read\n");
  exit(1);
 }
 if (read(f_in,buffer_in,strlen(buffer_out)) < 0) {
  printf("error reading from test_file\n");
  exit(1);
 }
 close(f_in);
 return 0;
}

The operations that we will perform are simple:

Create -> libc -> kernel
Write -> libc -> kernel
Close -> libc -> kernel
Open -> libc -> kernel
Read -> libc -> kernel
Close -> libc -> kernel
Exit -> libc -> kernel (when the program terminates)

Libc optimizations

When we compile this program, the ldd command shows that the routines creat, write, close, open, and read are all part of libc. On a RedHat 7.3 system, ldd reports that /lib/i686/libc.so.6 and the loader are the only libraries linked in. Further investigation with the nm command shows that we actually link against GLIBC_2.0 symbol versions, which correspond to the gcc compiler we used to build the program and not to a libc private to the operating system. Since libc is basically part of the operating system, it does not seem like there is much we can do.

Fortunately, it turns out that there are a variety of options available. Initially, we tried statically linking our program, which gave a marginal improvement but nothing substantial. We then tried the libc that came with the gcc compiler. It gave a noticeable improvement in performance, but not as much as we wanted. By mistake, we tried the Unixware libc dynamically linked to the Solaris binary and got 30% better performance than with the Solaris libc. We had a substantial improvement in performance without having changed our program at all, and at first we did not know why. Since we did not have the source to Unixware but did have the source to Solaris and to the gcc libc, we did a comparison. It turns out that the Solaris implementation performed substantially more checking and imposed significant overhead between the user's program and the system call into the kernel. A substantial amount of code in the Solaris libraries was there to make sure that buffers did not overflow and that pointers did not run off into the stack.
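
The difference between libc implementations shows up as per-call overhead, which is easy to measure. Below is a minimal sketch of that kind of measurement (not the benchmark harness itself): it times repeated write calls to /dev/null so that only the library wrapper and the system call path are on the clock, and linking it against different libc builds lets you compare them.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define ITERATIONS 100000

int main(void) {
 int fd, i;
 char *buf = "1234567890";
 struct timeval start, end;
 double elapsed_us;
 /* /dev/null keeps the disk out of the picture, so only the libc
    wrapper and the system call entry/exit are being timed */
 if ((fd = open("/dev/null", O_WRONLY)) < 0) {
  printf("error opening /dev/null\n");
  exit(1);
 }
 gettimeofday(&start, NULL);
 for (i = 0; i < ITERATIONS; i++)
  write(fd, buf, strlen(buf));
 gettimeofday(&end, NULL);
 close(fd);
 elapsed_us = (end.tv_sec - start.tv_sec) * 1000000.0 +
              (end.tv_usec - start.tv_usec);
 printf("%.3f microseconds per write call\n", elapsed_us / ITERATIONS);
 return 0;
}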

Libc -> system_call -> file system -> device drivers (read and write)
Libc -> system_call -> file system  (open, close, create)

Basically, what happens at the libc layer is that the arbitrary input from the user program is copied onto the stack and tests are made to make sure that it is not malicious or an attempt to gain root access. A software interrupt (int 0x80 on the i386 architecture) is then generated, requesting that control be transferred from the user process into the kernel. The interrupt takes the data that is on the stack and passes it to an interrupt handler. The code for this handler can be found in /usr/src/linux/arch/i386/kernel/entry.S. The handler either recognizes the request as a valid call into the operating system and transfers control to the kernel routine that processes it, or decides that the call is invalid and returns an error.

If the kernel sees that the request is to create a file, it goes into the routines that deal with file systems. This is done through the sys_call_table entry for sys_creat, which takes you to the file /usr/src/linux/fs/open.c and the sys_creat routine. This routine resolves the file name and either returns an error or creates the name in the directory name space. If the kernel sees that the request is a read from a file on a file system, it goes through the sys_call_table entry for sys_read, which eventually calls down through the file system to the device driver that controls the hardware for the attached disk. The routines in /usr/src/linux/fs/read_write.c handle reads and writes. For the read call, sys_read looks up the file structure for the descriptor, determines which file system the file resides on, and calls the read function registered in that file system's file_operations structure, which in turn links to the device driver's read routine. Similar entries exist for write and close.
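
Setting actual kernel code aside, the dispatch idea itself is simple: each file system registers a table of function pointers when a file is opened, and the generic read path calls through that table. The following is a simplified user-space sketch of the concept; the structure names only loosely mirror the kernel's, and none of this is real kernel code.

#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>

struct file;

/* Loosely mirrors the kernel's file_operations: one function pointer
   per operation, filled in by the file system when the file is opened */
struct file_operations {
 ssize_t (*read)(struct file *f, char *buf, size_t count);
 ssize_t (*write)(struct file *f, const char *buf, size_t count);
};

struct file {
 const struct file_operations *f_op;
 off_t f_pos;
};

/* What the generic read path conceptually does after the trap: find
   the struct file for the descriptor and call through its table */
static ssize_t do_read(struct file *f, char *buf, size_t count)
{
 if (f->f_op == NULL || f->f_op->read == NULL)
  return -1;
 return f->f_op->read(f, buf, count);
}

/* A stand-in "file system" read routine */
static ssize_t fake_fs_read(struct file *f, char *buf, size_t count)
{
 size_t i;
 for (i = 0; i < count; i++)
  buf[i] = 'x';  /* pretend we pulled data off the disk */
 f->f_pos += count;
 return (ssize_t)count;
}

int main(void) {
 struct file_operations fake_fops = { fake_fs_read, NULL };
 struct file f = { &fake_fops, 0 };
 char buf[8];
 printf("read %d bytes\n", (int)do_read(&f, buf, sizeof(buf)));
 return 0;
}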

It turns out that the libc on Solaris had substantial error checking, boundary checking, and stack controls that prevented users from hijacking the operating system. The Unixware and GNU implementations did not check for these error conditions as meticulously and thus were substantially faster. Since our intent was to produce the fastest benchmark numbers possible, we went with the Unixware libc and continued our optimizations. Once we had optimized everything in user space with tricks like running the application as a real-time thread, putting /tmp on tmpfs, dynamically linking with a fast libc, and running the test three times so that all of the code fit into cache and remained memory resident for subsequent runs, we were ready to figure out how to optimize the kernel.

The decision that we made at this point was that performance was the most important objective. Security, stability, and reliability were no longer primary concerns; they became secondary objectives. Stability mattered only to the extent that the system did not crash before or during our tests. If your intent is truly to pursue the fastest linkage from a read call into the kernel, you might look at bypassing the libc read wrapper and going straight into the system call. This is a bit risky and does reduce functionality, but for raw reads and writes it produces optimum code.
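
As a sketch of what that can look like, the system call can be reached nearly directly through the syscall() routine and the call number defined in <sys/syscall.h>; on the i386 systems of that era this boils down to the int 0x80 trap handled by entry.S, with none of the argument checking that the libc read wrapper performs.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
 char buf[16];
 long n;
 int fd;
 if ((fd = open("test_file", O_RDONLY)) < 0) {
  printf("error opening test_file for read\n");
  return 1;
 }
 /* invoke sys_read by number, skipping the checking that the libc
    read() wrapper performs; on i386 this ends up as the int 0x80
    trap that entry.S handles */
 n = syscall(SYS_read, fd, buf, sizeof(buf));
 printf("raw sys_read returned %ld\n", n);
 close(fd);
 return 0;
}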

File system optimizations

The next layer that we needed to look at was the file system layer. When a file is created, the operating system must first figure out which file system the user is currently working in. Fortunately, libc passes this information to the kernel as a parameter to the create call. A directory structure is laid out when files are created. Unfortunately, this structure is allocated in a chunk of memory and uses a linked list to correlate a file with an inode.

During file creation the kernel allocates an inode that will describe this file and associates this inode with the file name that we created. Three things affect performance at this layer. The first, and most important, is the length of the file name. For ufs, the structure that holds the file name is double indexed on Solaris but is only a character array on Linux. The first index on Solaris can hold file names up to 14 characters. If the file name is longer than that, the index is a pointer to an array that can handle very long file names. This construct does not exist on Linux, which simply uses a maximum name length of 255 characters. For ext3, the file name is held in the directory entry that correlates an inode to a filename. The filenames are stored in entries of up to EXT2_NAME_LEN characters that grow and shrink with the filename. This does add some overhead and complication. Performance is no longer limited by a single file name length but by all of the files in the directory.

If the combined names of all files in a directory exceed a 512 byte boundary, a second read to the disk will be needed to get the rest of the file names. The operation will be slowed by the speed of reading a superblock, which at best is already in memory and at worst must be read from disk. For operations like ls -l, the directory is hit multiple times to find all of the files located in the directory. This typically takes O(n^2) time. If we have 100 files in a directory, we will need to walk the directory list 100 times to get all of the files, touching on the order of 10,000 directory entries before we reach the last filename. If we can only get one or two filenames into a dirent structure, we will then cover 256K-512K of memory and consume 50-100 blocks on the disk, assuming a 512 byte block. For a 32M system, we have consumed almost 20% of available memory to describe 1000 files. A typical Linux system has about 200,000 unique filenames, including devices and special file types like sockets and /proc entries, so not all of the directory/filename combinations can be held in memory at once; they must be cached and re-read from disk as needed.
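
The access pattern that triggers this behavior is easy to reproduce: read the directory once to get the names, then look each name up again, which is essentially what ls -l does. A small sketch of that pattern follows; on a file system that stores directory entries in a simple linear list, and with a cold name cache, each stat() rescans the directory from the top.

#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
 DIR *dir;
 struct dirent *entry;
 struct stat sb;
 char path[1024];
 const char *target = (argc > 1) ? argv[1] : ".";
 long lookups = 0;
 if ((dir = opendir(target)) == NULL) {
  printf("error opening directory %s\n", target);
  return 1;
 }
 /* one pass to enumerate the names... */
 while ((entry = readdir(dir)) != NULL) {
  snprintf(path, sizeof(path), "%s/%s", target, entry->d_name);
  /* ...and a name lookup per entry to get its inode data; with a
     cold cache and a linear directory, each lookup rescans the
     directory from the top */
  if (stat(path, &sb) == 0)
   lookups++;
 }
 closedir(dir);
 printf("%ld name lookups performed\n", lookups);
 return 0;
}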

The second parameter that affects performance is the cache size of the directory names. Making this too large consumes a significant amount of memory. Making it too small forces the kernel to re-read the superblock from the disk to get the directory and file information. As was described in the previous section, keeping as many superblocks and filenames in memory as possible greatly reduces performance penalties. Linux does not take the same approach as other Unix implementations. Solaris, for example, allows the administrator to put an upper limit on the size of the directory name cache, which restricts the kernel from consuming as much memory as it desires. Linux uses the dentry structure to allocate kernel memory as files are instantiated. There is no upper bound on the size of the dentry cache, only routines to shrink the cache when memory utilization becomes critical.
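
On Linux you can watch the dentry cache grow while a workload runs. A minimal sketch follows, assuming the /proc/sys/fs/dentry-state interface is present and that its first two fields report the number of allocated and unused dentries.

#include <stdio.h>

int main(void) {
 FILE *fp;
 long nr_dentry, nr_unused;
 /* assumes /proc/sys/fs/dentry-state exists and that its first two
    fields report allocated and unused dentries */
 if ((fp = fopen("/proc/sys/fs/dentry-state", "r")) == NULL) {
  printf("error opening /proc/sys/fs/dentry-state\n");
  return 1;
 }
 if (fscanf(fp, "%ld %ld", &nr_dentry, &nr_unused) == 2)
  printf("dentries allocated: %ld, unused: %ld\n", nr_dentry, nr_unused);
 fclose(fp);
 return 0;
}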

Details of the allocation, pruning, and shrinking can be found in /usr/src/linux/fs/dcache.c. FreeBSD uses a different cache mechanism. For every file referenced or created, a vnode is associated with it. A kernel counter generates a 32-bit number that guarantees uniqueness. When the counter rolls over, file names are removed from the cache, limiting the number of file names that can be cached to 64K. This still could consume 16M of memory if all file names were exactly 255 characters in length. Care must be taken on Linux not to have thousands of small, relatively active files with relatively long file names. Experience with the Solaris operating system has shown that this leads to frequent reboots to clear out the kernel's constructs and free memory, hence the limit in the latest versions of Solaris. The authors of BSD recommend rebooting your system at least once a year to clear out this buffer.

The third parameter, and least important, is the number and layout of the inodes. Since we were working with small files, using a different layout of the superblock and contiguous blocks on the disk had little or no effect on the benchmark. If we were using a database or something with a large dataset, we might put fewer superblocks on the disk and allocate more space for data. The risk is that a corrupted superblock will wipe out all the data in that sector. Since the superblock is written to multiple places, reducing the number of them reduces the locations that can be used for redundancy. We can tune the file system if we know the size of the files that we will be reading.

Since data is read in 512 byte chunks from the disk, we can tell the kernel to pre-fetch a group of data blocks if we know that our files will be larger than this size. In our case we did know this and tuned the kernel to read three blocks whenever one was requested. This tuning reduced the number of commands that were sent to the scsi controller and reduced the overhead of the system calls. The penalty for this optimization was available memory, but it improved reads by an average of 20ms per read. The mke2fs command has options to set things like the block size (1024 to 4096 bytes per block) as well as the number of bytes per inode. These parameters are defined at file system creation time and cannot be changed once the file system is created.
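
Newer kernels and C libraries also let the application itself ask for more aggressive read-ahead on a per-file basis. A minimal sketch follows, assuming posix_fadvise() is available; it hints that test_file will be read sequentially, in the same spirit as the block pre-fetch tuning described above.

#define _XOPEN_SOURCE 600  /* for posix_fadvise */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
 int fd;
 char buf[4096];
 if ((fd = open("test_file", O_RDONLY)) < 0) {
  printf("error opening test_file for read\n");
  return 1;
 }
 /* hint that the file will be read sequentially so the kernel can
    enlarge its read-ahead window for this descriptor */
 if (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL) != 0)
  printf("posix_fadvise not supported here\n");
 while (read(fd, buf, sizeof(buf)) > 0)
  ;  /* consume the file */
 close(fd);
 return 0;
}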

Device Driver Optimizations

The third area that we looked at was the device driver layer. There really isn't much that you can tune at the driver level. You can change the device characteristics but not much in the kernel. Some of the things that we tried were using RAID to stripe disks, using faster disks, and using disks with memory cache buffers. It turns out that using disks with memory cache buffers significantly improved performance.

We found a 4 Gigabyte disk that had 4 Gigabytes of shadow ram associated with it. Writes happened at ram speed. Reads came back with no rotational delay and no need for seek optimizations to reduce head motion and disk latency. Everything happened at memory speed, with only the overhead of the file system and the disk driver. Using this tactic, we improved the performance of the gcc benchmark by an additional 20%. The benchmarks that we ran exposed some bugs in the striping software and hardware, because performance decreased when we striped the disks. The faster disks reduced latency from 18ms per read to 12ms per read, but compared with the 1ms returns we saw from the shadow ram disks, this improvement was insignificant. The benchmark ran a few hundred operations to the disk, so the cumulative speedup reduced the total execution time by a few seconds, which was clearly noticeable.

Summary

Overall, two things are important to remember when tuning applications. First, compiler switches and intelligent compilers can substantially improve performance. Switching the order of a doubly nested loop might yield orders of magnitude performance improvements, and many modern compilers know how to do this; an illustration follows below. Play with the compiler switches. For our Spec96-gcc benchmark, we got a 3% performance improvement with the right switch combinations; simple manipulations like this typically yield 3-5%.
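
As an illustration of the loop-ordering point, the two loops below compute the same sum over the same array. The first strides through memory column by column and misses the cache; the second walks memory in the order it is laid out and can run several times faster on large arrays.

#include <stdio.h>

#define ROWS 1000
#define COLS 1000

static double grid[ROWS][COLS];

int main(void) {
 int i, j;
 double sum = 0.0;
 /* column-major walk: jumps COLS*sizeof(double) bytes between
    accesses and misses the cache on large arrays */
 for (j = 0; j < COLS; j++)
  for (i = 0; i < ROWS; i++)
   sum += grid[i][j];
 /* row-major walk: touches memory sequentially; same result,
    often several times faster on large arrays */
 for (i = 0; i < ROWS; i++)
  for (j = 0; j < COLS; j++)
   sum += grid[i][j];
 printf("sum = %f\n", sum);
 return 0;
}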

Second, make sure you tune the operating system. Which libc you link with makes a big difference in performance. In our Spec96-gcc benchmark, we got a 15% performance improvement by using a less robust but speed-optimized libc. It also makes a big difference in how secure your system might be, so be careful when making this tradeoff. Simple tuning of operating system structures can yield some improvements, typically 2-3%, but no magic. The best we got for Spec96-gcc was a 2% improvement. These tunings might also hurt other applications that have different constructs and operating conditions.

Look at what your application is doing and decide if it is worth the time and effort to tune. When we tuned the operating system for gcc, we significantly reduced performance for CPU- and memory-intensive applications. Most operating systems do not come out of the box tuned for any specific task other than running a large number of processes and opening a bunch of small, trivial files. The best improvement we got was to use shadow ram disks, which dropped disk I/O from 10-15ms to 1-2ms. This yielded 35% better performance than SCSI or IDE disks. Overall, we improved performance by tuning the system for the specific application that we were interested in optimizing. We got 55% faster performance by tuning for disk reads and writes.

Pat Shuff received his Bachelors and Masters degrees in Electrical Engineering from Texas A&M University. He then spent thirteen years at Sun Microsystems focusing on operating systems, system architecture, and commerce software. This work also allowed him to work with companies like Compaq, IBM, and NCR in their benchmarking centers. Currently, Pat is finishing up his Doctorate in Computer Engineering at Texas A&M University.
