Anatomy of a Read and Write Call
A few years ago I was tasked with making
the Spec96 benchmark suite produce the fastest numbers possible
using the Solaris Intel operating system and Compaq Proliant
servers. We were given all the resources that Sun Microsystems and
Compaq Computer Corporation could muster to help take both
companies to the next level in Unix computing on the Intel
architecture. Sun had just announced its flagship operating system
on the Intel platform and Compaq was in a heated race with Dell for
the best departmental servers. Unixware and SCO were the primary
challengers since Windows NT 3.5 was not very stable at the time
and no one had ever heard of an upstart graduate student from
overseas who thought that he could build a kernel that rivaled
those of multi-billion dollar corporations.Now many years later, Linux has gained considerable market
share and is the De facto Unix for all the major hardware
manufacturers on the Intel architecture. In this article, I will
attempt to take the lessons learned from this tuning exercise and
show how they can be applied to the Linux operating system.As it turned out, the gcc benchmark was the one that everyone
seemed to be improving on the most. As we analyzed what the
benchmark was doing, we found out that basically it opened a file,
read its contents, created a new file, wrote new contents, then
closed both files. It did this over and over and over. File
operations proved to be the bottleneck in performance. We tried
faster processors with insignificant improvement. We tried
processors with huge (at the time) level 1 and level 2 cache and
still found no significant improvement. We tried using a gigabyte
of memory and found little or no improvement. By using the vmstat
command, we found that the processor was relatively idle, little
memory was being used, but we were getting a significant amount of
reads and writes to the root disk. Using the same hardware and same
test programs, Unixware was 25% faster than Solaris Intel.
Initially, we decided that Solaris was just really slow.
Unfortunately, I was working for Sun at the time and this was not
the answer that we could take to my management. We had to figure
out why it was slow and make recommendations on how to improve the
performance. The target was 25% faster than Unixware, not
slower.The first thing that we did was to look at the
configurations. It turns out that the two systems were identical
hardware,. We just booted a different disk to boot the other
operating system. The Unixware system was configured with /tmp as a
tmpfs whereas the Solaris system had /tmp on the root file system.
We changed the Solaris configuration to use tmpfs but it did not
significantly improve performance. Later, we found that this was
due to a bug in the tmpfs implementation on Solaris Intel. By
braking down the file operation, we decided to focus on three
areas; the libc interface, the node/dentry layer, and the device
drivers managing the disk. In this article, we will look at the
three different layers and talk about how to improve performance
and how they specifically apply to Linux.Test ProgramIf we take a characteristic program and look at what it does,
we can drill a little deeper into the operating system on each
pass. The program that we will use is relatively simple:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
main() {
int f_out, f_in;
char *buffer_out = "1234567890";
char *buffer_in;
if ((f_out = create("test_file",S_IWUSR | S_IRUSR) ) < 0) {
printf("error creating test_file\n");
exit(1);
}
if (write(f_out,buffer_out,(size_t)strlen(buffer_out)) < 0) {
printf("problems writing to test_file\n");
exit(1);
}
close(f_out);
if ((f_in = open("test_file",O_RDONLY)) < 0) {
printf("error opening test_file for read\n");
exit(1);
}
if (read(f_in,buffer_in,(size_t)strlen(buffer_out)) < 0) {
printf("error reading from test_file\n");
exit(1);
}
close(f_in);
}
The operation that we will perform will be simple.
Create -> libc -> kernel Write -> libc -> kernel Close -> libc -> kernel Open -> libc -> kernel Read -> libc -> kernel Close -> libc -> kernel libc -> kernel (to exit)
Libc optimizationsWhen we compile this program, by using the ldd command we see
that the routines; create, write, close, open, and read, are all
part of libc. On a RedHat 7.3 system, ldd returned that
/lib/i686/libc.so.6 and the loader are the only libraries that were
included when compiled. Further investigation with the nm command
shows that we actually link with the GLIB_2.0 which correlates to
the gcc compiler that we used to compile the program with and not
the libc in the operating system. Since libc is basically part of
the operating system, there does not seem like much we can do.
Fortunately, it turns out that there are a variety of options
available. Initially, for our benchmark we tried statically linking
our program which had marginal improvement but nothing substantial.
We then tried using the libc that came with the gcc compiler. It
had a noticeable improvement in performance but not as much as we
wanted. By mistake, we tried the Unixware libc dynamically linked
to the Solaris binary and got 30% better performance than with the
Solaris libc. Basically we had a substantial improvement in
performance and didn't do anything and didn't know why. Since we
didn't have the source to Unixware but did have the source to
Solaris and the gcc libc, we did a comparison. It turns out that
the Solaris implementation had substantially more test cases and
significant overhead that it imposed between the users program and
the system call to get into the kernel. A substantial amount of
code was written to make sure that buffers did not overflow or
pointers run off into the stack in the Solaris libraries.
Libc -> system_call -> file system -> device drivers (read and write) Libc -> system_call -> file system (open, close, create)
Basically, what is done at the libc layer is that the random
input from the user program is copied onto the stack and tests are
made to make sure that it is not malicious code or code that might
attempt to gain root access. A hardware interrupt is then generated
requesting that control be taken from a user process into the
kernel. The interrupt takes the data that is on the stack and
passes it into an interrupt handler. The code for this interrupt
handler can be found in /usr/src/linux/arch/i386/kernel/entry.S.
This interrupt handler decides that this is a call into the
operating system and transfers control to the kernel to process the
request or decides that the call is an invalid call and returns
with an error.If the kernel notices that it is a request to create a file,
it goes into the routines that deals with file systems. This is
done through the sys_call_table entry for sys_creat. This linkage
takes you to the file /usr/src/linux/kernel/module.c and the
sys_create_module routine. This routine figures out if the file
name already exists returns an error or creates the name in the
directory name space. If the kernel notices that it is a read from
a file on a file system, through sys_call_table structure sys_read,
it calls the device driver that controls the file system and
eventually controls the hardware for the attached disk. The
/usr/src/linux/fs/read_write.c routine is linked for reads and
writes. For the read command, this eventually resolved to the
kernel_read located in /usr/src/linux/fs/exec.c. This module
determines which file system that the file resides and calls the
read function using the device driver structures linking it to the
read function. Similar entries exist for write and close.It turns out that the libc on Solaris had substantial error
checking, boundary checking, and stack controls that prohibited
users from hijacking the operating system. Unixware and GNU did not
meticulously check for these error conditions thus was
substantially faster. Since our intent was to produce the fastest
benchmark numbers possible, we went with the Unixware libc and
continued our optimizations. Once we figured out that we had
optimized everything in user space with tricks like running the
application as a real-time thread, running /tmp in tmpfs,
dynamically linking with a fast libc, and running the test three
times to make sure that all of the code fit into cache and remained
memory resident for subsequent runs, we were ready to figure out
how to optimize the kernel.The decision that we made at this point was that performance
was the most important objective. Security, stability, and
reliability were no longer concerns for our system and secondary
objectives. Stability was important as long as the system did not
crash before or during our tests. If your intent is truly to
proceed the fastest linkage from a read command into the kernel,
you might look at bypassing the read and going straight into the
system_call. This is a bit risky and does reduce functionality but
for raw reads and writes it produces optimum code.File system optimizationsThe next layer that we needed to look at was the file system
layer. When a file is created, the operating system must first
figure out what file system that the user is currently working in.
Fortunately, libc passes this information to the kernel as a
parameter to the create call. A directory structure is laid out
when files are created. Unfortunately, this structure is allocated
in a chunk of memory and uses a link list to correlate a file with
an inode.During file creation the kernel allocates an inode that will
describe this file and associates this inode with the file name
that we created. Three things effect performance at this layer. The
first, and most important is the length of the file name. For ufs,
the structure that holds the file name is double indexed on Solaris
but only a character array on Linux. The first index on Solaris can
hold file names upto 14 characters. If the file name is larger than
that, the index is a pointer to an array that can handle really
long file names. This construct does not exist on Linux. It uses a
default name length of 255 characters for this relationship. For
ext3, the structure that holds the file name is held in the
directory structure that correlates an inode to a filename. The
filenames are stored with an EXT2_NAME_LEN structure that grows and
shrinks as does the filename. This does add some overhead and
complication. Performance is no longer limited by a single file
name length but by all of the files in the directory.If the combined names of all files in a directory exceed a
512 byte boundary, a second read to the disk will be needed to get
the other names of the files. The operation will be slowed by the
speed of reading a superblock which at best could be in memory and
at worst must be read from disk. For operations like ls -l, the
directory is hit multiple times to find all of the files located in
the directory. This typically takes On2 time. If we have 100 files
in a directory, we will need to walk the directory list 100 times
to get all of the files. This means that we read the first file
name in the directory 10000 times to get to the last filename. If
we can only get one or two filenames into a dirent structure, we
will then cover 256K-512K of memory and consume 50-100 blocks on
the disk, assuming a 512 byte block. For a 32M system, we have
consumed almost 20% of available memory to describe 1000 files. A
typical Linux system has about 200,000 unique filenames including
devices and special file types like sockets and /proc entries, thus
not all of the directory/filename combinations can not be retained
in memory and must be cached.The second parameter that affects performance is the cache
size of the directory names. Making this too large consumes a
significant amount of memory. Making it too small forces the kernel
to re-read the superblock from the disk to get the directory and
file information. As was described in the previous section, having
as many superblocks and filenames in memory greatly reduces
performance penalties. Linux does not take the same approach as
other Unix implementations. Solaris, for example, allows the
administrator to put an upper limit on the size of the directory
name cache. This restricts the kernel from consuming as much memory
as it desires. Linux uses the dentry structure to allocate kernel
memory as files are instantiated. There is no upper bound to the
size of the dentry cache, only routines to shrink the size of the
cache when memory utilization becomes critical.Details of the allocation, pruning, and shrinking can be
found in /usr/src/linux/fs/dcache.c. FreeBSD uses a different cache
mechanism. For every file referenced or created, a vnode is
associated with it. A kernel counter generates a 32-bit number that
guarantees uniqueness. When the counter rolls over, file names are
removed from the cache entry thus limiting the number of file names
that can be allocated to 64K. This still could consume 16M of
memory if all file names were exactly 255 characters in length.
Care must taken with Linux not to have thousands of small files
that are relatively active have relatively large file names.
Experience with the Solaris operating system has shown that this
leads to frequent reboots to clear out the constructs of the kernel
and free memory thus the limit in the latest versions of Solaris.
The authors of BSD recommend rebooting your system at least once a
year to clear out this buffer.The third parameter, and least important, is the number and
layout of the inodes. Since we were working with small files, using
a different layout of the superblock and contiguous blocks on the
disk had little or no effect on the benchmark. If we were using a
database or something with a large dataset, we might put fewer
superblocks on the disk and allocate more space for data. The risk
is that a corrupted superblock will wipe out all the data in that
sector. Since the superblock is written to multiple places,
reducing the number of them reduces the locations that can be used
for redundancy. We can tune the file system if we know the size of
the files that we will be reading.Since data is read in 512 byte chunks from the disk, we can
tell the kernel to pre-fetch a group of data blocks if we know that
our files will be larger than this size. In our instance, we did
know this and tuned the kernel to read three blocks when one was
read in. This tuning reduced the number of commands that were sent
to the scsi controller and reduced the overhead of the system
calls. The penalty for this optimization was available memory but
improved reads on the average of 20ms per read. The mke2fs command
has options to do things like block sizes of 1024 to 4096 bytes per
block as well as the number of bytes per inode. These parameters
are defined a file system creation time and cannot be changed once
created.Device Driver OptimizationsThe third area that we looked at improving was at the device
driver layer. There really isn't much that you can tune at the
driver level. You can change the device characteristics but not
much in the kernel. Some of the things that we tried were using
RAID to stripe disks, using faster disks, and using disks with
memory cache buffers. It turns out that using disks with memory
cache buffers significantly improved performance.We found a 4 Gigabyte disk that had a 4 Gigabyte shadow ram
associated with it. When writes occurred, they happened at ram
speed. When reads happened, there was no delay or seek
optimizations to reduce head motion and disk latency. Everything
happened at memory speeds with the overhead of a file system and
disk driver. Using this tactic, we improved the performance of the
gcc benchmark by an additional 20%. The benchmarks that we ran
exposed some bugs in the striping software and hardware because
performance decreased when we striped the disks. The faster disks
reduced latency from 18ms per read to 12ms per read but when
looking at a return of 1ms that we found with the shadow disks,
this improvement was insignificant. The benchmark ran a few hundred
operations to the disk so this cumulative speedup reduced the total
execution time by a few seconds which was physically
noticeable.SummaryOverall, two things are important to remember when tuning
applications. First, compiler switches and intelligent compilers
can substantially improve performance. Switching the order of a
doubly nested loop might yield orders of magnitude performance
improvements, and many modern compilers know how to do this. Play
with the compiler switches. For our Spec96-gcc benchmark, we got 3%
performance improvement with the right switch combinations. It
typically yields 3-5% improvement with simple manipulations.Second, make sure you tune the operating system. Which libc
you link with makes a big difference in performance. In our
Spec96-gcc benchmark, we got a 15% performance improvement by using
a less robust but speed optimized libc. It also makes a big
difference in how secure your system might be so be careful when
making this tradeoff. Simple tuning of operating system structures
can yield some improvements, typically 2-3%, but no magic. The best
we got for Spec96-gcc was a 2% improvement. These tunings might
also hurt other applications that have different constructs and
operating conditions.Look at what your application is doing and decide if it is
worth the time and effort to tune. When we tuned the operating
system for gcc, we significantly reduced performance for the CPU
and memory intensive applications. Most operating systems do not
come out of the box tuned for any specific task other than running
a large number of processes and opening a bunch of small trivial
files. The best improvement we got was to use shadow ram disks that
dropped disk I/O from 10-15ms to 1-2ms. This yielded us 35% better
performance than with scsi or IDE disks. Overall, we improved
performance by tuning the system for the specific application that
we were interested in optimizing. We got 55% faster performance by
tuning for disk reads and writes.Pat Shuff received his Bachelors and Masters degrees in
Electrical Engineering from Texas A&M University. He then spent
thirteen years at Sun Microsystems focusing on operating systems,
system architecture, and commerce software. This work also allowed
him to work with companies like Compaq, IBM, and NCR in their
benchmarking centers. Currently, Pat is finishing up his Doctorate
in Computer Engineering at Texas A&M University.
email: shuff@tamu.edu










This week 5 lucky Members will receive a copy of The Official Ubuntu Server Book by Benjamin Mako Hill and Linux Journal's very own Kyle Rankin. No entry necessary. Check back here early next week to find out who the lucky Online Members are.




Comments
Re: Anatomy of a Read and Write Call
buffer_in is never intialized. It should be defined char buffer_in[sizeof(buffer_out)].
Small Spelling Mistake
> ... Three things effect performance at this layer. ...
I think you mean affect.
Re: Small Spelling Mistake
If they were done well, they would effect good performance
False Security
If the system depends on parameter checking for system calls to be in libc, then the system is not secure. The kernel must do these checks... after all, a program can bypass the C library and do a system call itself.
Also, checking for NULL buffers is silly. A program that reads into a NULL or writes from a NULL is broken and trying to make it looks like it is doing the right thing is probably a bad idea.
Post new comment