Linux in the Corporate Data Center
In order for Linux to be accepted into the corporate data center, it will need to perform well on "big iron": mainframes or large (greater than 8-way) SMP microprocessor-based systems. When most people consider performance on large-scale systems, they focus only on throughput or response time. But to be successful in the data center, all aspects of scalability must be addressed, including reliability, availability, serviceability, usability and manageability (RASUM). The RASUM issues for large-scale systems differ from those for desktops or smaller servers and need to be considered in software design.
Requirements for Linux in the data center can be broken down into two main categories: 1) performance and scalability, and 2) RASUM. This article will discuss some of the main reasons why changes to the kernel are required for performance and scalability.
The question could be asked, "Why not just call this performance or scalability?" Because in most cases it doesn't matter how well the system scales unless a base level of performance is achieved. But optimizing for lightly loaded systems at the expense of heavily loaded ones (in either single- or multiprocessor configurations) should not be the first priority. The goal is to achieve acceptable performance at both ends of the spectrum.
Starting with good base-level performance before working on scalability issues seems intuitive. And the reverse case--where one would intentionally give up base-level performance to improve scalability--runs counter to conventional wisdom, since most systems are single-CPU machines. In the second approach, the intent is not to make single-CPU machines run slower, but to give up some level of performance on a lightly loaded system to ensure that performance doesn't suffer when the load gets heavy. This technique works on both multiprocessor and single-processor systems. When done correctly, users will not notice the less efficient algorithm under light loads, but they will certainly enjoy the benefits of greater efficiency under heavy loads.
So what are some good examples of base-level performance enhancements that are required for Linux to be accepted into the corporate data center? In the interest of brevity, a few related examples will be discussed in this article, using feedback from developers of RDBMS software used by large corporations. I picked these performance issues because the normal usage model of a large data center system is to run database workloads. So let's focus on only some of the possible performance improvements needed for this type of workload.
Some requirements for data center performance are: an asynchronous I/O method that supports raw and list-directed I/O; and vectored super-user privileges, including the ability to lock down critical code and data, to make critical code regions non-interruptible, and to give process threads preferred scheduling on a per-CPU basis.
To best understand why these items are considered critical by many developers of large-scale, mission-critical database systems, we need to look at the complexity of these systems. The commercial RDBMSes currently used in corporate data centers are very large, complex pieces of software. The code base for an RDBMS kernel can exceed the size of the Linux kernel, and its developers' sophistication is commensurate with that of OS kernel developers. RDBMS developers have such deep knowledge about how their "application" behaves that, in many cases, they just need the OS to "get out of the way" in order to achieve the best performance.
So a little explanation of why these features are a win for Linux running an RDBMS workload goes a long way toward showing that this isn't a matter of "fixing" the RDBMS code to work with Linux. (This isn't an attempt to fully characterize an RDBMS kernel; we have neither the time nor the space.)
One of the main functions of an RDBMS kernel is to use locally cached data instead of doing an actual physical I/O. The goal is also to remove any possible synchronous I/O events from the critical path of a transaction. The general optimization is to avoid I/O if possible; if it must be done, make the I/O asynchronous and eliminate random I/O operations.
Consider a transaction, the normal unit of work that a database engine performs. A transaction might query multiple records--sometimes even hundreds--and then update only a few of them. In a day, some records might be updated by many transactions and others by only one. In some types of queries the chance of the data being reused soon enough to make caching worthwhile is zero; in others, immediate reuse is guaranteed. But once the I/O is issued by the RDBMS back end, there is no way to pass this information on to the OS, which therefore cannot know whether to discard the data block as soon as it is used or keep it around for reuse. The RDBMS kernel solves this problem by maintaining its own Global Buffer Area (GBA) and by examining the type of transaction to decide whether the read data should be placed on a least recently used (LRU) list or into a special set of buffers for immediate reuse. So when the RDBMS issues a read, there is no reason to cache the read data in the Linux buffer cache, as that just makes a copy of data that needs to remain inside the RDBMS.
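The placement decision described above can be sketched in a few lines of Python. This is purely illustrative: the class, method and hint names (`GlobalBufferArea`, `place`, `reuse_likely`) are invented for this sketch and do not come from any real RDBMS.

```python
from collections import OrderedDict

class GlobalBufferArea:
    """Toy GBA: an LRU list for pages worth keeping, plus a scratch
    slot for pages that will not be reused (e.g. a one-pass scan)."""

    def __init__(self, lru_capacity):
        self.lru = OrderedDict()       # blockno -> page data, LRU order
        self.scratch = None            # single slot, recycled immediately
        self.lru_capacity = lru_capacity

    def place(self, blockno, data, reuse_likely):
        """Cache a freshly read block according to the transaction's hint."""
        if not reuse_likely:
            self.scratch = (blockno, data)     # overwrite, no aging
            return
        self.lru[blockno] = data
        self.lru.move_to_end(blockno)          # mark most recently used
        if len(self.lru) > self.lru_capacity:
            self.lru.popitem(last=False)       # evict the oldest page

    def lookup(self, blockno):
        if blockno in self.lru:
            self.lru.move_to_end(blockno)      # touch on hit
            return self.lru[blockno]
        return None
```

The key point is that only the RDBMS has the transaction-type information needed to choose between the two paths; a generic OS buffer cache would have to guess.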
Another reason for not using the Linux buffer cache is that you would still need to copy the data from the Linux cache to the RDBMS cache. This causes performance issues similar to those of using a bounce buffer to support memory above the 32-bit limit when the DMA hardware cannot transfer to the higher addresses.
These are the primary reasons the RDBMS wants raw or direct I/O that bypasses the file system. Since another goal is to remove synchronous I/O operations from the critical path of a transaction, the data blocks inside the GBA are normally referred to as pages. These may or may not correspond to the underlying hardware page size, but in many ways they are managed and used like the pages in any other type of VM subsystem. And as with many VMs, there are triggers that cause various daemons to perform various functions. One of these functions writes out dirty database pages; we will call it the database writer (DW).
The DW scans the GBA looking for pages that are "dirty", or modified, and then orders the I/O in a logical manner. If we make up an example GBA and work through some numbers, we will see the justification for a list-directed I/O subsystem.
Page Size = 8,192 bytes
Number of Pages = 131,072 (1GB GBA)
Dirty Page Maximum = 95%
Cleaning Limit = 85%
In this example, when the database reaches its dirty page limit, it will try to write out at least 13,107 pages of data. If the system is very active it might try to write out a great deal more. Most importantly, since the process knows it has to write out this large number of pages, why not put multiple I/O requests into one system call? If you could schedule 100 I/O operations in one system call, you could reduce the system call overhead by a factor of 100, freeing the CPU for more productive work.
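The arithmetic behind those numbers can be checked with a short Python sketch. The function names here are hypothetical, chosen only to mirror the example parameters above:

```python
import math

PAGE_SIZE = 8192
GBA_BYTES = 1 << 30                      # 1 GB Global Buffer Area
NUM_PAGES = GBA_BYTES // PAGE_SIZE       # 131,072 pages
DIRTY_MAX = 0.95                         # DW wakes up at this mark...
CLEAN_LIMIT = 0.85                       # ...and flushes down to this one

def pages_to_flush(num_pages=NUM_PAGES, hi=DIRTY_MAX, lo=CLEAN_LIMIT):
    """Pages the database writer must clean to fall back to the limit."""
    return int(num_pages * (hi - lo) + 0.5)

def syscalls_needed(n_ios, batch=1):
    """System calls needed to submit n_ios requests, batch at a time."""
    return math.ceil(n_ios / batch)
```

With these parameters, `pages_to_flush()` gives the 13,107 pages quoted above; submitting them one write per call costs 13,107 system calls, while batching 100 requests per call drops that to 132.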
There really are only two I/O operations that an RDBMS cannot defer. The first is the database log write, as this is the mechanism that allows it to defer writing the pages that have been modified. The performance issue here is resolved by piggy-backing multiple transactions into large serial writes, which are much quicker than small, random writes.
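The piggy-backing idea (often called group commit) can be sketched as follows. The `LogWriter` class and its interfaces are hypothetical, invented for this illustration; real engines add durability guarantees (fsync, torn-write protection) that are omitted here.

```python
class LogWriter:
    """Toy group-commit sketch: several transactions' log records are
    piggy-backed into a single large serial write."""

    def __init__(self, device):
        self.device = device      # anything with a write() method
        self.pending = []         # records awaiting the next flush
        self.flushes = 0          # physical writes actually issued

    def append(self, record):
        """Queue one transaction's log record; no I/O happens yet."""
        self.pending.append(record)

    def commit(self):
        """One serial write covers every queued transaction."""
        if self.pending:
            self.device.write(b"".join(self.pending))
            self.flushes += 1
            self.pending.clear()
```

Three transactions that would each have paid for a small random write instead share the cost of one sequential write at the end of the log.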
The second is when a transaction must read many data blocks from the physical disk drives. The problem is that there might be tens or even hundreds of physical I/Os required to execute the transaction. If you assume that a physical I/O takes 32 msec from request generation to the return of the call, a transaction that requires 50 serialized reads incurs 1.6 seconds of read I/O overhead. But if you could schedule these I/Os using a true asynchronous read that supported a vectored I/O list, the entire set of reads could complete in roughly 32 msec, eliminating nearly all of that 1.6 seconds of overhead.
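The latency arithmetic is simple enough to write down. This sketch idealizes the asynchronous case (it ignores device queuing and bandwidth limits, assuming all reads genuinely overlap), and the function names are invented for illustration:

```python
IO_LATENCY_MS = 32   # assumed request-to-completion time for one read

def serialized_read_ms(n_reads, latency_ms=IO_LATENCY_MS):
    """Each read waits for the previous one, so latencies add up."""
    return n_reads * latency_ms

def async_vectored_read_ms(n_reads, latency_ms=IO_LATENCY_MS):
    """All reads submitted in one list overlap, so (ideally) the
    elapsed time is a single device latency regardless of count."""
    return latency_ms
```

For the 50-read transaction in the text, the serialized path costs 1,600 msec while the idealized vectored path costs 32 msec.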
Another way RDBMS designers ensure the predictable performance their customers demand is through the management of critical code regions. The ability to shield these code and data segments from paging or swap ensures that thousands of users don't pause while critical code or data held by one process is swapped back into memory. Since code being bumped to handle an interrupt would cause the same problem, the ability to turn off interrupts when entering these regions further ensures predictability in response times.
These abilities have traditionally been reserved for the kernel or root processes and have been referred to as super-user privileges. Since one would obviously not want to make them available to every user, these features need to be controlled case by case, based on things like user ID, process ID or executable file. A good way of thinking of these features is as vectored super-user privileges.
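The idea of per-user, per-executable privilege grants can be sketched as a lookup table. This is a toy model, not any real kernel interface: the privilege names below are invented, though Linux capabilities (for example, CAP_IPC_LOCK governing memory lock-down via mlock) are a real-world analogue of splitting root's powers into separate grants.

```python
# Hypothetical privilege names for illustration only.
MEMORY_LOCK, CPU_BIND, IRQ_OFF = "mem_lock", "cpu_bind", "irq_off"

class PrivilegeTable:
    """Toy 'vectored super-user privilege' table: each privilege is
    granted individually per (uid, executable) instead of the
    all-or-nothing power of root."""

    def __init__(self):
        self.grants = set()                 # (uid, exe, privilege)

    def grant(self, uid, exe, privilege):
        self.grants.add((uid, exe, privilege))

    def check(self, uid, exe, privilege):
        """Root keeps everything; others need an explicit grant."""
        return uid == 0 or (uid, exe, privilege) in self.grants
```

A database server binary could thus be granted memory lock-down and CPU binding without being able to, say, disable interrupts or act as root in any other way.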
Tim Witham is the director of the Open Source Development Lab (OSDL), the industry's first independent, non-profit lab for developers adding enterprise capabilities to Linux. He has been a developer and architect of large-scale commercial Linux and UNIX systems for over 19 years.