How Many Disks Are Too Many for a Linux System?
With recent advances in network bandwidth, resources and facilities are becoming centralized once again. The trend of departmental computers and services is fading as larger, more centralized services appear. In regards to this centralization, one of the questions most IT managers ask is “What size server is really needed?” In this article, we analyze how large a server is needed for file services and how many disks a typical Linux system can support. We assume we are dealing with a typical Intel architecture solution and that one or more high speed network connections is available to handle the load and requests from the client systems. If we assume two or more Gigabit network adaptors, this should provide enough bandwidth for most applications and data centers.
When attempting to see how many disks can fit into a system, we must analyze two parts of the problem. The first is the operating system itself. Linux has some limitations that prohibit it from hosting a million disk drives. Some of these limitations can be worked around, but others are hard limits. The second problem we face is hardware. The backplane, power requirements and physical construction of the computer physically restrict the number of disks that can be attached to a system.
We start by looking at the hardware limitations of three systems commercially available today. We chose low-end, mid-range and high-end systems: an HP Presario 4400US, a Dell PowerEdge and an IBM zSeries, respectively.
Low-end Computer: The HP Presario lists for $549 and has one extra IDE drive slot for expansion and two additional PCI slots for disk controller expansion. Because the extra disks are not able to pull enough power from the system power supply, the disks need to be attached to a separate chassis and power supply. If we continue with the HP option of the HP Surestore Disk System 2300, we can get 14 disks per disk controller. This will give us a total of 30 disks, about 2TB of storage (assuming 73GB per disk). Each of these disks will yield about 11MB/sec across the 160MB/sec communication channel. We could go with a more robust fibre channel storage solution, the HP Surestore Disk Array XP1024. It allows for 1,024 disks or about 74TB storage per controller. Unfortunately, the system bus on our Presario 4400 only goes to 100MHz, while the Surestore XP1024 runs at 2GB/sec. The system we selected would not be able to handle such a high-end disk system, thus limiting us to the Ultra160 technology and 30 disks.
Mid-range Computer: If we decide to raise our expectations and look at a Dell PowerEdge 2650, which lists for $2,199, we can attach significantly more storage. The internal controller supports five SCSI disks and has the option of three expansion slots, one at 133Hz and two at 100MHz. The motherboard backplane runs at 400MHz, so the embedded SCSI controller can operate much faster than the expansion controllers. By using the PowerVault 220S/221S SCSI Disk Storage Unit, we can attach 47 disks, three 14-unit PowerVaults and five internal disks, for a total storage size of 3.4TB. We also can expand the memory of this system to 6GB of RAM, which would handle the disk operations much better than the 512MB limitation of the Presario 4400. On this system, we also could go for a more robust fibre channel storage solution, the Dell/EMC FC4700-2. This system allows 120 drives per fibre channel, or 360 disks yielding 63TB of storage on a single system. Since this system transfers data at 360MB/sec, the backplane running at 133MHz could easily handle the operations, but we are reaching the limits for the 100MHz backplane.
High-end Computer: At the very high end, we could use the IBM zSeries 900, which supports up to 64GB of memory and multiple fibre channel connectors. There really isn't a limit to the number of disks that can be supported through this fibre channel configuration, but connection to a single disk subsystem is limited to 224 drives on a single instance (16TB per subsystem). Direct SCSI is not supported, but a fibre to SCSI interchange is available.
In summary, on the low end, the hardware limitations begin with the backplane support of multiple controllers limiting the system to 30 disks. The mid-range system tops out at 360 disks, while the high-end machine does not have a practical upper limit. All of these systems could utilize storage area networks or network attached storage to increase their limits, but initially we wanted to analyze the hardware limitations imposed by the architecture for direct attached storage.
In attempting to analyze how many disks can fit onto a single Linux system and what limits the operating system places on our design, we must first break the analysis into subcomponents. The three resources that most computers have difficulty with are CPU, memory and physical I/O.
Let's take a Pentium 4 2GHz processor operating on a 533MHz backplane. This backplane yields 4.3GB/sec data transfer rates by using source-synchronous transfer of address and data to achieve four data transfers per bus clock. If we first assume that, on average, you get about two instructions executing per clock cycle, this translates to about 4x109 instructions per second. If we then assume that every interrupt for an I/O operation takes 30x103 instructions, this correlates to about 133x103 I/O operations per second. On the surface this looks like a very large amount of I/O on a single system. A single disk that has 11ms latency between operations can produce about 90 I/O operations per second. This directly correlates to about 1,500 disks per CPU that can be supported at maximum transfer rate from the disk. Clearly, a single processor can handle a substantial amount of disks. If no other factor limits our design, we can support 1,500 disks per processor on our system.
Unfortunately, other factors do limit the operation. The most obvious one is the backplane. With a 533MHz backplane, we get a 4.3GB/sec data transfer rate. If we use our 133x103 I/O operations per second, this translates into 4.3GB/sec, which also is our processor limit. If we drop down to a Pentium III operating at 1.4GHz with a 133MHz clock speed, we don't get the source-synchronous transfer advantage and get one data transfer per clock. This significantly drops our backplane performance down to 1.1GB/sec or about 400 disks per processor. This number is still substantial but not the eventual limiting factor in performance.
Memory typically is not a problem for Linux because it uses a Berkeley-style memory allocation method, similar to FreeBSD, where memory is allocated as needed and freed when critical limits are reached. If we look at the inode allocation—the control blocks that manage data blocks on the disk—it is done on an as-needed basis. That is, the inodes are released when the operating system runs out of memory. Theoretically, if a system is doing nothing more than managing filesystems and disk drives, it can use all of the available memory to manage these resources.
An inode is relatively small. If we look at the ext3 filesystem, a typical inode consumes 112 bytes of data to describe its qualities and characteristics. Each inode describes a 512-byte of data on the disk. If we have a 20GB disk, we will need about 40,000 inodes to describe the disk. This will consume about 4MB of memory. Using our limits from the systems that we initially analyzed, we find that the low-end machine would require 128MB of memory if all inodes were active. Since most Intel systems limit out at about 4GB of memory, the memory to describe the storage will be adequate for the low end. Unfortunately, we can put 360 disks on the midrange and 224 disks per fiber channel on the high end. The mid-range system would consume about 30% of the available memory just describing the inodes, leaving not a lot room for the filesystem required on top of the inode tables. The high-end limits are reached at two fiber channels, thus limiting the number of disks that we can attach to about 500 disks.
Additional research is required to determine if addressing problems come into play. When we are talking about a few hundred disks on a controller, addressing should not be a problem. In fact, getting 128 disks should be relatively simple. When we start talking about thousands of disks, however, the major and minor numbers become a problem. In the Linux sd.c, for example, there is a hard limit of 16 disks per major block number, and there are eight major block numbers for a SCSI controller. This limits the number of disks the operating system can describe to 128 disks per SCSI controller. This limitation is not true for network attached SCSI devices or fiber channel devices, because they have different device drivers. This number also will change for different controllers. If additional hardware exists on the SCSI controller to address and notice additional disks, the generic sd.c can be replaced to expand this limitation.
One of the new technologies gaining momentum is network storage. Fiber channel disks and internet SCSI devices are removing the disks from attached channels and placing them on private high-speed networks. When we move the connection problem from devices to high speed networks, we see some additional improvements but not a lot. If we look at the Ethernet protocol specified in IEEE 803.2, we notice that each internet packet requires 576 to 12,208 bits to transport data across the network. If we layer protocols like IP, UDP and TCP on top of the specification, we require close to 5,000 bits to do a simple 512-byte data transfer. Layering the iSCSI protocol on top of these transports brings the requirement up to 6,000 bits.
If we implement our storage network over a 10baseT network and don't want performance that is better than network filesystem protocols, we must limit our latency to less than 100ms. This gives us roughly 165 supportable disks from one client or server. If we use 100baseT networks, this number grows to 1,650 devices. Most storage area network solutions (SANS) utilize a 2GB fiber channel, which will allow us to address up to 33,000 disks on a private network.
If we centralize our disks and share them via a network filesystem, we achieve better control of our data by having it attached to a single resource, but we also create a single point of failure and performance bottleneck. Most network filesystems carry a large amount of overhead relating to authentication and authorization. If we look at the NFS protocol, for example, it requires about 1,000 bytes to read directory attributes, 1,000 bytes to look up directory information, 1,000 bytes to read the file attributes and 1,250 bytes to read a 512-byte block of data. This overhead limits the number of clients that can attach to a centralized storage system to 24 clients using a 10baseT network, 238 clients on a 100baseT network and 2,400 clients on a Gigabit Ethernet network.
In looking at the limitations on the number of disks that can be attached to a single system, we note that the limitations are not tied to the Linux operating system. Instead, the limitations are bandwidth, latency and addressability—functions of the hardware selected for connectivity. The main bottleneck in disk connectivity is the amount of memory required to address each track and sector on a disk drive. The inode structure consumes a significant amount of memory. The differences between a low-end computer and a high-end computer typically limit the solution due to backplane expandability and speed. The difference between the mid-range and high-end computers is relatively trivial if the storage is placed on the network. If the storage is attached to the system and shared using a network filesystem, a high-end system with additional processors is needed to handle the disk and network transactions, as well as the additional memory required to handle disk and network filesystem requests.
To summarize the results, for a very low-end system, we can attach 30 disks using SCSI controllers and an additional 219 on a private 10baseT network. These disks require a minimum of 4GB of memory to come close to addressing a fraction of the disks. Since most low-end systems are limited to 2GB, we are able to attach as many disks as we want to the system, but we must limit the number of active inodes allocated to a few thousand. We could expand the number of network disks by upgrading to a faster network card, thereby increasing this number to 16,000 disks using a private Gigabyte Ethernet network. This many disks will probably outstrip a single processor and backplane and make connecting this many disks, therefore, undesirable. Moving up to the mid-range and high-end systems, the number of disks is in the thousands because we can put multiple network adaptors in the system and push the limitation to the backplane and memory system. We can put more and more processors with more memory into the high-end systems and eliminate the performance bottlenecks with more resources. The true bottleneck becomes the backplane and how much data can be pumped from the network controllers into memory. For the mid-range and high-end systems, the number of attached disks possible is limited to a few hundred. The number of network disks is limited to a few tens of thousands, if we are willing to accept 100ms latency to the disk; otherwise, a few thousand disks is the limit if we want speeds similar to attached-disk characteristics.