Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the LAN
Everybody runs out of disk space at some time. Fortunately, hard drives keep getting larger and cheaper. Even so, the more disk space there is, the more we use, and soon we run out again.
Some kinds of data are huge by nature. Video, for example, always takes up a lot of space. Businesses often need to store video data, especially with digital surveillance becoming more common. Even at home, we enjoy watching and making movies on our computers.
Backup and data redundancy are essential to any business using computers. It seems no matter how much storage capacity there is, it always would be nice to have more. Even e-mail can overgrow any container we put it in, as Internet service providers know too well.
Unlimited storage becomes possible when the disks come out of the box, decoupling the storage from the computer that's using it. The principle of decoupling related components to achieve greater flexibility shows up in many domains, not only data storage. Modular source code can be used more flexibly to meet unforeseen needs, and a stereo system made from components can be used in more interesting configurations than an all-in-one stereo box can be.
The most familiar example of out-of-the-box storage probably is the storage area network (SAN). I remember when SANs started to create a buzz; it was difficult to work past the hype and find out what they really were. When I finally did, I was somewhat disappointed to find that SANs were complex, proprietary and expensive.
In supporting these SANs, though, the Linux community has made helpful changes to the kernel. The enterprise versions of 2.4 kernel releases informed the development of new features of the 2.6 kernel, and today's stable kernel has many abilities we lacked only a few years ago. It can use huge block devices, well over the old limit of two terabytes. It can support many more simultaneously connected disks. There's also support for sophisticated storage volume management. In addition, filesystems now can grow to huge sizes, even while mounted and in use.
This article describes a new way to leverage these new kernel features, taking disks out of the computer and overcoming previous limits on storage use and capacity. You can think of ATA over Ethernet (AoE) as a way to replace your IDE cable with an Ethernet network. With the storage decoupled from the computer and the flexibility of Ethernet between the two, the possibilities are limited only by your imagination and willingness to learn new things.
ATA over Ethernet is a network protocol registered with the IEEE as Ethernet protocol 0x88a2. AoE is low level, much simpler than TCP/IP or even IP. TCP/IP and IP are necessary for the reliable transmission of data over the Internet, but the computer has to work harder to handle the complexity they introduce.
Users of iSCSI have noticed this issue with TCP/IP. iSCSI is a way to send I/O over TCP/IP, so that inexpensive Ethernet equipment may be used instead of Fibre Channel equipment. Many iSCSI users have started buying TCP offload engines (TOE). These TOE cards are expensive, but they remove the burden of doing TCP/IP from the machines using iSCSI.
An interesting observation is that most of the time, iSCSI isn't actually used over the Internet. If the packets simply need to go to a machine in the rack next door, the heavyweight TCP/IP protocol seems like overkill.
So instead of offloading TCP/IP, why not dispense with it altogether? The ATA over Ethernet protocol does exactly that, taking advantage of today's smart Ethernet switches. A modern switch has flow control, maximizing throughput and limiting packet collisions. On the local area network (LAN), packet order is preserved, and each packet is checksummed for integrity by the networking hardware.
Each AoE packet carries a command for an ATA drive or the response from the ATA drive. The AoE Linux kernel driver performs AoE and makes the remote disks available as normal block devices, such as /dev/etherd/e0.0—just as the IDE driver makes the local drive at the end of your IDE cable available as /dev/hda. The driver retransmits packets when necessary, so the AoE devices look like any other disks to the rest of the kernel.
In addition to ATA commands, AoE has a simple facility for identifying available AoE devices using query config packets. That's all there is to it: ATA command packets and query config packets.
Anyone who has worked with or learned about SANs likely wonders at this point, “If all the disks are on the LAN, then how can I limit access to the disks?” That is, how can I make sure that if machine A is compromised, machine B's disks remain safe?
The answer is that AoE is not routable. You easily can determine what computers see what disks by setting up ad hoc Ethernet networks. Because AoE devices don't have IP addresses, it is trivial to create isolated Ethernet networks. Simply power up a switch and start plugging in things. In addition, many switches these days have a port-based VLAN feature that allows a switch effectively to be partitioned into separate, isolated broadcast domains.
The AoE protocol is so lightweight that even inexpensive hardware can use it. At this time, Coraid is the only vendor of AoE hardware, but other hardware and software developers should be pleased to find that the AoE specification is only eight pages in length. This simplicity is in stark contrast to iSCSI, which is specified in hundreds of pages, including the specification of encryption features, routability, user-based access and more. Complexity comes at a price, and now we can choose whether we need the complexity or would prefer to avoid its cost.
Simple primitives can be powerful tools. It may not come as a surprise to Linux users to learn that even with the simplicity of AoE, a bewildering array of possibilities present themselves once the storage can reside on the network. Let's start with a concrete example and then discuss some of the possibilities.
The following example is based on a true story. Stan is a fictional sysadmin working for the state government. New state legislation requires that all official documents be archived permanently. Any state resident can demand to see any official document at any time. Stan therefore needs a huge storage capacity that can grow without bounds. The performance of the storage needn't be any better than a local ATA disk, though. He wants all of the data to be retrievable easily and immediately.
Stan is comfortable with Ethernet networking and Linux system administration, so he decides to try ATA over Ethernet. He buys some equipment, paying a bit less than $6,500 US for all of the following:
One dual-port gigabit Ethernet card to replace the old 100Mb card in his server.
One 26-port network switch with two gigabit ports.
One Coraid EtherDrive shelf and ten EtherDrive blades.
Ten 400GB ATA drives.
The shelf of ten blades takes up three rack units. Each EtherDrive blade is a small computer that performs the AoE protocol to effectively put one ATA disk on the LAN. Striping data over the ten blades in the shelf results in about the throughput of a local ATA drive, so the gigabit link helps to use the throughput effectively. Although he could have put the EtherDrive blades on the same network as everyone else, he has decided to put the storage on its own network, connected to the server's second network interface, eth1, for security and performance.
Stan reads the Linux Software RAID HOWTO (see the on-line Resources) and decides to use a RAID 10—striping over mirrored pairs—configuration. Although this configuration doesn't result in as much usable capacity as a RAID 5 configuration, RAID 10 maximizes reliability, minimizes the CPU cost of performing RAID and has a shorter array re-initialization time if one disk should fail.
After reading the LVM HOWTO (see Resources), Stan comes up with a plan to avoid ever running out of disk space. JFS is a filesystem that can grow dynamically to large sizes, so he is going to put a JFS filesystem on a logical volume. The logical volume resides, for now, on only one physical volume. That physical volume is the RAID 10 block device. The RAID 10 is created from the EtherDrive storage blades in the Coraid shelf using Linux software RAID. Later, he can buy another full shelf, create another RAID 10, make it into a physical volume and use the new physical volume to extend the logical volume where his JFS lives.
Listing 1 shows the commands Stan uses to prepare his server for doing ATA over Ethernet. He builds the AoE driver with AOE_PARTITIONS=1, because he's using a Debian sarge system running a 2.6 kernel. Sarge doesn't support large minor device numbers yet (see the Minor Numbers sidebar), so he turns off disk partitioning support in order to be able to use more disks. Also, because of Debian bug 292070, Stan installs the latest device mapper and LVM2 userland software.
Listing 1. The first step in building a software RAID device from several AoE drives is setting up AoE.
# setting up the host for AoE # build and install the AoE driver tar xvfz aoe-2.6-5.tar.gz cd aoe-2.6-5 make AOE_PARTITIONS=1 install # AoE needs no IP addresses! :) ifconfig eth1 up # let the network interface come up sleep 5 # load the ATA over Ethernet driver modprobe aoe # see what aoe disks are available aoe-stat
Minor Device Numbers
A program that wants to use a device typically does so by opening a special file corresponding to that device. A familiar example is the /dev/hda file. An ls -l command shows two numbers for /dev/hda, 3 and 0. The major number is 3 and the minor number is 0. The /dev/hda1 file has a minor number of 1, and the major number is still 3.
Until kernel 2.6, the minor number was eight bits in size, limiting the possible minor numbers to 0 through 255. Nobody had that many devices, so the limitation didn't matter. Now that disks have been decoupled from servers, it does matter, and kernel 2.6 uses 20 bits for the minor device number.
Having 1,048,576 values for the minor number is a big help to systems that use many devices, but not all software has caught up. If glibc or a specific application still thinks of minor numbers as eight bits in size, you are going to have trouble using minor device numbers over 255.
To help during this transitional period, the AoE driver may be compiled without support for partitions. That way, instead of there being 16 minor numbers per disk, there's only one per disk. So even on systems that haven't caught up to the large minor device numbers of 2.6, you still can use up to 256 AoE disks.
The commands for creating the filesystem and its logical volume are shown in Listing 2. Stan decides to name the volume group ben and the logical volume franklin. LVM2 now needs a couple of tweaks made to its configuration. For one, it needs a line with types = [ "aoe", 16 ] so that LVM recognizes AoE disks. Next, it needs md_component_detection = 1, so the disks inside RAID 10 are ignored when the whole RAID 10 becomes a physical volume.
Listing 2. Setting Up the Software RAID and the LVM Volume Group
# speed up RAID initialization for f in `find /proc | grep speed`; do echo 100000 > $f done # create mirrors (mdadm will manage hot spares) mdadm -C /dev/md1 -l 1 -n 2 \ /dev/etherd/e0.0 /dev/etherd/e0.1 mdadm -C /dev/md2 -l 1 -n 2 \ /dev/etherd/e0.2 /dev/etherd/e0.3 mdadm -C /dev/md3 -l 1 -n 2 \ /dev/etherd/e0.4 /dev/etherd/e0.5 mdadm -C /dev/md4 -l 1 -n 2 -x 2 \ /dev/etherd/e0.6 /dev/etherd/e0.7 \ /dev/etherd/e0.8 /dev/etherd/e0.9 sleep 1 # create the stripe over the mirrors mdadm -C /dev/md0 -l 0 -n 4 \ /dev/md1 /dev/md2 /dev/md3 /dev/md4 # make the RAID 10 into an LVM physical volume pvcreate /dev/md0 # create an extendible LVM volume group vgcreate ben /dev/md0 # look at how many "physical extents" there are vgdisplay ben | grep -i 'free.*PE' # create a logical volume using all the space lvcreate --extents 88349 --name franklin ben modprobe jfs mkfs -t jfs /dev/ben/franklin mkdir /bf mount /dev/ben/franklin /bf
I duplicated Stan's setup on a Debian sarge system with two 2.1GHz Athlon MP processors and 1GB of RAM, using an Intel PRO/1000 MT Dual-Port NIC and puny 40GB drives. The network switch was a Netgear FS526T. With a RAID 10 across eight of the EtherDrive blades in the Coraid shelf, I saw a sustainable read throughput of 23.58MB/s and a write throughput of 17.45MB/s. Each measurement was taken after flushing the page cache by copying a 1GB file to /dev/null, and a sync command was included in the write times.
The RAID 10 in this case has four stripe elements, each one a mirrored pair of drives. In general, you can estimate the throughput of a collection of EtherDrive blades easily by considering how many stripe elements there are. For RAID 10, there are half as many stripe elements as disks, because each disk is mirrored on another disk. For RAID 5, there effectively is one disk dedicated to parity data, leaving the rest of the disks as stripe elements.
The expected read throughput is the number of stripe elements times 6MB/s. That means if Stan bought two shelves initially and constructed an 18-blade RAID 10 instead of his 8-blade RAID 10, he would expect to get a little more than twice the throughput. Stan doesn't need that much throughput, though, and he wanted to start small, with a 1.6TB filesystem.
Listing 3 shows how Stan easily can expand the filesystem when he buys another shelf. The listings don't show Stan's mdadm-aoe.conf file or his startup and shutdown scripts. The mdadm configuration file tells an mdadm process running in monitor mode how to manage the hot spares, so that they're ready to replace any failed disk in any mirror. See spare groups in the mdadm man page.
Listing 3. To expand the filesystem without unmounting it, set up a second RAID 10 array, add it to the volume group and then increase the filesystem.
# after setting up a RAID 10 for the second shelf # as /dev/md5, add it to the volume group vgextend ben /dev/md5 vgdisplay ben | grep -i 'free.*PE' # grow the logical volume and then the jfs lvextend --extents +88349 /dev/ben/franklin mount -o remount,resize /bf
The startup and shutdown scripts are easy to create. The startup script simply assembles each mirrored pair RAID 1, assembles each RAID 0 and starts an mdadm monitor process. The shutdown script stops the mdadm monitor, stops the RAID 0s and, finally, stops the mirrors.
Now that we've seen a concrete example of ATA over Ethernet in action, readers might be wondering what would happen if another host had access to the storage network. Could that second host mount the JFS filesystem and access the same data? The short answer is, “Not safely!” JFS, like ext3 and most filesystems, is designed to be used by a single host. For these single-host filesystems, filesystem corruption can result when multiple hosts mount the same block storage device. The reason is the buffer cache, which is unified with the page cache in 2.6 kernels.
Linux aggressively caches filesystem data in RAM whenever possible in order to avoid using the slower block storage, gaining a significant performance boost. You've seen this caching in action if you've ever run a find command twice on the same directory.
Some filesystems are designed to be used by multiple hosts. Cluster filesystems, as they are called, have some way of making sure that the caches on all of the hosts stay in sync with the underlying filesystem. GFS is a great open-source example. GFS uses cluster management software to keep track of who is in the group of hosts accessing the filesystem. It uses locking to make sure that the different hosts cooperate when accessing the filesystem.
By using a cluster filesystem such as GFS, it is possible for multiple hosts on the Ethernet network to access the same block storage using ATA over Ethernet. There's no need for anything like an NFS server, because each host accesses the storage directly, distributing the I/O nicely. But there's a snag. Any time you're using a lot of disks, you're increasing the chances that one of the disks will fail. Usually you use RAID to take care of this issue by introducing some redundancy. Unfortunately, Linux software RAID is not cluster-aware. That means each host on the network cannot do RAID 10 using mdadm and have things simply work out.
Cluster software for Linux is developing at a furious pace. I believe we'll see good cluster-aware RAID within a year or two. Until then, there are a few options for clusters using AoE for shared block storage. The basic idea is to centralize the RAID functionality. You could buy a Coraid RAIDblade or two and have the cluster nodes access the storage exported by them. The RAIDblades can manage all the EtherDrive blades behind them. Or, if you're feeling adventurous, you also could do it yourself by using a Linux host that does software RAID and exports the resulting disk-failure-proofed block storage itself, by way of ATA over Ethernet. Check out the vblade program (see Resources) for an example of software that exports any storage using ATA over Ethernet.
Because ATA over Ethernet puts inexpensive hard drives on the Ethernet network, some sysadmins might be interested in using AoE in a backup plan. Often, backup strategies involve tier-two storage—storage that is not quite as fast as on-line storage but also is not as inaccessible as tape. ATA over Ethernet makes it easy to use cheap ATA drives as tier-two storage.
But with hard disks being so inexpensive and seeing that we have stable software RAID, why not use the hard disks as a backup medium? Unlike tape, this backup medium supports instant access to any archived file.
Several new backup software products are taking advantage of filesystem features for backups. By using hard links, they can perform multiple full backups with the efficiency of incremental backups. Check out the Backup PC and rsync backups links in the on-line Resources for more information.
Putting inexpensive disks on the local network is one of those ideas that make you think, “Why hasn't someone done this before?” Only with a simple network protocol, however, is it practical to decouple storage from servers without expensive hardware, and only on a local Ethernet network can a simple network protocol work. On a single Ethernet we don't need the complexity and overhead of a full-fledged Internet protocol such as TCP/IP.
If you're using storage on the local network and if configuring access by creating Ethernet networks is sufficient, then ATA over Ethernet is all you need. If you need features such as encryption, routability and user-based access in the storage protocol, iSCSI also may be of interest.
With ATA over Ethernet, we have a simple alternative that has been conspicuously absent from Linux storage options until now. With simplicity comes possibilities. AoE can be a building block in any storage solution, so let your imagination go, and send me your success stories.
I owe many thanks to Peter Anderson, Brantley Coile and Al Dixon for their helpful feedback. Additional thanks go to Brantley and to Sam Hopkins for developing such a great storage protocol.
Resources for this article: /article/8201.
Ed L. Cashin has wandered through several academic and professional Linux roles since 1997, including Web application developer, system administrator and kernel hacker. He now works at Coraid, where ATA over Ethernet was designed, and he can be reached at firstname.lastname@example.org. He enjoys music and likes to listen to audio books on his way to martial arts classes.