Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the LAN

With ATA hard drives now being cheaper than tape, this simple new storage technology enables you to build storage arrays for archives, backup or live use.

Everybody runs out of disk space at some time. Fortunately, hard drives keep getting larger and cheaper. Even so, the more disk space there is, the more we use, and soon we run out again.

Some kinds of data are huge by nature. Video, for example, always takes up a lot of space. Businesses often need to store video data, especially with digital surveillance becoming more common. Even at home, we enjoy watching and making movies on our computers.

Backup and data redundancy are essential to any business using computers. It seems no matter how much storage capacity there is, it always would be nice to have more. Even e-mail can overgrow any container we put it in, as Internet service providers know too well.

Unlimited storage becomes possible when the disks come out of the box, decoupling the storage from the computer that's using it. The principle of decoupling related components to achieve greater flexibility shows up in many domains, not only data storage. Modular source code can be used more flexibly to meet unforeseen needs, and a stereo system made from components can be used in more interesting configurations than an all-in-one stereo box can be.

The most familiar example of out-of-the-box storage probably is the storage area network (SAN). I remember when SANs started to create a buzz; it was difficult to work past the hype and find out what they really were. When I finally did, I was somewhat disappointed to find that SANs were complex, proprietary and expensive.

In supporting these SANs, though, the Linux community has made helpful changes to the kernel. The enterprise versions of 2.4 kernel releases informed the development of new features of the 2.6 kernel, and today's stable kernel has many abilities we lacked only a few years ago. It can use huge block devices, well over the old limit of two terabytes. It can support many more simultaneously connected disks. There's also support for sophisticated storage volume management. In addition, filesystems now can grow to huge sizes, even while mounted and in use.

This article describes a new way to leverage these new kernel features, taking disks out of the computer and overcoming previous limits on storage use and capacity. You can think of ATA over Ethernet (AoE) as a way to replace your IDE cable with an Ethernet network. With the storage decoupled from the computer and the flexibility of Ethernet between the two, the possibilities are limited only by your imagination and willingness to learn new things.

What Is AoE?

ATA over Ethernet is a network protocol registered with the IEEE as Ethernet protocol 0x88a2. AoE is low level, much simpler than TCP/IP or even IP. TCP/IP and IP are necessary for the reliable transmission of data over the Internet, but the computer has to work harder to handle the complexity they introduce.

Users of iSCSI have noticed this issue with TCP/IP. iSCSI is a way to send I/O over TCP/IP, so that inexpensive Ethernet equipment may be used instead of Fibre Channel equipment. Many iSCSI users have started buying TCP offload engines (TOE). These TOE cards are expensive, but they remove the burden of doing TCP/IP from the machines using iSCSI.

An interesting observation is that most of the time, iSCSI isn't actually used over the Internet. If the packets simply need to go to a machine in the rack next door, the heavyweight TCP/IP protocol seems like overkill.

So instead of offloading TCP/IP, why not dispense with it altogether? The ATA over Ethernet protocol does exactly that, taking advantage of today's smart Ethernet switches. A modern switch has flow control, maximizing throughput and limiting packet collisions. On the local area network (LAN), packet order is preserved, and each packet is checksummed for integrity by the networking hardware.

Each AoE packet carries a command for an ATA drive or the response from the ATA drive. The AoE Linux kernel driver performs AoE and makes the remote disks available as normal block devices, such as /dev/etherd/e0.0—just as the IDE driver makes the local drive at the end of your IDE cable available as /dev/hda. The driver retransmits packets when necessary, so the AoE devices look like any other disks to the rest of the kernel.
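As a small illustration of the naming just described, AoE block devices appear under /dev/etherd as e<shelf>.<slot>. The sketch below is only that, a sketch: the helper names aoe_dev and aoe_list are made up for this example and are not part of the aoe driver or any AoE tool set.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Map a shelf and slot number to the device node name the driver uses,
# for example shelf 0, slot 0 -> /dev/etherd/e0.0.
aoe_dev() {
    printf '/dev/etherd/e%d.%d\n' "$1" "$2"
}

# List the AoE block devices (including any partitions) that are
# currently visible, one per line. Prints nothing if there are none.
aoe_list() {
    for dev in /dev/etherd/e[0-9]*.[0-9]*; do
        [ -b "$dev" ] && printf '%s\n' "$dev"
    done
    return 0
}
```

Assuming the driver is built as a module named aoe, running `modprobe aoe` as root and then calling `aoe_list` should show whatever AoE devices are reachable on the local Ethernet segment.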

In addition to ATA commands, AoE has a simple facility for identifying available AoE devices using query config packets. That's all there is to it: ATA command packets and query config packets.
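To make the two packet types a little more concrete, here is a rough sketch that prints, as a hex string, the Ethernet header plus the common AoE header of a broadcast query config request. The field layout (version and flags, error, shelf, slot, command, tag) and the command numbers are assumptions taken from the published AoE specification rather than from this article, and the function name is invented for the example. The query config payload that would follow the header is omitted.

```shell
#!/bin/sh
# Sketch only: field layout assumed from the AoE specification.
# Usage: aoe_query_config_hex <source-mac> [tag]
aoe_query_config_hex() {
    src_mac=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    tag=${2:-0}
    printf 'ffffffffffff'        # destination MAC: Ethernet broadcast
    printf '%s' "$src_mac"       # source MAC of the querying host
    printf '88a2'                # EtherType registered for AoE
    printf '10'                  # version 1 in the high nibble, flags 0
    printf '00'                  # error field, unused in a request
    printf 'ffff'                # shelf (major) address: all shelves
    printf 'ff'                  # slot (minor) address: all slots
    printf '01'                  # command 1 = query config, 0 = ATA
    printf '%08x\n' "$tag"       # tag, echoed back in the response
}
```

For example, `aoe_query_config_hex 00:11:22:33:44:55 1` prints `ffffffffffff00112233445588a21000ffffff0100000001`, which is 24 bytes: 14 for Ethernet and 10 for AoE. The point is how little there is between the Ethernet header and the ATA or config payload.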

Anyone who has worked with or learned about SANs likely wonders at this point, “If all the disks are on the LAN, then how can I limit access to the disks?” That is, how can I make sure that if machine A is compromised, machine B's disks remain safe?

The answer is that AoE is not routable. You easily can determine what computers see what disks by setting up ad hoc Ethernet networks. Because AoE devices don't have IP addresses, it is trivial to create isolated Ethernet networks. Simply power up a switch and start plugging in things. In addition, many switches these days have a port-based VLAN feature that allows a switch effectively to be partitioned into separate, isolated broadcast domains.

The AoE protocol is so lightweight that even inexpensive hardware can use it. At this time, Coraid is the only vendor of AoE hardware, but other hardware and software developers should be pleased to find that the AoE specification is only eight pages in length. This simplicity is in stark contrast to iSCSI, which is specified in hundreds of pages, including the specification of encryption features, routability, user-based access and more. Complexity comes at a price, and now we can choose whether we need the complexity or would prefer to avoid its cost.

Simple primitives can be powerful tools. It may not come as a surprise to Linux users to learn that even with the simplicity of AoE, a bewildering array of possibilities presents itself once the storage can reside on the network. Let's start with a concrete example and then discuss some of the possibilities.



Definitely very helpful

Ace Winget's picture

Definitely very helpful. I was kind of looking into doing this on Linux, and now I'm pretty positive that I can handle it.

distributed network raid configuration

pr0mjr's picture

there are redundant packets sent to the same shelf with mirrored disks as described in your post. this will saturate the switch ports unnecessarily.

consider distributing the raid 1 mirrors between two or more shelves as follows:

mdadm -C /dev/md1 -l 1 -n 2 \
/dev/etherd/e0.0 /dev/etherd/e1.0
mdadm -C /dev/md2 -l 1 -n 2 \
/dev/etherd/e0.1 /dev/etherd/e1.1
mdadm -C /dev/md3 -l 1 -n 2 \
/dev/etherd/e0.2 /dev/etherd/e1.2
mdadm -C /dev/md4 -l 1 -n 2 -x 2 \
/dev/etherd/e0.3 /dev/etherd/e1.3 \
/dev/etherd/e0.4 /dev/etherd/e1.4

then stripe those mirrors as previously suggested...

mdadm -C /dev/md0 -l 0 -n 4 \
/dev/md1 /dev/md2 /dev/md3 /dev/md4

considering the server may be bonded to gigabit ethernet uplinks in a round-robin or similar configuration, the switch will saturate each of the fast ethernet ports dedicated to the shelves before saturating the server uplinks.

the other advantage to a distributed raid mirror occurs when a single shelf fails. all of the drives are mirrored on another shelf, therefore it's business as usual for the server.

with the improvements mentioned above you get both improved throughput during reading and writing, as well as a more robust system that continues to run despite multiple disk or single shelf failures.

cheers! ;-)
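The distributed layout in the comment above can be sketched as a dry-run generator. This is only an illustration: the function name and its arguments are made up, it prints the mdadm commands rather than running them, and it leaves out the spare drives (-x 2) used in the last mirror above.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) mdadm commands that mirror slot N of
# one shelf with slot N of another, then stripe the mirrors together.
# Usage: print_distributed_raid10 <shelf_a> <shelf_b> <number_of_slots>
print_distributed_raid10() {
    shelf_a=$1
    shelf_b=$2
    slots=$3
    mirrors=""
    slot=0
    while [ "$slot" -lt "$slots" ]; do
        md=$((slot + 1))
        # One RAID 1 pair per slot, with each half on a different shelf.
        echo "mdadm -C /dev/md${md} -l 1 -n 2 /dev/etherd/e${shelf_a}.${slot} /dev/etherd/e${shelf_b}.${slot}"
        mirrors="${mirrors} /dev/md${md}"
        slot=$((slot + 1))
    done
    # RAID 0 stripe across all of the mirrors.
    echo "mdadm -C /dev/md0 -l 0 -n ${slots}${mirrors}"
}
```

Calling `print_distributed_raid10 0 1 4` prints four mirror commands for e0.0/e1.0 through e0.3/e1.3, followed by the stripe over /dev/md1 through /dev/md4. Review the output before running any of it, because mdadm -C overwrites the RAID metadata on the member devices.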

It works in my lab!

Davester's picture

That doesn't make it enterprise.

Out of order packets aren't the silent killer here.
Faulty checksum hardware will silently allow corruption of your data.
Cheap NICs can kill your data, silently and thoroughly.
Fsck early. Fsck often.

References ---

google: tcp checksum hardware error

of particular note:

To quote: "Even so, the highly non-random distribution of errors strongly suggests some applications should employ application-level checksums or equivalents."

I guess the Coraid folks don't have google.

"I guess the Coraid folks

Ziggy Stardust's picture

"I guess the Coraid folks don't have Google." ?? I guess Davester can't read.

What's the relevance of TCP checksum/CRC issues when this is all done at layer 2 and TCP isn't even involved? Here - let me answer that for you: NONE.

As noted in the article, avoiding TCP also avoids a lot of other issues. This is a layer 2 (Ethernet) solution. No TCP. No UDP. No IP. That's the lovely simplicity of this solution.

maybe zfs is the answer

Anonymous's picture

zfs does checksums for all blocks so there won't be any silent corruptions

Single write multiple read

Al's picture

"Given linux software RAID is not cluster-aware you cannot share the array between multiple AoE clients".

I presume this is only in the case with multiple writing clients?

Is it therefore possible to have a single write client but any number of read clients accessing the array via AoE ? Are there any examples/users doing this ?

great article by the way..


Nope, you need a cluster

Anonymous's picture

Nope, you need a cluster aware FS like GFS or CXFS even for 1 writer and multiple readers.

Packet-ordering dependent?

Anonymous's picture


Aren't you hosed if the switch decides to deliver frames out-of-order? Is there anything in the protocol that dictates ordering at the frame level?


Packet-ordering dependent?

eclectic's picture

It is a requirement / feature of Ethernet / IEEE 802.3 that packets are not re-ordered. It is also a requirement that packets that are delivered are error free within the capabilities of the 32 bit checksum. It is not a requirement (of the connectionless mode of operation) that all packets are delivered so there must be a retransmission / error correction mechanism.

Out of order

AlanCox's picture

There is a complicated answer to this but as an armwaving simple case the answer is "no". The Linux block layer will not issue an overlapping write to a device until the previous write covering that sector has completed. In fact usually it'll merge them together.

Don't know, but I don't

Anonymous's picture

Don't know, but I don't think it's a big problem... recent SATA drives have Native Command Queuing, which reorders the commands in its buffer to increase performance.

Cluster-aware RAID

Anonymous's picture

Nice article. I am a relative newcomer in the field of storage. Could you please explain what you meant by the term cluster-aware RAID? Is there currently any implementation of it?

cluster aware RAID

yacc's picture

A cluster aware RAID would be a block device driver that cooperates while writing to the RAID/rebuilding the RAID with other hosts.


Good article

Andrew's picture

Over a year later this article is still relevant and informative. Thanks.

software used in read/write tests

Anonymous's picture

What exactly did you use to perform the read/write tests? If it's just a simple shell script, would you mind pasting it here? I'd assume you used hdparm -Tt, except IIRC this doesn't do any write tests.

Great article!

Adam Monsen

Could someone describe the di

Adrian's picture

Could someone describe the differences between AoE and network block devices (nbd)? Thanks.

AoE and nbd

Anonymous's picture

AoE is a network protocol for ethernet. The aoe driver for Linux allows AoE storage devices (targets) to be usable as local block devices.

nbd is not a network protocol but a Linux feature. It's analogous to the aoe driver, not the AoE network protocol. Instead of AoE, it uses TCP over IP as the network protocol for transmitting information and data.

TCP is more complex than AoE. AoE can be implemented by low-cost hardware.

AoE is not a routable protocol, so for using remote storage devices over long, unreliable network links, nbd (using TCP) might be a nice choice. On the other hand, AoE is great for using nearby storage devices. Interestingly, AoE could be tunneled through other protocols (like TCP), or even encrypted sessions.

Sharing drives over AoE

dchouinard's picture

Is it possible to use an old box and share its drives over AoE? One could use some older machines and build a disk array for a more powerful machine.

Yes, there's an AoE target th

Anonymous's picture

Yes, there's an AoE target that runs in user space:

... with which you could export any file or block
device using ATA over Ethernet.

But for the application you're considering, it
sounds like PVFS is just the thing.

Each host has some storage, and all the hosts
communicate in order to share the storage efficiently
to create a large, fast filesystem.

AoE target as loadable module

Anonymous's picture

there is also an AoE target that runs in kernel space now:

unfortunately, it doesn't seem to be documented very well

vblade user mode

Elix's picture

vblade user mode implementation is slow (vblade 100% CPU Athlon64-3200 and gigabit ethernet). Client and server using ubuntu6 desktop

from server
single sata disk
hdparm -tT /dev/sda = 58MB/s

from client pc (p4 3GHz) hdparm -tT /dev/etherd/e0.0 = 50MB/s
from server
raid0 sata disk
hdparm -tT /dev/md0 = 115MB/s

from client pc (p4 3GHz) hdparm -tT /dev/etherd/e0.0 = !!!75MB/s!!!

Thanks for your benchmark

Art's picture

Thanks for your benchmark numbers!

Two nice facts are included in your posting:
1) It seems that there is an overhead of 8 MB/s for ATA over Ethernet (in your first benchmark, 58 MB/s locally versus 50 MB/s over AoE).
2) You are hitting the throughput limit of your gigabit Ethernet NIC here (75 MB/s is a very good value [when I benchmarked three GbE NICs some time ago, the fastest NIC was an Intel PRO with about 78 MB/s max throughput]).

I'm wondering if channel bonding would help here!? With two GbE NICs per machine, the throughput should again be at ~150 MB/s (or 115 MB/s for your disks). Sadly, this concept won't scale very well :(



Evan's picture

I use bonding in my clusters. I would suspect that the bonding will not give you much of a benchmark increase, but should provide a more constant, higher access rate under heavy workloads (lots of users, heavy video editing, web serving) than a single NIC.

holy comment spam, batman.

Anonymous's picture

holy comment spam, batman. Why aren't you guys at least using CAPTCHA?
