Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the LAN
The following example is based on a true story. Stan is a fictional sysadmin working for the state government. New state legislation requires that all official documents be archived permanently. Any state resident can demand to see any official document at any time. Stan therefore needs a huge storage capacity that can grow without bounds. The performance of the storage needn't be any better than a local ATA disk, though. He wants all of the data to be retrievable easily and immediately.
Stan is comfortable with Ethernet networking and Linux system administration, so he decides to try ATA over Ethernet. He buys some equipment, paying a bit less than $6,500 US for all of the following:
One dual-port gigabit Ethernet card to replace the old 100Mb card in his server.
One 26-port network switch with two gigabit ports.
One Coraid EtherDrive shelf and ten EtherDrive blades.
Ten 400GB ATA drives.
The shelf of ten blades takes up three rack units. Each EtherDrive blade is a small computer that performs the AoE protocol to effectively put one ATA disk on the LAN. Striping data over the ten blades in the shelf results in about the throughput of a local ATA drive, so the gigabit link helps to use the throughput effectively. Although he could have put the EtherDrive blades on the same network as everyone else, he has decided to put the storage on its own network, connected to the server's second network interface, eth1, for security and performance.
Stan reads the Linux Software RAID HOWTO (see the on-line Resources) and decides to use a RAID 10—striping over mirrored pairs—configuration. Although this configuration doesn't result in as much usable capacity as a RAID 5 configuration, RAID 10 maximizes reliability, minimizes the CPU cost of performing RAID and has a shorter array re-initialization time if one disk should fail.
After reading the LVM HOWTO (see Resources), Stan comes up with a plan to avoid ever running out of disk space. JFS is a filesystem that can grow dynamically to large sizes, so he is going to put a JFS filesystem on a logical volume. The logical volume resides, for now, on only one physical volume. That physical volume is the RAID 10 block device. The RAID 10 is created from the EtherDrive storage blades in the Coraid shelf using Linux software RAID. Later, he can buy another full shelf, create another RAID 10, make it into a physical volume and use the new physical volume to extend the logical volume where his JFS lives.
Listing 1 shows the commands Stan uses to prepare his server for doing ATA over Ethernet. He builds the AoE driver with AOE_PARTITIONS=1, because he's using a Debian sarge system running a 2.6 kernel. Sarge doesn't support large minor device numbers yet (see the Minor Numbers sidebar), so he turns off disk partitioning support in order to be able to use more disks. Also, because of Debian bug 292070, Stan installs the latest device mapper and LVM2 userland software.
Listing 1. The first step in building a software RAID device from several AoE drives is setting up AoE.
# setting up the host for AoE # build and install the AoE driver tar xvfz aoe-2.6-5.tar.gz cd aoe-2.6-5 make AOE_PARTITIONS=1 install # AoE needs no IP addresses! :) ifconfig eth1 up # let the network interface come up sleep 5 # load the ATA over Ethernet driver modprobe aoe # see what aoe disks are available aoe-stat
Minor Device Numbers
A program that wants to use a device typically does so by opening a special file corresponding to that device. A familiar example is the /dev/hda file. An ls -l command shows two numbers for /dev/hda, 3 and 0. The major number is 3 and the minor number is 0. The /dev/hda1 file has a minor number of 1, and the major number is still 3.
Until kernel 2.6, the minor number was eight bits in size, limiting the possible minor numbers to 0 through 255. Nobody had that many devices, so the limitation didn't matter. Now that disks have been decoupled from servers, it does matter, and kernel 2.6 uses 20 bits for the minor device number.
Having 1,048,576 values for the minor number is a big help to systems that use many devices, but not all software has caught up. If glibc or a specific application still thinks of minor numbers as eight bits in size, you are going to have trouble using minor device numbers over 255.
To help during this transitional period, the AoE driver may be compiled without support for partitions. That way, instead of there being 16 minor numbers per disk, there's only one per disk. So even on systems that haven't caught up to the large minor device numbers of 2.6, you still can use up to 256 AoE disks.
The commands for creating the filesystem and its logical volume are shown in Listing 2. Stan decides to name the volume group ben and the logical volume franklin. LVM2 now needs a couple of tweaks made to its configuration. For one, it needs a line with types = [ "aoe", 16 ] so that LVM recognizes AoE disks. Next, it needs md_component_detection = 1, so the disks inside RAID 10 are ignored when the whole RAID 10 becomes a physical volume.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
| Introduction to MapReduce with Hadoop on Linux | Jun 05, 2013 |
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Linux Systems Administrator
- Validate an E-Mail Address with PHP, the Right Way
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Introduction to MapReduce with Hadoop on Linux
- RSS Feeds
- Bought photoshop CS5 for developing a website :(
2 hours 45 min ago - What the author describes
4 hours 11 min ago - Reply to comment | Linux Journal
8 hours 21 min ago - Reply to comment | Linux Journal
9 hours 6 min ago - Didn't read
9 hours 17 min ago - Reply to comment | Linux Journal
9 hours 22 min ago - Poul-Henning Kamp: welcome to
11 hours 32 min ago - This has already been done
11 hours 33 min ago - Reply to comment | Linux Journal
12 hours 18 min ago - Welcome to 1998
13 hours 7 min ago
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




Comments
Definitely very help full was
Definitely very help full was kind of looking into doing this on linux and now I'm pretty positive that I can handle it.
distributed network raid configuration
there are redundant packets sent to the same shelf with mirrored disks as described in your post. this will saturate the switch ports unnecessarily.
consider distributing the raid 1 mirrors between two or more shelves as follows:
mdadm -C /dev/md1 -l 1 -n 2 \
/dev/etherd/e0.0 /dev/etherd/e1.0
mdadm -C /dev/md2 -l 1 -n 2 \
/dev/etherd/e0.1 /dev/etherd/e1.1
mdadm -C /dev/md3 -l 1 -n 2 \
/dev/etherd/e0.2 /dev/etherd/e1.2
mdadm -C /dev/md4 -l 1 -n 2 -x 2 \
/dev/etherd/e0.3 /dev/etherd/e1.3 \
/dev/etherd/e0.4 /dev/etherd/e1.4
then stripe those mirrors as previously suggested...
mdadm -C /dev/md0 -l 0 -n 4 \
/dev/md1 /dev/md2 /dev/md3 /dev/md4
considering the server may be bonded to gigabit ethernet uplinks in a round-robin or similar configuration, the switch will saturate each of the fast ethernet ports dedicated the shelves before saturating the server uplinks.
the other advantage to a distributed raid mirror occurs when a single shelf fails. all of the drives are mirrored on another shelf, therefore it's business as usual for the server.
with the improvements mentioned above you get both improved throughput during reading and writing, as well as a more robust system that continues to run despite multiple disk or single shelf failures.
cheers! ;-)
It works in my lab!
That doesn't make it enterprise.
Out of order packets aren't the silent killer here.
faulty checksum hardware will silenty allow corruption of your data.
Cheap NICs can kill your data, silently and thoroughly.
Fsck early. Fsck often.
References ---
google: tcp checksum hardware error
of particular note:
http://portal.acm.org/citation.cfm?doid=347059.347561
To quote: "Even so, the highly non-random distribution of errors strongly suggests some applications should employ application-level checksums or equivalents."
I guess the Coraid folks don't have google.
"I guess the Coraid folks
"I guess the Coraid folks don't have Google." ?? I guess Davester can't read.
What's the relevance of TCP checksum/CRC issues when this is all done at layer 2 and TCP isn't even involved? Here - let me answer that for you: NONE.
As noted in the article, avoiding TCP also avoids a lot of other issues. This is a layer 2 (Ethernet) solution. No TCP. No UDP. No IP. That's the lovely simplicity of this solution.
maybe zfs is the answer
zfs does checksums for all blocks so there won't be any silent corruptions
Single write multiple read
"Given linux software RAID is not cluster-aware you cannot share the array between multiple AoE clients".
I presume this is only in the case with multiple writing clients?
Is it therefore possible to have a single write client but any number of read clients accessing the array via AoE ? Are there any examples/users doing this ?
great article by the way..
regards
Al
Nope, you need a cluster
Nope, you need a cluster aware FS like GFS or CXFS even for 1 writer and multiple readers.
Packet-ordering dependent?
Hi,
Aren't you hosed if the switch decides to deliver frames out-of-order? Is there anything in the protocol that dictates ordering at the frame level?
Thanks,
--S
Packet-ordering dependent?
It is a requirement / feature of Ethernet / IEEE 802.3 that packets are not re-ordered. It is also a requirement that packets that are delivered are error free within the capabilities of the 32 bit checksum. It is not a requirement (of the connectionless mode of operation) that all packets are delivered so there must be a retransmission / error correction mechanism.
Out of order
There is a complicated answer to this but as an armwaving simple case the answer is "no". The Linux block layer will not issue an overlapping write to a device until the previous write covering that sector has completed. In fact usually it'll merge them together.
Don't know, but I don't
Don't know, but I don't think it's a big problem... recent SATA drives have Native command queueing; which reorders the commands in it's buffer to increase performance.
Cluster-aware RAID
Nice article. Am a relative new comer in the field of storage. Could you please explain what you meant by the term cluster-aware RAID? Is there currently any implementation of it?
cluster aware RAID
A cluster aware RAID would be a block device driver that cooperates while writing to the RAID/rebuilding the RAID with other hosts.
Andreas
Good article
Over a year later this article is still relevant and informative. Thanks.
software used in read/write tests
What exactly did you use to perform the read/write tests? If it's just a simple shell script, would you mind pasting it here? I'd assume you used hdparm -Tt, except IIRC this doesn't do any write tests.
Great article!
--
Adam Monsen
Could someone describe the di
Could someone describe the differences between AoE and Netblock Devices(nbd) Thanks.
AoE and nbd
AoE is a network protocol for ethernet. The aoe driver for Linux allows AoE storage devices (targets) to be usable as local block devices.
nbd is not a network protocol but a Linux feature. It's analogous to the aoe driver, not the AoE network protocol. Instead of AoE, it uses TCP over IP as the network protocol for transmitting information and data.
TCP is more complex than AoE. AoE can be implemented by low-cost hardware.
AoE is not a routable protocol, so for using remote storage devices over long, unreliable network links, nbd (using TCP) might be a nice choice. On the other hand, AoE is great for using nearby storage devices. Interestingly, AoE could be tunneled through other protocols (like TCP), or even encrypted sessions.
Sharing drives over AoE
Is it possible to use an old box and share his drive over AoE? One could use some older machines and build a disk array for a more powerfull machine.
Yes, there's an AoE target th
Yes, there's an AoE target that runs in user space:
http://freshmeat.net/projects/vblade
... with which you could export any file or block
device using ATA over Ethernet.
But for the application you're considering, it
sounds like PVFS is just the thing.
http://www.parl.clemson.edu/pvfs/
Each host has some storage, and all the hosts
communicate in order to share the storage efficiently
to create a large, fast filesystem.
AoE target as loadable module
there is also an AoE target that runs in kernel space now:
http://lpk.com.price.ru/~lelik/AoE/
unfortunately, it doesn`t seem to be documented very well
vblade user mode
vblade user mode implementation is slow (vblade 100% CPU Athlon64-3200 and gigabit ethernet). Client and server using ubuntu6 desktop
from server
single sata disk
hdparm -tT /dev/sda = 58MB/s
from client pc (p4 3GHz) hdparm -tT /dev/etherd/e0.0 = 50MB/s
-----------------
from server
raid0 sata disk
hdparm -tT /dev/md0 = 115MB/s
from client pc (p4 3GHz) hdparm -tT /dev/etherd/e0.0 = !!!75MB/s!!!
-----------------
Thanks for your benchmark
Thanks for your benchmark numbers!
2 nice facts are included in your posting:
1) It seems that there is a waste of 8 MB/s for ATA over ethernet (for your first benchmark)
2) You are hitting the troughput limit of your gigabit ethernet NIC here (75 MB/s is a very good value [when I benchmarked sometime ago 3 GbE NICs, the fastest NIC was an IntelPro with about 78 MB/s max throughput])
I'm wondering if channelbonding would help here !? With 2 GbE NICs per machine the troughput should be again at ~150 MB/s (or 115 MB/s for your disks). Sadly this concept won't scale very well :(
See:
http://www.howtoforge.com/nic_bonding
Bonding
I use bonding in my clusters. I would suspect that the bonding will not give you much of a benchmark increase, but should provide a more constant higher access rate under heavy work loads( lots of users, heavy video editing, webservering) than a single nic.
holy comment spam, batman.
holy comment spam, batman. Why aren't you guys at least using capcha?