Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the LAN
Listing 2. Setting Up the Software RAID and the LVM Volume Group
# speed up RAID initialization
for f in `find /proc | grep speed`; do
echo 100000 > $f
done
# create mirrors (mdadm will manage hot spares)
mdadm -C /dev/md1 -l 1 -n 2 \
/dev/etherd/e0.0 /dev/etherd/e0.1
mdadm -C /dev/md2 -l 1 -n 2 \
/dev/etherd/e0.2 /dev/etherd/e0.3
mdadm -C /dev/md3 -l 1 -n 2 \
/dev/etherd/e0.4 /dev/etherd/e0.5
mdadm -C /dev/md4 -l 1 -n 2 -x 2 \
/dev/etherd/e0.6 /dev/etherd/e0.7 \
/dev/etherd/e0.8 /dev/etherd/e0.9
sleep 1
# create the stripe over the mirrors
mdadm -C /dev/md0 -l 0 -n 4 \
/dev/md1 /dev/md2 /dev/md3 /dev/md4
# make the RAID 10 into an LVM physical volume
pvcreate /dev/md0
# create an extendible LVM volume group
vgcreate ben /dev/md0
# look at how many "physical extents" there are
vgdisplay ben | grep -i 'free.*PE'
# create a logical volume using all the space
lvcreate --extents 88349 --name franklin ben
modprobe jfs
mkfs -t jfs /dev/ben/franklin
mkdir /bf
mount /dev/ben/franklin /bf
I duplicated Stan's setup on a Debian sarge system with two 2.1GHz Athlon MP processors and 1GB of RAM, using an Intel PRO/1000 MT Dual-Port NIC and puny 40GB drives. The network switch was a Netgear FS526T. With a RAID 10 across eight of the EtherDrive blades in the Coraid shelf, I saw a sustainable read throughput of 23.58MB/s and a write throughput of 17.45MB/s. Each measurement was taken after flushing the page cache by copying a 1GB file to /dev/null, and a sync command was included in the write times.
The RAID 10 in this case has four stripe elements, each one a mirrored pair of drives. In general, you can estimate the throughput of a collection of EtherDrive blades easily by considering how many stripe elements there are. For RAID 10, there are half as many stripe elements as disks, because each disk is mirrored on another disk. For RAID 5, there effectively is one disk dedicated to parity data, leaving the rest of the disks as stripe elements.
The expected read throughput is the number of stripe elements times 6MB/s. That means if Stan bought two shelves initially and constructed an 18-blade RAID 10 instead of his 8-blade RAID 10, he would expect to get a little more than twice the throughput. Stan doesn't need that much throughput, though, and he wanted to start small, with a 1.6TB filesystem.
Listing 3 shows how Stan easily can expand the filesystem when he buys another shelf. The listings don't show Stan's mdadm-aoe.conf file or his startup and shutdown scripts. The mdadm configuration file tells an mdadm process running in monitor mode how to manage the hot spares, so that they're ready to replace any failed disk in any mirror. See spare groups in the mdadm man page.
Listing 3. To expand the filesystem without unmounting it, set up a second RAID 10 array, add it to the volume group and then increase the filesystem.
# after setting up a RAID 10 for the second shelf # as /dev/md5, add it to the volume group vgextend ben /dev/md5 vgdisplay ben | grep -i 'free.*PE' # grow the logical volume and then the jfs lvextend --extents +88349 /dev/ben/franklin mount -o remount,resize /bf
The startup and shutdown scripts are easy to create. The startup script simply assembles each mirrored pair RAID 1, assembles each RAID 0 and starts an mdadm monitor process. The shutdown script stops the mdadm monitor, stops the RAID 0s and, finally, stops the mirrors.
Now that we've seen a concrete example of ATA over Ethernet in action, readers might be wondering what would happen if another host had access to the storage network. Could that second host mount the JFS filesystem and access the same data? The short answer is, “Not safely!” JFS, like ext3 and most filesystems, is designed to be used by a single host. For these single-host filesystems, filesystem corruption can result when multiple hosts mount the same block storage device. The reason is the buffer cache, which is unified with the page cache in 2.6 kernels.
Linux aggressively caches filesystem data in RAM whenever possible in order to avoid using the slower block storage, gaining a significant performance boost. You've seen this caching in action if you've ever run a find command twice on the same directory.
Some filesystems are designed to be used by multiple hosts. Cluster filesystems, as they are called, have some way of making sure that the caches on all of the hosts stay in sync with the underlying filesystem. GFS is a great open-source example. GFS uses cluster management software to keep track of who is in the group of hosts accessing the filesystem. It uses locking to make sure that the different hosts cooperate when accessing the filesystem.
By using a cluster filesystem such as GFS, it is possible for multiple hosts on the Ethernet network to access the same block storage using ATA over Ethernet. There's no need for anything like an NFS server, because each host accesses the storage directly, distributing the I/O nicely. But there's a snag. Any time you're using a lot of disks, you're increasing the chances that one of the disks will fail. Usually you use RAID to take care of this issue by introducing some redundancy. Unfortunately, Linux software RAID is not cluster-aware. That means each host on the network cannot do RAID 10 using mdadm and have things simply work out.
Cluster software for Linux is developing at a furious pace. I believe we'll see good cluster-aware RAID within a year or two. Until then, there are a few options for clusters using AoE for shared block storage. The basic idea is to centralize the RAID functionality. You could buy a Coraid RAIDblade or two and have the cluster nodes access the storage exported by them. The RAIDblades can manage all the EtherDrive blades behind them. Or, if you're feeling adventurous, you also could do it yourself by using a Linux host that does software RAID and exports the resulting disk-failure-proofed block storage itself, by way of ATA over Ethernet. Check out the vblade program (see Resources) for an example of software that exports any storage using ATA over Ethernet.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- A Topic for Discussion - Open Source Feature-Richness?
- Drupal Is a Framework: Why Everyone Needs to Understand This
- RSS Feeds
- Validate an E-Mail Address with PHP, the Right Way
- Readers' Choice Awards
- Tech Tip: Really Simple HTTP Server with Python
- DynDNS
25 min 6 sec ago - Reply to comment | Linux Journal
57 min 28 sec ago - All the articles you talked
3 hours 21 min ago - All the articles you talked
3 hours 24 min ago - All the articles you talked
3 hours 25 min ago - myip
7 hours 50 min ago - Keeping track of IP address
9 hours 41 min ago - Roll your own dynamic dns
14 hours 54 min ago - Please correct the URL for Salt Stack's web site
18 hours 6 min ago - Android is Linux -- why no better inter-operation
20 hours 21 min ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




Comments
Definitely very help full was
Definitely very help full was kind of looking into doing this on linux and now I'm pretty positive that I can handle it.
distributed network raid configuration
there are redundant packets sent to the same shelf with mirrored disks as described in your post. this will saturate the switch ports unnecessarily.
consider distributing the raid 1 mirrors between two or more shelves as follows:
mdadm -C /dev/md1 -l 1 -n 2 \
/dev/etherd/e0.0 /dev/etherd/e1.0
mdadm -C /dev/md2 -l 1 -n 2 \
/dev/etherd/e0.1 /dev/etherd/e1.1
mdadm -C /dev/md3 -l 1 -n 2 \
/dev/etherd/e0.2 /dev/etherd/e1.2
mdadm -C /dev/md4 -l 1 -n 2 -x 2 \
/dev/etherd/e0.3 /dev/etherd/e1.3 \
/dev/etherd/e0.4 /dev/etherd/e1.4
then stripe those mirrors as previously suggested...
mdadm -C /dev/md0 -l 0 -n 4 \
/dev/md1 /dev/md2 /dev/md3 /dev/md4
considering the server may be bonded to gigabit ethernet uplinks in a round-robin or similar configuration, the switch will saturate each of the fast ethernet ports dedicated the shelves before saturating the server uplinks.
the other advantage to a distributed raid mirror occurs when a single shelf fails. all of the drives are mirrored on another shelf, therefore it's business as usual for the server.
with the improvements mentioned above you get both improved throughput during reading and writing, as well as a more robust system that continues to run despite multiple disk or single shelf failures.
cheers! ;-)
It works in my lab!
That doesn't make it enterprise.
Out of order packets aren't the silent killer here.
faulty checksum hardware will silenty allow corruption of your data.
Cheap NICs can kill your data, silently and thoroughly.
Fsck early. Fsck often.
References ---
google: tcp checksum hardware error
of particular note:
http://portal.acm.org/citation.cfm?doid=347059.347561
To quote: "Even so, the highly non-random distribution of errors strongly suggests some applications should employ application-level checksums or equivalents."
I guess the Coraid folks don't have google.
"I guess the Coraid folks
"I guess the Coraid folks don't have Google." ?? I guess Davester can't read.
What's the relevance of TCP checksum/CRC issues when this is all done at layer 2 and TCP isn't even involved? Here - let me answer that for you: NONE.
As noted in the article, avoiding TCP also avoids a lot of other issues. This is a layer 2 (Ethernet) solution. No TCP. No UDP. No IP. That's the lovely simplicity of this solution.
maybe zfs is the answer
zfs does checksums for all blocks so there won't be any silent corruptions
Single write multiple read
"Given linux software RAID is not cluster-aware you cannot share the array between multiple AoE clients".
I presume this is only in the case with multiple writing clients?
Is it therefore possible to have a single write client but any number of read clients accessing the array via AoE ? Are there any examples/users doing this ?
great article by the way..
regards
Al
Nope, you need a cluster
Nope, you need a cluster aware FS like GFS or CXFS even for 1 writer and multiple readers.
Packet-ordering dependent?
Hi,
Aren't you hosed if the switch decides to deliver frames out-of-order? Is there anything in the protocol that dictates ordering at the frame level?
Thanks,
--S
Packet-ordering dependent?
It is a requirement / feature of Ethernet / IEEE 802.3 that packets are not re-ordered. It is also a requirement that packets that are delivered are error free within the capabilities of the 32 bit checksum. It is not a requirement (of the connectionless mode of operation) that all packets are delivered so there must be a retransmission / error correction mechanism.
Out of order
There is a complicated answer to this but as an armwaving simple case the answer is "no". The Linux block layer will not issue an overlapping write to a device until the previous write covering that sector has completed. In fact usually it'll merge them together.
Don't know, but I don't
Don't know, but I don't think it's a big problem... recent SATA drives have Native command queueing; which reorders the commands in it's buffer to increase performance.
Cluster-aware RAID
Nice article. Am a relative new comer in the field of storage. Could you please explain what you meant by the term cluster-aware RAID? Is there currently any implementation of it?
cluster aware RAID
A cluster aware RAID would be a block device driver that cooperates while writing to the RAID/rebuilding the RAID with other hosts.
Andreas
Good article
Over a year later this article is still relevant and informative. Thanks.
software used in read/write tests
What exactly did you use to perform the read/write tests? If it's just a simple shell script, would you mind pasting it here? I'd assume you used hdparm -Tt, except IIRC this doesn't do any write tests.
Great article!
--
Adam Monsen
Could someone describe the di
Could someone describe the differences between AoE and Netblock Devices(nbd) Thanks.
AoE and nbd
AoE is a network protocol for ethernet. The aoe driver for Linux allows AoE storage devices (targets) to be usable as local block devices.
nbd is not a network protocol but a Linux feature. It's analogous to the aoe driver, not the AoE network protocol. Instead of AoE, it uses TCP over IP as the network protocol for transmitting information and data.
TCP is more complex than AoE. AoE can be implemented by low-cost hardware.
AoE is not a routable protocol, so for using remote storage devices over long, unreliable network links, nbd (using TCP) might be a nice choice. On the other hand, AoE is great for using nearby storage devices. Interestingly, AoE could be tunneled through other protocols (like TCP), or even encrypted sessions.
Sharing drives over AoE
Is it possible to use an old box and share his drive over AoE? One could use some older machines and build a disk array for a more powerfull machine.
Yes, there's an AoE target th
Yes, there's an AoE target that runs in user space:
http://freshmeat.net/projects/vblade
... with which you could export any file or block
device using ATA over Ethernet.
But for the application you're considering, it
sounds like PVFS is just the thing.
http://www.parl.clemson.edu/pvfs/
Each host has some storage, and all the hosts
communicate in order to share the storage efficiently
to create a large, fast filesystem.
AoE target as loadable module
there is also an AoE target that runs in kernel space now:
http://lpk.com.price.ru/~lelik/AoE/
unfortunately, it doesn`t seem to be documented very well
vblade user mode
vblade user mode implementation is slow (vblade 100% CPU Athlon64-3200 and gigabit ethernet). Client and server using ubuntu6 desktop
from server
single sata disk
hdparm -tT /dev/sda = 58MB/s
from client pc (p4 3GHz) hdparm -tT /dev/etherd/e0.0 = 50MB/s
-----------------
from server
raid0 sata disk
hdparm -tT /dev/md0 = 115MB/s
from client pc (p4 3GHz) hdparm -tT /dev/etherd/e0.0 = !!!75MB/s!!!
-----------------
Thanks for your benchmark
Thanks for your benchmark numbers!
2 nice facts are included in your posting:
1) It seems that there is a waste of 8 MB/s for ATA over ethernet (for your first benchmark)
2) You are hitting the troughput limit of your gigabit ethernet NIC here (75 MB/s is a very good value [when I benchmarked sometime ago 3 GbE NICs, the fastest NIC was an IntelPro with about 78 MB/s max throughput])
I'm wondering if channelbonding would help here !? With 2 GbE NICs per machine the troughput should be again at ~150 MB/s (or 115 MB/s for your disks). Sadly this concept won't scale very well :(
See:
http://www.howtoforge.com/nic_bonding
Bonding
I use bonding in my clusters. I would suspect that the bonding will not give you much of a benchmark increase, but should provide a more constant higher access rate under heavy work loads( lots of users, heavy video editing, webservering) than a single nic.
holy comment spam, batman.
holy comment spam, batman. Why aren't you guys at least using capcha?