Use AoE to Build Your Own SAN
When I first heard of the ATA over Ethernet (AoE) protocol, I got excited about the possibilities. Sending ATA commands directly over an Ethernet physical layer offers tremendous simplicity, flexibility and low overhead that potentially could result in astonishing performance. As the rumors of official acceptance into the Linux kernel grew louder, I waited with bated breath. I believed this would be game-changing for the storage market. When it made its way into the official kernel tree and 2.6.11 was released in early 2005 with built-in AoE support, it was all I could do not to stand up and cheer.
The obvious question is what possibly could warrant such excitement? We already had Fibre Channel and iSCSI. Why throw another technology in? Storage was almost a completely mature market, right? You might as well ask why people got excited about Voice over IP (VoIP) not long before that. The farsighted who understood the economics of open source could see where this was going to take us, and that place is disruptive. We were going to move the bar to a price point traditional vendors would find impossible to match, and it would all get turned on its head—just like VoIP. VoIP makes providing yourself with low-cost high-quality phone services essentially free if you are nerdy enough, and it allows you to become a telecom for small businesses (the business that I'm in). Similarly, AoE makes SAN available to anyone, allowing someone with some technical skill and a little capital to compete with the likes of EMC and IBM. You can, right now, literally build your own SAN with AoE initiator and target for less than $100 dollars. See the how-to in the Resources section to do just that.
Essentially, AoE is an open-standards-based protocol that allows direct network access to disk drives by client hosts. That means AoE allows you to let a disk or RAID array on one box be interacted with by the IDE/ATA/SATA driver on another box, using the ATA commands that SATA disks use natively to do it, only over an Ethernet network—and to do it very efficiently. You build an array on one box and export it as a block device that multiple clients can see and access. You can export a disk, an array, a single partition, a disk or array split up into multiple partitions, or even a loopback device that contains an encrypted block device to any number of clients. Drivers for targets and initiators (hosts and clients) exist for all major operating systems, and it is native in the mainstream Linux kernel now.
Now, imagine that you could build a box designed specifically to be an AoE target and export your arrays as block devices over your Ethernet network. Imagine that you optimized a kernel to do that and only that, put enough processor and memory in a shelf-style chassis running your optimized kernel and AoE target driver and found that by so doing, you could throw in commodity disks and blow away the performance of Fibre Channel SAN at a fraction of the cost. Imagine that it was simple enough to make complex redundant designs easy to build and manage. Now imagine someone already did all that. Now open your eyes and realize that it's all true. Coraid has been fine-tuning its product offerings and stands poised to revolutionize the price to performance and manageability of network storage. The latest results show at least a five-to-one price-to-performance advantage over other technologies.
Being a consummate Linux Nerd, I found that using command-line tools to configure things a comforting respite from convoluted GUI tools from other vendors. Originally, with Coraid's gear, you could use only SATA and SATA II disks, which, although more than adequate for the vast majority of applications, was admittedly a bit of a disadvantage over SCSI-based solutions that could take advantage of SCSI's higher RPM and throughput performance along with the higher MTBF and component reliability that comes from a SCSI disk's tighter manufacturing tolerances. Coraid was nice enough to loan me some demo gear to test its next generation of solutions.
I discovered that Coraid's latest SRX series of shelves allow you to use a wider variety of interfaces, including SSD and SAS with 6GB SAS on the horizon, which makes higher-end drives available for your architecture for those applications that require it. Using 16 15k rpm SAS drives in RAID 10, I was not able to devise a test that even challenged the demo gear I had performance-wise. To see what it can do aggregating four 10-gigabit or six 1-gigabit ports, check out independent lab ESG's report found in the Resources section. It's a touch mind-boggling. I also learned that Coraid is building a point-and-click GUI interface to sit on top of the CLI, but honestly, I didn't care as long as it doesn't take my geek-friendly command line from me.
It is worth stating that some security groups have attacked the relative simplicity of the AoE protocol and asserted this makes it “insecure”, because it has no strong authentication mechanisms and potentially could be hijacked or subjected to other hijinks with simple techniques like MAC address spoofing. The saving grace is that AoE is not routable, meaning that people would have to plug their malicious host physically, directly in to your Ethernet LAN segment in order to be a threat, so the security of your AoE architecture is entirely dependent on the physical security of your Ethernet switches. So, this will be important to keep in mind in terms of physical placement and access to your gear. AoE rides on a lower network layer than IP, and the IP layer is what makes TCP/IP and the Internet routable, but I strongly suspect that if I needed to export an AoE block device farther than I could reach with a strand of fiber and media converters, that I could work something out with GRE (Generic Route Encapsulation) and VPN, although this is not something I've tested. Also keep in mind that when your clients see the AoE block devices, nothing in the AoE protocol stops you from mounting your filesystems on multiple machines, but with most filesystems, this will have devastating effects on your data integrity. So, it is up to you to build in precautions against this.
AoE overlaps with Fibre Channel and iSCSI in many ways. Its main advantage over them lies in its simplicity. The AoE specification is 12 pages long, while iSCSI clocks in at 276 pages and Fibre Channel is no longer than even a single specification, but has multiple versions in multiple parts, each hundreds of pages long with the newest releases breaking all backward-compatibility. If you want to perform an exercise in brain-bending masochism, attempt to build a Fibre Channel SAN with a mere five clients, no single point of failure and automated failover between all components. Not only is there a good chance you will go bankrupt and still be unable to do it, the complexity will make it virtually impossible to scale and a nightmare you can't wake from to manage. iSCSI would make that task somewhat easier and less expensive, but you would find performance suffers as the overhead of TCP/IP header processing drags down your solution. For finding the sweet spot of manageability, flexibility and performance, you won't beat AoE gear—that's before you even take affordability into account.
To demonstrate that, I'm going to take you through how I built a “never go down” solution for our environment using Coraid's older SR shelves and CLN21 failover kit. Coraid is discontinuing the exact front-end gear that I bought, but building the equivalent is not terribly difficult and how-tos already exist. I don't intend simply to re-create the how-tos, such as the ones listed in the Resources section here, so I will just hit the highlights to demonstrate some of the considerations and how easy this build really was.
The design goals I needed to meet were as follows. The item of first priority was high availability of customer media, data and application information. There should be no single point of failure, and failover should be automatic in all failure cases. Performance must allow for voice mail to be recorded from a large number of phone calls and many concurrent customers to access and play that voice mail from many application servers simultaneously. In addition, I would store application data on the SAN as well. Given expected growth over several years, meeting the availability requirements was going to be more difficult than meeting the performance requirements.
After considering and pricing several options, I decided to go with two of Coraid's 16-disk shelves and its failover kit that comes with two Debian-based servers for NFS/CIFS gateways and a STONITH- (Shoot The Other Node In The Head) ready power supply. After deciding on this, the next decisions were what disks to use and what Ethernet gear to use. After a lot of consideration, I went with the Western Digital RE3 line for the performance and relatively high MTBF for SATA II gear. I chose 500 gigabyte disks for their cost/gigabyte (at the time I bought them) and ease of acquisition and availability. I started out with ten disks per shelf configured in RAID 10 arrays. These arrays have been going for about two years now, and so far, I've lost only one of the original disks. I also discovered how easy and uneventful it is to replace disks on these shelves.
The other big consideration is that you definitely want to make sure your client NIC and switches do jumbo Ethernet frames, as the ability to aggregate ATA commands and data blocks on return will do wonders for your performance, so check out the list of jumbo-capable gear in the Resources section. I wound up buying four older Cisco Catalysts running IOS that I was able to secure inexpensively and run with a 9K MTU, which gives me two to run in production with two spares on standby.
The configuration on the gateway servers combines the two arrays, one RAID 10 array on each shelf, into a RAID 1 array. There is a performance penalty for this, as every write must be duplicated to both devices, and this doubles the overhead, but availability was the paramount concern here. Either component can fail, and your array on the gateway will degrade but continue to run until you can repair the failed shelf and/or disks. The two gateway servers are connected via a heartbeat over serial cable, so in the event of the failure of the primary, within seconds, the backup comes on-line, mounts the AoE block devices into the RAID 1, and after a pause, clients keep working as if nothing has happened. Exactly how to set this up is detailed in the Resources section.
I tested this in a variety of scenarios, and the biggest snag I hit was when each device was single-homed to one switch. In this scenario, one gateway and one shelf are on one switch, and the other server and the other shelf are on the other switch with a trunk port between switches. The problem occurs if you lose either switch. This causes things to hang indefinitely unless you do some manual tuning to the arpping utility and scripts used for failover. I had been a bit afraid of that, so I had to figure out how to dual-home all the components and make sure I had multiple uplinks between switches.
The uplinks were easy enough, just plug in two cables, tune spanning tree, and let STP figure it out. Of course, spanning tree doesn't “fail soft”, so you can use a redundant trunk in IOS or something of that sort as well. Once I'd gotten that out of the way, I turned my attention to what was sure to be the most difficult part of this setup. Having tried to do multiport configurations with iSCSI and Fibre Channel, I was really dreading setting up dual homing with AoE. Here's my harrowing tale of getting it to work.
First, plug an additional port from each shelf in to each switch and turn the ports on in the configuration. Second, plug a second port on the gateways in to the AoE switches, tune its MTU, and turn it on in the configuration. Next comes the hard part. Wipe the sweat from your brow, call it a day, and go brag about how you mastered multipath AoE. Yes, it is that easy. Coraid's gear and the AoE protocol automatically discover devices and paths to them using a query-packet mechanism that makes the setup brain-dead simple, using a round-robin approach to sending packets when it finds multiple paths between the target and the client. Once you have this in place, you could lose one of each of your components: gateway, links, switches and shelves, and continue to run.
Configuring a LUN and exporting it is super simple and covered in multiple how-tos that can be found in multiple places. You log in to the shelves from a device running AoE on the same Ethernet segment with a utility called cec (Coraid Ethernet console) and issue a sequence of commands (the example below is for a different set of drives than those mentioned above):
SR shelf 1> show -l 1.0 500GB up 1.1 500GB up 1.2 500GB up 1.3 500GB up 1.4 500GB up 1.5 500GB up 1.6 500GB up 1.7 500GB up 1.8 500GB up 1.9 500GB up 1.10 500GB up 1.11 500GB up 1.12 500GB up 1.13 500GB up 1.14 500GB up 1.15 500GB up SR shelf 1> make 0 raid10 1.0-15 SR shelf 1> list -l 0 4000GB offline 0.0 4000GB raid10 normal 0.0.0 normal ... SR shelf 1> online 0 SR shelf 1> online 0 4000GB online
Of course, you could create any number of LUNs here using any combination of RAID 0,1,5,6,10 JBOD or any other supported RAID type.
To use your new LUN on a connected server, it is as simple as:
client:/# aoe-discover client:/# aoe-stat e1.0 4000GB eth0 up client:/# mkfs.ext4 /dev/etherd/e1.0 client:/# mount /dev/etherd/e1.0 /mnt/aoe
It really is just as easy as that. I hope you can see the flexibility and the power this approach affords—simple management for complex architectures. I have validated this architecture and this methodology by using it in production for the last two years. There is so much more that can be done, and so much more I plan to do. If you remember my last article regarding VirtualBox in the October 2010 issue of LJ, you know that my next project is to move our virtual machine images onto AoE back ends to complete the requirements to allow me to be able to teleport running virtual machines between virtual hosts in our production environment. My next project after that will be to experiment with AoE and GFS (global filesystem) to eliminate the gateway server and give multiple servers access to the same LUNs at the same time. Should be fun!
-- I was cloud before cloud was cool. Not in the sense of being an amorphous collection of loosely related molecules with indeterminate borders -- or maybe I am. Holla @geek_king, http://twitter.com/geek_king
- The Ubuntu Conspiracy
- A First Look at IBM's New Linux Servers
- Vigilante Malware
- Disney's Linux Light Bulbs (Not a "Luxo Jr." Reboot)
- Vagrant Simplified
- Libreboot on an X60, Part I: the Setup
- System Status as SMS Text Messages
- Dealing with Boundary Issues
- Bluetooth Hacks
- Non-Linux FOSS: Code Your Way To Victory!