Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the LAN
The following example is based on a true story. Stan is a fictional sysadmin working for the state government. New state legislation requires that all official documents be archived permanently. Any state resident can demand to see any official document at any time. Stan therefore needs a huge storage capacity that can grow without bounds. The performance of the storage needn't be any better than a local ATA disk, though. He wants all of the data to be retrievable easily and immediately.
Stan is comfortable with Ethernet networking and Linux system administration, so he decides to try ATA over Ethernet. He buys some equipment, paying a bit less than $6,500 US for all of the following:
One dual-port gigabit Ethernet card to replace the old 100Mb card in his server.
One 26-port network switch with two gigabit ports.
One Coraid EtherDrive shelf and ten EtherDrive blades.
Ten 400GB ATA drives.
The shelf of ten blades takes up three rack units. Each EtherDrive blade is a small computer that performs the AoE protocol to effectively put one ATA disk on the LAN. Striping data over the ten blades in the shelf results in about the throughput of a local ATA drive, so the gigabit link helps to use the throughput effectively. Although he could have put the EtherDrive blades on the same network as everyone else, he has decided to put the storage on its own network, connected to the server's second network interface, eth1, for security and performance.
Stan reads the Linux Software RAID HOWTO (see the on-line Resources) and decides to use a RAID 10—striping over mirrored pairs—configuration. Although this configuration doesn't result in as much usable capacity as a RAID 5 configuration, RAID 10 maximizes reliability, minimizes the CPU cost of performing RAID and has a shorter array re-initialization time if one disk should fail.
After reading the LVM HOWTO (see Resources), Stan comes up with a plan to avoid ever running out of disk space. JFS is a filesystem that can grow dynamically to large sizes, so he is going to put a JFS filesystem on a logical volume. The logical volume resides, for now, on only one physical volume. That physical volume is the RAID 10 block device. The RAID 10 is created from the EtherDrive storage blades in the Coraid shelf using Linux software RAID. Later, he can buy another full shelf, create another RAID 10, make it into a physical volume and use the new physical volume to extend the logical volume where his JFS lives.
Listing 1 shows the commands Stan uses to prepare his server for doing ATA over Ethernet. He builds the AoE driver with AOE_PARTITIONS=1, because he's using a Debian sarge system running a 2.6 kernel. Sarge doesn't support large minor device numbers yet (see the Minor Numbers sidebar), so he turns off disk partitioning support in order to be able to use more disks. Also, because of Debian bug 292070, Stan installs the latest device mapper and LVM2 userland software.
Listing 1. The first step in building a software RAID device from several AoE drives is setting up AoE.
# setting up the host for AoE # build and install the AoE driver tar xvfz aoe-2.6-5.tar.gz cd aoe-2.6-5 make AOE_PARTITIONS=1 install # AoE needs no IP addresses! :) ifconfig eth1 up # let the network interface come up sleep 5 # load the ATA over Ethernet driver modprobe aoe # see what aoe disks are available aoe-stat
Minor Device Numbers
A program that wants to use a device typically does so by opening a special file corresponding to that device. A familiar example is the /dev/hda file. An ls -l command shows two numbers for /dev/hda, 3 and 0. The major number is 3 and the minor number is 0. The /dev/hda1 file has a minor number of 1, and the major number is still 3.
Until kernel 2.6, the minor number was eight bits in size, limiting the possible minor numbers to 0 through 255. Nobody had that many devices, so the limitation didn't matter. Now that disks have been decoupled from servers, it does matter, and kernel 2.6 uses 20 bits for the minor device number.
Having 1,048,576 values for the minor number is a big help to systems that use many devices, but not all software has caught up. If glibc or a specific application still thinks of minor numbers as eight bits in size, you are going to have trouble using minor device numbers over 255.
To help during this transitional period, the AoE driver may be compiled without support for partitions. That way, instead of there being 16 minor numbers per disk, there's only one per disk. So even on systems that haven't caught up to the large minor device numbers of 2.6, you still can use up to 256 AoE disks.
The commands for creating the filesystem and its logical volume are shown in Listing 2. Stan decides to name the volume group ben and the logical volume franklin. LVM2 now needs a couple of tweaks made to its configuration. For one, it needs a line with types = [ "aoe", 16 ] so that LVM recognizes AoE disks. Next, it needs md_component_detection = 1, so the disks inside RAID 10 are ignored when the whole RAID 10 becomes a physical volume.
|Designing Electronics with Linux||May 22, 2013|
|Dynamic DNS—an Object Lesson in Problem Solving||May 21, 2013|
|Using Salt Stack and Vagrant for Drupal Development||May 20, 2013|
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
- New Products
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Designing Electronics with Linux
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Nice article, thanks for the
2 hours 43 min ago
- I once had a better way I
8 hours 29 min ago
- Not only you I too assumed
8 hours 46 min ago
- another very interesting
10 hours 39 min ago
- Reply to comment | Linux Journal
12 hours 33 min ago
- Reply to comment | Linux Journal
19 hours 27 min ago
- Reply to comment | Linux Journal
19 hours 43 min ago
- Favorite (and easily brute-forced) pw's
21 hours 34 min ago
- Have you tried Boxen? It's a
1 day 3 hours ago
- seo services in india
1 day 7 hours ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi
It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?