Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the LAN
Listing 2. Setting Up the Software RAID and the LVM Volume Group
# speed up RAID initialization for f in `find /proc | grep speed`; do echo 100000 > $f done # create mirrors (mdadm will manage hot spares) mdadm -C /dev/md1 -l 1 -n 2 \ /dev/etherd/e0.0 /dev/etherd/e0.1 mdadm -C /dev/md2 -l 1 -n 2 \ /dev/etherd/e0.2 /dev/etherd/e0.3 mdadm -C /dev/md3 -l 1 -n 2 \ /dev/etherd/e0.4 /dev/etherd/e0.5 mdadm -C /dev/md4 -l 1 -n 2 -x 2 \ /dev/etherd/e0.6 /dev/etherd/e0.7 \ /dev/etherd/e0.8 /dev/etherd/e0.9 sleep 1 # create the stripe over the mirrors mdadm -C /dev/md0 -l 0 -n 4 \ /dev/md1 /dev/md2 /dev/md3 /dev/md4 # make the RAID 10 into an LVM physical volume pvcreate /dev/md0 # create an extendible LVM volume group vgcreate ben /dev/md0 # look at how many "physical extents" there are vgdisplay ben | grep -i 'free.*PE' # create a logical volume using all the space lvcreate --extents 88349 --name franklin ben modprobe jfs mkfs -t jfs /dev/ben/franklin mkdir /bf mount /dev/ben/franklin /bf
I duplicated Stan's setup on a Debian sarge system with two 2.1GHz Athlon MP processors and 1GB of RAM, using an Intel PRO/1000 MT Dual-Port NIC and puny 40GB drives. The network switch was a Netgear FS526T. With a RAID 10 across eight of the EtherDrive blades in the Coraid shelf, I saw a sustainable read throughput of 23.58MB/s and a write throughput of 17.45MB/s. Each measurement was taken after flushing the page cache by copying a 1GB file to /dev/null, and a sync command was included in the write times.
The RAID 10 in this case has four stripe elements, each one a mirrored pair of drives. In general, you can estimate the throughput of a collection of EtherDrive blades easily by considering how many stripe elements there are. For RAID 10, there are half as many stripe elements as disks, because each disk is mirrored on another disk. For RAID 5, there effectively is one disk dedicated to parity data, leaving the rest of the disks as stripe elements.
The expected read throughput is the number of stripe elements times 6MB/s. That means if Stan bought two shelves initially and constructed an 18-blade RAID 10 instead of his 8-blade RAID 10, he would expect to get a little more than twice the throughput. Stan doesn't need that much throughput, though, and he wanted to start small, with a 1.6TB filesystem.
Listing 3 shows how Stan easily can expand the filesystem when he buys another shelf. The listings don't show Stan's mdadm-aoe.conf file or his startup and shutdown scripts. The mdadm configuration file tells an mdadm process running in monitor mode how to manage the hot spares, so that they're ready to replace any failed disk in any mirror. See spare groups in the mdadm man page.
Listing 3. To expand the filesystem without unmounting it, set up a second RAID 10 array, add it to the volume group and then increase the filesystem.
# after setting up a RAID 10 for the second shelf # as /dev/md5, add it to the volume group vgextend ben /dev/md5 vgdisplay ben | grep -i 'free.*PE' # grow the logical volume and then the jfs lvextend --extents +88349 /dev/ben/franklin mount -o remount,resize /bf
The startup and shutdown scripts are easy to create. The startup script simply assembles each mirrored pair RAID 1, assembles each RAID 0 and starts an mdadm monitor process. The shutdown script stops the mdadm monitor, stops the RAID 0s and, finally, stops the mirrors.
Now that we've seen a concrete example of ATA over Ethernet in action, readers might be wondering what would happen if another host had access to the storage network. Could that second host mount the JFS filesystem and access the same data? The short answer is, “Not safely!” JFS, like ext3 and most filesystems, is designed to be used by a single host. For these single-host filesystems, filesystem corruption can result when multiple hosts mount the same block storage device. The reason is the buffer cache, which is unified with the page cache in 2.6 kernels.
Linux aggressively caches filesystem data in RAM whenever possible in order to avoid using the slower block storage, gaining a significant performance boost. You've seen this caching in action if you've ever run a find command twice on the same directory.
Some filesystems are designed to be used by multiple hosts. Cluster filesystems, as they are called, have some way of making sure that the caches on all of the hosts stay in sync with the underlying filesystem. GFS is a great open-source example. GFS uses cluster management software to keep track of who is in the group of hosts accessing the filesystem. It uses locking to make sure that the different hosts cooperate when accessing the filesystem.
By using a cluster filesystem such as GFS, it is possible for multiple hosts on the Ethernet network to access the same block storage using ATA over Ethernet. There's no need for anything like an NFS server, because each host accesses the storage directly, distributing the I/O nicely. But there's a snag. Any time you're using a lot of disks, you're increasing the chances that one of the disks will fail. Usually you use RAID to take care of this issue by introducing some redundancy. Unfortunately, Linux software RAID is not cluster-aware. That means each host on the network cannot do RAID 10 using mdadm and have things simply work out.
Cluster software for Linux is developing at a furious pace. I believe we'll see good cluster-aware RAID within a year or two. Until then, there are a few options for clusters using AoE for shared block storage. The basic idea is to centralize the RAID functionality. You could buy a Coraid RAIDblade or two and have the cluster nodes access the storage exported by them. The RAIDblades can manage all the EtherDrive blades behind them. Or, if you're feeling adventurous, you also could do it yourself by using a Linux host that does software RAID and exports the resulting disk-failure-proofed block storage itself, by way of ATA over Ethernet. Check out the vblade program (see Resources) for an example of software that exports any storage using ATA over Ethernet.
Practical Task Scheduling Deployment
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.View Now!
|The Firebird Project's Firebird Relational Database||Jul 29, 2016|
|Stunnel Security for Oracle||Jul 28, 2016|
|SUSE LLC's SUSE Manager||Jul 21, 2016|
|My +1 Sword of Productivity||Jul 20, 2016|
|Non-Linux FOSS: Caffeine!||Jul 19, 2016|
|Murat Yener and Onur Dundar's Expert Android Studio (Wrox)||Jul 18, 2016|
- Stunnel Security for Oracle
- The Firebird Project's Firebird Relational Database
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- SUSE LLC's SUSE Manager
- Managing Linux Using Puppet
- My +1 Sword of Productivity
- Non-Linux FOSS: Caffeine!
- Google's SwiftShader Released
- Doing for User Space What We Did for Kernel Space
- SuperTuxKart 0.9.2 Released
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide