Use Linux as a SAN Provider
Storage Area Networks (SANs) are becoming commonplace in the industry. Once restricted to large data centers and Fortune 100 companies, this technology has dropped in price to the point that small startups are using it for centralized storage. The strict definition of a SAN is a set of storage devices that are accessible over the network at a block level. This differs from a Network Attached Storage (NAS) device in that a NAS runs its own filesystem and presents that volume to the network; it does not need to be formatted by the client machine. Whereas a NAS usually is presented with the NFS or CIFS protocol, a SAN running on the same Ethernet often is presented as iSCSI, although other technologies exist.
iSCSI is the same SCSI protocol used for local disks, but encapsulated inside IP to allow it to run over the network in the same way any other IP protocol does. Because of this, and because it is seen as a block device, it often is almost indistinguishable from a local disk from the point of view of the client's operating system and is completely transparent to applications.
The iSCSI protocol is defined in RFC 3720 and runs over TCP ports 860 and 3260. In addition to iSCSI, many SANs use Fibre Channel as a transport. This is an improvement over Gigabit Ethernet, mainly because it runs at 4 or 8Gb/s as opposed to 1Gb/s. In the same vein, 10 Gigabit Ethernet would have an advantage over Fibre Channel.
The downside to Fibre Channel is the expense. A Fibre Channel switch often runs many times the cost of a typical Ethernet switch and comes with far fewer ports. There are other advantages to Fibre Channel, such as the ability to run over very long distances, but these aren't usually the decision-making factors when purchasing a SAN.
In addition to Fibre Channel and iSCSI, ATA over Ethernet (AoE) also is starting to make some headway. In the same way that iSCSI provides SCSI commands over an IP network, AoE provides ATA commands over an Ethernet network. AoE runs directly on Ethernet, not on top of IP the way iSCSI does. Because of this, it has less overhead and often is faster than iSCSI in the same environment. The downside is that it cannot be routed. AoE also is far less mature than iSCSI, and fewer large networking companies are looking to support AoE. Another disadvantage of AoE is that it has no built-in security outside of MAC filtering. As it is relatively easy to spoof a MAC address, this means anyone on the local network can access any AoE volumes.
The first step in moving down the road to a SAN is deciding whether to use one. Although a SAN often is faster than a NAS, it also is less flexible. For example, the size or the filesystem of a NAS volume usually can be changed on the host system without the client system having to make any changes. With a SAN, because it is seen as a block device like a local disk, it is subject to a lot of the same rules as a local disk. So, if a client is running its /usr filesystem on an iSCSI device, it would have to be taken off-line and modified not just on the server side, but also on the client side. The client would have to grow the filesystem on top of the device.
There are some significant differences between a SAN volume and a local disk. A SAN volume can be shared between computers. Often, this presents all kinds of locking problems, but with an application aware that its volume is shared out to multiple systems, this can be a powerful tool for failover, load balancing or communication. Many filesystems exist that are designed to be shared. GFS from Red Hat and OCFS from Oracle (both GPL) are good examples of such filesystems.
The network is another consideration in choosing a SAN. Gigabit Ethernet is the practical minimum for running modern network storage. Although a 100- or even a 10-megabit network theoretically would work, the practical results would be extremely slow. If you are running many volumes or requiring lots of reads and writes to the SAN, consider running a dedicated gigabit network. This will prevent the SAN data from conflicting with your regular IP data and, as an added bonus, increase security on your storage.
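To put rough numbers on that difference, the following sketch estimates how long a bulk transfer would take at each link speed. It is back-of-the-envelope arithmetic only: the helper name xfer_seconds and the 100GB figure are made up for illustration, and protocol overhead, latency and disk speed are ignored.

```shell
# xfer_seconds SIZE_GB LINK_MBPS
# Prints the whole seconds needed to move SIZE_GB (decimal gigabytes)
# over a link running at LINK_MBPS megabits per second, at full line rate.
xfer_seconds() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%d\n", (gb * 8 * 1000) / mbps }'
}

# Compare 10-megabit, 100-megabit and gigabit links for a 100GB volume.
for rate in 10 100 1000; do
    echo "${rate} Mb/s: $(xfer_seconds 100 "$rate") seconds for 100GB"
done
```

Even under these ideal assumptions, the same copy that takes minutes on gigabit takes hours on a 100-megabit link.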
Security also is a concern. Because none of the major SAN protocols are encrypted, a network sniffer could expose your data. In theory, iSCSI could be run over IPsec or a similar protocol, but without hardware acceleration, doing so would mean a large drop in performance. In lieu of this, at the very least, keeping the SAN data on its own VLAN is required.
Because it is the most popular of the various SAN protocols available for Linux, I use iSCSI in the examples in this article. But, the concepts should transfer easily to AoE if you've selected that for your systems. If you've selected Fibre Channel, the overall approach still applies, but the details differ more. You will need to rely more on your switch for most of your authentication and routing. On the positive side, most modern Fibre Channel switches provide excellent setup tools for doing this.
To this point, I have been using the terms client and server, but that is not completely accurate for iSCSI technology. In the iSCSI world, people refer to clients as initiators and servers or other iSCSI storage devices as targets. Here, I use the Open-iSCSI Project to provide the initiator and the iSCSI Enterprise Target (IET) Project to provide the target. These pieces of software are available in the default repositories of most major Linux distributions. Consult your distribution's documentation for the package names to install or download the source from www.open-iscsi.org and iscsitarget.sourceforge.net. Additionally, you'll need iSCSI over TCP/IP in your kernel, selectable in the low-level SCSI drivers section.
In preparation for setting up the target, you need to provide it with a disk. This can be a physical disk or you can create a disk image. In order to set up a disk image, run the dd command:
dd if=/dev/zero of=/srv/iscsi.image.0 bs=1 seek=10M count=1
This command creates a sparse file of about 10MB called /srv/iscsi.image.0 that reads back as zeros. This is going to represent the first iSCSI disk. To create another, do this:
dd if=/dev/zero of=/srv/iscsi.image.1 bs=1 seek=10M count=1
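If you want to see what that dd invocation actually produces before touching /srv, the sketch below runs the same command against a scratch directory and compares the apparent size with the space allocated on disk. It assumes GNU dd (for the 10M suffix); the IMG_DIR and APPARENT names are only for illustration.

```shell
# Work in a throwaway directory instead of /srv.
IMG_DIR=$(mktemp -d)
IMG="$IMG_DIR/iscsi.image.0"

# Seek 10M one-byte blocks into the file, then write a single zero byte.
dd if=/dev/zero of="$IMG" bs=1 seek=10M count=1 2>/dev/null

# Apparent size in bytes: 10 * 1024 * 1024 + 1.
APPARENT=$(wc -c < "$IMG")
echo "apparent size: $APPARENT bytes"

# Space actually allocated, in kilobytes. On filesystems that support
# sparse files, this is usually far smaller than the apparent size.
du -k "$IMG"

# Remove the scratch image.
rm -rf "$IMG_DIR"
```

Because only the final byte is written, the image takes almost no real space until the initiator starts writing to it.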
Configuration for the IET software is located in /etc/ietd.conf. Though a lot of tweaks are available in the file, the important lines really are just the target name and LUN. For each target, exported disks must have a unique LUN. Target names are formatted specially. The official term for this name is the iSCSI Qualified Name (IQN).
The format is:
iqn.yyyy-mm.(reversed domain name):label
where iqn is required, yyyy signifies a four-digit year, followed by mm (a two-digit month) and a reversed domain name, such as org.michaelnugent. The label is a user-defined string in order to better identify the target.
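As a small illustration of that layout, the following sketch builds an IQN from its parts and performs a loose format check. The helper names make_iqn and is_iqn are hypothetical, and the pattern is deliberately simple rather than a complete validator.

```shell
# make_iqn YEAR MONTH DOMAIN LABEL
# Reverses DOMAIN (michaelnugent.org -> org.michaelnugent) and assembles the name.
make_iqn() {
    reversed=$(printf '%s\n' "$3" |
        awk -F. '{ for (i = NF; i > 1; i--) printf "%s.", $i; print $1 }')
    printf 'iqn.%s-%s.%s:%s\n' "$1" "$2" "$reversed" "$4"
}

# is_iqn NAME
# Succeeds if NAME looks like iqn.yyyy-mm.reversed.domain with an optional :label.
is_iqn() {
    printf '%s\n' "$1" |
        grep -Eq '^iqn\.[0-9]{4}-(0[1-9]|1[0-2])\.[a-z0-9]([a-z0-9.-]*[a-z0-9])?(:.+)?$'
}

name=$(make_iqn 2009 05 michaelnugent.org iscsi-target)
echo "$name"
is_iqn "$name" && echo "looks like a valid IQN"
```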
Here is an example ietd.conf file using the images created above and a physical disk, sdd:
Target iqn.2009-05.org.michaelnugent:iscsi-target
    IncomingUser michael secretpasswd
    OutgoingUser michael secretpasswd
    Lun 0 Path=/srv/iscsi.image.0,Type=fileio
    Lun 1 Path=/srv/iscsi.image.1,Type=fileio
    Lun 2 Path=/dev/sdd,Type=blockio
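A mistyped Path is an easy mistake to make in this file, so it can be worth checking that every exported path exists before starting the target. The sketch below does that with a hypothetical helper, check_lun_paths, and runs it against a scratch configuration (with one deliberately wrong filename) rather than the live /etc/ietd.conf.

```shell
# check_lun_paths FILE
# Prints "missing: PATH" for every Lun line whose Path= does not exist.
check_lun_paths() {
    sed -n 's/^[[:space:]]*Lun[[:space:]]\{1,\}[0-9]\{1,\}[[:space:]]\{1,\}Path=\([^,]*\).*/\1/p' "$1" |
    while IFS= read -r path; do
        [ -e "$path" ] || echo "missing: $path"
    done
}

# Demo: one image that exists and one Lun pointing at a misspelled name.
demo_dir=$(mktemp -d)
: > "$demo_dir/iscsi.image.0"
cat > "$demo_dir/ietd.conf" <<EOF
Target iqn.2009-05.org.michaelnugent:iscsi-target
    Lun 0 Path=$demo_dir/iscsi.image.0,Type=fileio
    Lun 1 Path=$demo_dir/iscsi.images.1,Type=fileio
EOF

check_lun_paths "$demo_dir/ietd.conf"
```

On a real target, you would point the helper at your own configuration file instead of the demo copy.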
The IncomingUser is used during discovery to authenticate iSCSI initiators. If it is not specified, any initiator will be allowed to connect to open a session. The OutgoingUser is used during discovery to authenticate the target to the initiator. For simplicity, I made them the same in this example, but they don't need to be. Note that the RFC requires both passwords (the CHAP secrets) to be at least 12 characters (96 bits) long. The Microsoft initiator enforces this strictly, though the Linux one does not.
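Since the Linux tools will not complain about a short secret, a quick length check before distributing passwords can save a confusing failure later. This is a minimal sketch; chap_secret_ok is a hypothetical helper, and it checks only the 12-character minimum, not the strength of the secret.

```shell
# chap_secret_ok SECRET
# Succeeds if SECRET is at least 12 characters long.
chap_secret_ok() {
    [ "${#1}" -ge 12 ]
}

# Try the password from the example and a shorter one.
for secret in secretpasswd secret; do
    if chap_secret_ok "$secret"; then
        echo "$secret: long enough"
    else
        echo "$secret: too short (${#secret} characters)"
    fi
done
```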
Start the server using /etc/init.d/iscsitarget start (this may change depending on your distribution). Running ps ax | grep ietd will show you that the server is running.
Now you can move on to setting up the initiator to receive data from the target. To set up an initiator, place its name (in IQN format) in the /etc/iscsi/initiatorname.iscsi file (or possibly /etc/initiatorname.iscsi). A well-formatted file consists of a single line of the form InitiatorName= followed by the IQN.
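As a sketch, the commands below write such a file to a scratch location so the result can be inspected safely. The initiator name used here, with the label initiator-01, is a made-up example; on a real system the file would go to the path your distribution expects.

```shell
# Write to a temporary file rather than /etc/iscsi/initiatorname.iscsi.
INITIATOR_FILE=$(mktemp)

# Hypothetical initiator name in IQN format.
echo "InitiatorName=iqn.2009-05.org.michaelnugent:initiator-01" > "$INITIATOR_FILE"

# Show the resulting single-line file.
cat "$INITIATOR_FILE"
```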
In addition, you also need to modify the /etc/iscsi/iscsid.conf file to match the user names and passwords set in the ietd.conf file above:
node.session.auth.authmethod = CHAP
node.session.auth.username = michael
node.session.auth.password = secretpasswd
node.session.auth.username_in = michael
node.session.auth.password_in = secretpasswd
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = michael
discovery.sendtargets.auth.password = secretpasswd
discovery.sendtargets.auth.username_in = michael
discovery.sendtargets.auth.password_in = secretpasswd
Once this is done, run the iscsiadm command to discover the target.
iscsiadm -m discovery -t sendtargets -p 192.168.0.1 -P 1
This should output the following:
Target: iqn.2009-05.org.michaelnugent:iscsi-target
    Portal: 192.168.0.1:3260,1
        IFace Name: default
Now, at any time, you can run:
iscsiadm -m node -P1
which will redisplay the target information.
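If you prefer to log in to targets one at a time rather than restarting the initiator service, the discovery output can be turned into explicit iscsiadm login commands. The sketch below only prints the commands, it does not run them. The sample text mirrors the discovery output for this article's target on the default port 3260, and login_commands is a hypothetical helper.

```shell
# Sample text in the layout printed by "iscsiadm -m discovery ... -P 1".
discovery_output='Target: iqn.2009-05.org.michaelnugent:iscsi-target
    Portal: 192.168.0.1:3260,1
        IFace Name: default'

# login_commands
# Reads discovery output on stdin and prints one login command per
# target/portal pair, dropping the ",1" portal group tag.
login_commands() {
    awk '$1 == "Target:" { target = $2 }
         $1 == "Portal:" { split($2, p, ",")
                           print "iscsiadm -m node -T " target " -p " p[1] " --login" }'
}

printf '%s\n' "$discovery_output" | login_commands
```

Review the printed commands before running them as root on the initiator.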
Now, run /etc/init.d/iscsi restart. Doing so will connect to the new block devices. Run dmesg and fdisk -l to view them. Because these are raw block devices, they look like physical disks to Linux. They'll show up as the next SCSI device, such as /dev/sdb. They still need to be partitioned and formatted to be usable. After this is done, mount them normally and they'll be ready to use.
This sets up the average iSCSI volume. Often though, you may want machines to run entirely diskless. For that, you need to run root on iSCSI as well. This is a bit more involved. The easiest, but more expensive way is to employ a network card with iSCSI built in. That allows the card to mount the volume and present it without having to do any additional work. On the downside, these cards are significantly more expensive than the average network card.
To create a diskless system without an iSCSI-capable network card, you need to employ PXE boot. This requires that a DHCP server be available in order for the initiator to receive an address. That DHCP server will have to refer to a TFTP server in order for the machine to download its kernel and initial ramdisk. That kernel and ramdisk will have iSCSI and discovery information in it. This enables the average PXE-enabled card to act as a more expensive iSCSI-enabled network card.
Another feature often run with iSCSI is multipathing. This allows Linux to use multiple networks at once to access the iSCSI target. It usually is run on separate physical networks, so in the event that one fails, the other still will be up and the initiator will not experience loss of a volume or a system crash. Multipathing can be set up in two ways, either active/passive or active/active. Active/active generally is the preferred way, as it can be set up not only for redundancy, but also for load balancing. Like Fibre Channel, multipath assigns World Wide Identifiers (WWIDs) to devices. These are guaranteed to be unique and unchanging. When one of the paths is removed, the other one continues to function. The initiator may experience slower response time, but it will continue to function. Re-integrating the second path allows the system to return to its normal state.
When working with local disks, people often turn to Linux's software RAID or LVM systems to provide redundancy, growth and snapshotting. Because SAN volumes show up as block devices, it is possible to use these tools on them as well. Use them with care though. Setting up RAID 5 across three iSCSI volumes causes a great deal of network traffic and almost never gives you the results you're expecting. Although, if you have enough bandwidth available and you aren't doing many writes, a RAID 1 setup across multiple iSCSI volumes may not be completely out of the question. If one of these volumes drops, rebuilding may be an expensive process. Be careful about how much bandwidth you allocate to rebuilding the array if you're in a production environment. Note that this could be used at the same time as multipathing in order to increase your bandwidth.
To set up RAID 1 over iSCSI, first load the RAID 1 module:
modprobe raid1
Partition your first disk, /dev/sdb, remembering to set the partition type to Linux RAID autodetect (fd). Then, copy the partition table to your second disk, /dev/sdc:
sfdisk -d /dev/sdb | sfdisk /dev/sdc
Assuming you set up only one partition, use the mdadm command to create the RAID group:
mdadm --create /dev/md0 --level=1 --raid-disks=2 /dev/sdb1 /dev/sdc1
After that, cat the /proc/mdstat file to watch the state of the synchronization of the iSCSI volumes. This also is a good time to measure your network throughput to see if it will stand up under production conditions.
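To follow the synchronization from a script rather than by eye, you can pull the percentage out of the mdstat text. The sketch below uses a hypothetical helper, sync_progress, and demonstrates it on a sample file whose numbers are invented for illustration. On a live system, call it with no argument to read /proc/mdstat.

```shell
# sync_progress [FILE]
# Prints the resync or recovery percentage found in FILE
# (default /proc/mdstat). Prints nothing if no sync is running.
sync_progress() {
    grep -Eo '(resync|recovery) *= *[0-9.]+%' "${1:-/proc/mdstat}" |
        grep -Eo '[0-9.]+%'
}

# Demo input in the general layout of /proc/mdstat during a resync.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      10176 blocks [2/2] [UU]
      [=====>...............]  resync = 27.5% (2816/10176) finish=0.1min speed=1408K/sec
EOF

sync_progress "$sample"
```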
Running a SAN on Linux is an excellent way to bring up a shared environment in a reasonable amount of time using commodity parts. Spending a few thousand dollars to create a multiterabyte array is a small budget when many commercial arrays easily can extend into the tens to hundreds of thousands of dollars. In addition, you gain flexibility. Linux allows you to manipulate the underlying technologies in ways most of the commercial arrays do not. If you're looking for a more-polished solution, the Openfiler Project provides a nice layout and GUI to navigate. It's worth noting that many commercial solutions run a Linux kernel under their shell, so unless you specifically need features or support that isn't available with standard Linux tools, there's little reason to look to commercial vendors for a SAN solution.
Michael Nugent has spent a good deal of his time designing large-scale solutions to fit into tiny budgets, leveraging Linux to fulfill roles that typically would be filled by large commercial appliances. Recently, Michael has been working to design large, private clouds for SaaS environments in the financial industry. When not building systems, he likes sailing, scuba diving and hanging out with his cat, MIDI. Michael can be reached at firstname.lastname@example.org.