InfiniBand and Linux
After a long gestation, use of InfiniBand (IB) is taking off, and work is under way to add IB support to Linux. At the physical level, IB is similar to PCI Express: it carries data using multiple high-speed serial lanes. The first versions of the InfiniBand specification allowed only one signaling rate per lane, 2.5Gb/s, the same rate PCI Express uses. The latest version of the specification (1.2), however, adds support for 5Gb/s and 10Gb/s rates per lane. IB supports widths of 1X, 4X, 8X and 12X, while PCI Express supports X1, X2, X4, X8, X12, X16 and X32. The most commonly used IB speed today is 4X at 2.5Gb/s per lane, or 10Gb/s total. The 12X width combined with the 10Gb/s per-lane rate means the current IB spec supports links with an astonishing 120Gb/s of throughput.
Table 1. Comparison Chart
|Technology||Data Rate||Maximum Cable Length|
|Hi-Speed USB (USB 2.0)||480Mb/s||5m|
|IEEE 1394 (FireWire)||400Mb/s||4m|
|Gigabit Ethernet||1,000Mb/s||100m (cat5 cable)|
|10 Gigabit Ethernet||10,000Mb/s||10m (copper IB cable), 1+ km (optical)|
|Myrinet||2,000Mb/s||10m (copper), 200m (optical)|
|1X InfiniBand||2,000Mb/s||10m (copper), 1+ km (optical)|
|4X InfiniBand||8,000Mb/s||10m (copper), 1+ km (optical)|
|12X InfiniBand||24,000Mb/s||10m (copper), 1+ km (optical)|
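The text quotes 10Gb/s for a 4X link while Table 1 lists 8,000Mb/s. The two figures can be reconciled with a little arithmetic. The sketch below assumes the links use 8b/10b encoding, a detail the article does not state, so that 8 of every 10 bits on the wire carry data. The function name and its default lane rate are chosen here for illustration only.

```python
# Sketch: relate per-lane signaling rate to usable data rate.
# Assumption (not stated in the article): 8b/10b encoding, so only
# 8 of every 10 signaled bits are payload.

def link_rates_mbps(lanes, lane_signal_mbps=2500):
    """Return (signaling rate, data rate) in Mb/s for one link."""
    signal = lanes * lane_signal_mbps
    data = signal * 8 // 10
    return signal, data

for lanes in (1, 4, 12):
    signal, data = link_rates_mbps(lanes)
    print(f"{lanes}X at 2.5Gb/s/lane: {signal}Mb/s signaling, {data}Mb/s data")
```

Under this assumption the 1X, 4X and 12X widths at 2.5Gb/s per lane give data rates of 2,000, 8,000 and 24,000Mb/s, matching the InfiniBand rows of the table.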
Because IB is used to build network fabrics, IB supports both copper and optical cabling, while the PCI Express cable specification still is being developed. Most IB installations use copper cable (Figure 1), which can be used for distances up to about 10 meters. IB also allows a variety of optical cabling choices, which theoretically allow for links up to 10km.
In past years, IB was pitched as a replacement for PCI, but that no longer is expected to be the case. Instead, IB adapters should continue to be peripherals that connect to systems through PCI, PCI Express, HyperTransport or a similar peripheral bus.
The network adapters used to attach systems to an IB network are called host channel adapters (HCAs). Beyond the fabric's extremely high speed, IB HCAs provide a message passing interface that lets systems actually use the 10Gb/s or more of throughput that InfiniBand offers. Zero-copy networking is key to making use of IB's speed; otherwise, applications would spend all their time copying data.
The HCA interface has three key features that make zero-copy possible: a high-level work queue abstraction, kernel bypass and remote direct memory access (RDMA). The work queue abstraction means that instead of having to construct and process network traffic packet by packet, applications post work requests to queues processed by the HCA. A message sent with a single work request can be up to 4GB long, with the HCA taking care of breaking the message into packets, waiting for acknowledgements and resending dropped packets. Because the HCA hardware takes care of delivering large messages without any involvement from the CPU, applications receive more CPU time to generate and process the data they send and receive.
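The work queue idea can be illustrated with a toy model. The following is a rough sketch, not a real driver interface: the ToyHCA class, its method names and the 2,048-byte packet size are all assumptions made for illustration. The application posts one request per message, and the simulated adapter splits the message into packets and reports a single completion.

```python
from collections import deque
from dataclasses import dataclass

MTU = 2048  # bytes per packet; an illustrative value, not from the article

@dataclass
class WorkRequest:
    wr_id: int       # caller-chosen tag handed back in the completion
    payload: bytes   # the whole message, however large

class ToyHCA:
    """Hypothetical stand-in for an adapter; not a real IB API."""

    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()
        self.packets_on_wire = []

    def post_send(self, wr):
        # The application's only per-message work: queue one request.
        self.send_queue.append(wr)

    def process(self):
        # Models what the hardware does without CPU involvement:
        # split each message into MTU-sized packets, then report
        # one completion for the whole message.
        while self.send_queue:
            wr = self.send_queue.popleft()
            for off in range(0, len(wr.payload), MTU):
                self.packets_on_wire.append(wr.payload[off:off + MTU])
            self.completion_queue.append(wr.wr_id)

    def poll_cq(self):
        # Return the next completed wr_id, or None if nothing finished.
        return self.completion_queue.popleft() if self.completion_queue else None

hca = ToyHCA()
hca.post_send(WorkRequest(wr_id=1, payload=bytes(10_000)))
hca.process()
print(len(hca.packets_on_wire), "packets, completion for request", hca.poll_cq())
```

The sketch leaves out acknowledgements and retransmission, which a real HCA also handles on the application's behalf.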
Kernel bypass allows user applications to post work requests directly to and collect completion events directly from the HCA's queues, eliminating the system call overhead of switching to and from the kernel's context. A kernel driver sets up the queues, and standard memory protection is used to make sure that each process accesses only its own resources. All fast path operations, though, are done purely in user space.
The final piece, RDMA, allows messages to carry the destination address to which they should be written in memory. Specifying where data belongs is useful for applications such as serving storage over IB, where the server's reads from disk may complete out of order. Without RDMA, either the server has to waste time waiting when it has data ready to send or the client has to waste CPU power copying data to its final location.
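The storage case can be sketched in a few lines. This is a toy simulation, not IB code: the rdma_write helper and the 4-byte block size are hypothetical. Disk reads complete in a random order, and each result is written straight to its final offset in the client's buffer, so the server never waits and the client never copies.

```python
import random

BLOCK = 4  # bytes per disk block in this toy example

def rdma_write(target, offset, data):
    """Toy RDMA write: place data directly at its final offset."""
    target[offset:offset + len(data)] = data

disk = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]    # blocks 0..3 on the server
client_buffer = bytearray(BLOCK * len(disk))    # final destination on the client

# Disk reads finish in an arbitrary order; each result is sent as soon
# as it is ready, carrying the offset where it belongs.
completion_order = list(range(len(disk)))
random.shuffle(completion_order)
for block_no in completion_order:
    rdma_write(client_buffer, block_no * BLOCK, disk[block_no])

print(bytes(client_buffer))
```

Whatever order the reads complete in, the buffer ends up holding the blocks in sequence.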
Although the idea of remote systems scribbling on memory makes some queasy, IB allows applications to set strict address ranges and permissions for RDMA. If anything, IB RDMA is safer than letting a disk controller DMA into memory.
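As a rough illustration of that protection, the sketch below models a registered window with a bounds check and a permission flag. The ToyMemoryRegion class and its fields are hypothetical and do not correspond to a real IB interface; on real hardware the adapter enforces these checks.

```python
from dataclasses import dataclass

@dataclass
class ToyMemoryRegion:
    """Hypothetical model of a registered region; not a real IB API."""
    buffer: bytearray
    start: int           # first offset a remote peer may touch
    length: int          # size of the exposed window in bytes
    remote_write: bool   # permission chosen by the application

    def remote_write_at(self, offset, data):
        # Refuse writes when the permission is off or the range is wrong.
        if not self.remote_write:
            raise PermissionError("remote write not permitted")
        if offset < self.start or offset + len(data) > self.start + self.length:
            raise ValueError("access outside registered range")
        self.buffer[offset:offset + len(data)] = data

memory = bytearray(16)
region = ToyMemoryRegion(memory, start=4, length=8, remote_write=True)
region.remote_write_at(4, b"ok")         # inside the window: accepted
try:
    region.remote_write_at(11, b"xx")    # would spill past offset 12: rejected
except ValueError as err:
    print("rejected:", err)
```

Memory outside the registered window stays untouched no matter what the remote side requests.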
Beyond its high performance, IB also simplifies building and managing clusters by providing a single fabric that can carry networking and storage traffic in addition to cluster communication. Many groups have specified a wide variety of upper-level protocols that can run over IB, including:
IP-over-InfiniBand (IPoIB): the Internet Engineering Task Force (IETF) has a working group developing standards-track drafts for sending IP traffic over IB. These drafts eventually should lead to an RFC standard for IPoIB. IPoIB does not take full advantage of IB's performance, however, as traffic still passes through the IP stack and is sent packet by packet. IPoIB does provide a simple way to run legacy applications or send control traffic over IB.
Sockets Direct Protocol (SDP): the InfiniBand Trade Association itself has specified a protocol that maps standard socket operations onto native IB RDMA operations. This allows socket applications to run unchanged and still receive nearly all of IB's performance benefits.
SCSI RDMA Protocol (SRP): the InterNational Committee for Information Technology Standards (INCITS) T10 committee, which is responsible for SCSI standards, has published a standard for mapping the SCSI protocol onto IB. Work is under way on developing a second-generation SRP-2 protocol.
Many other groups also are studying and specifying the use of IB, including APIs from the DAT Collaborative and the Open Group's Interconnect Software Consortium, RDMA bindings for NFS and IB support for various MPI packages.
Of course, without open-source support, all of these fancy hardware capabilities are a lot less interesting to the Linux world. Fortunately, the OpenIB Alliance is an industry consortium dedicated to producing exactly that—a complete open-source IB stack. OpenIB currently has 15 member companies, including IB hardware vendors, server companies, software companies and research organizations.
Work on the OpenIB software began in February 2004, and the first kernel drivers were merged into Linux's tree in December 2004, right after the tree opened for 2.6.11 following the release of 2.6.10. The first batch of code merged into the kernel is the smallest set of IB drivers that do something useful. It contains a midlayer to abstract low-level hardware drivers from upper-level protocols, a single low-level driver for Mellanox HCAs, an IPoIB upper-level protocol driver and a driver to allow a subnet manager to run in user space.
A few snippets of code from the IPoIB driver should provide some understanding of how one can use the kernel's IB support. To see this code in context, you can look at the complete IPoIB driver, which is in the directory drivers/infiniband/ulp/ipoib in the Linux kernel source.
Listing 1 shows what the IPoIB driver does to allocate all of its IB resources. First, it calls ib_alloc_pd(), which allocates a protection domain (PD), an opaque container that every user of IB must have to hold other resources.