InfiniBand and Linux

Learn why letting a remote system on the network scribble in your memory is fine, how user-space applications can send data without bothering the kernel and more facts about the new high-performance interconnect.

By the way, proper error checking has been omitted from the listings, although any real kernel code must check the return values of all functions for failure. All of the IB functions that allocate resources and return pointers use the standard Linux method for returning errors by way of the ERR_PTR() macro, which means that the status can be tested with IS_ERR(). For example, the call to ib_alloc_pd() in the real kernel actually looks like:

priv->pd = ib_alloc_pd(priv->ca);
if (IS_ERR(priv->pd)) {
        printk(KERN_WARNING "%s: failed "
               "to allocate PD\n", ca->name);
        return -ENODEV;
}
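
The error-pointer convention can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel's own header: the three helpers imitate ERR_PTR(), IS_ERR() and PTR_ERR(), while struct toy_pd and toy_alloc_pd() are hypothetical stand-ins invented here for a resource and its allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* User-space imitation of the kernel's error-pointer helpers. The kernel
 * treats the top MAX_ERRNO addresses as encoded negative errno values. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
        return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
        return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
        return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical resource, standing in for something like a PD. */
struct toy_pd {
        int handle;
};

/* Hypothetical allocator: returns a valid pointer or an encoded errno. */
static struct toy_pd *toy_alloc_pd(int simulate_failure)
{
        struct toy_pd *pd;

        if (simulate_failure)
                return ERR_PTR(-ENOMEM);
        pd = malloc(sizeof(*pd));
        if (!pd)
                return ERR_PTR(-ENOMEM);
        pd->handle = 1;
        return pd;
}

#ifdef TOY_DEMO_MAIN
int main(void)
{
        struct toy_pd *pd = toy_alloc_pd(1);   /* force the failure path */

        if (IS_ERR(pd)) {
                printf("allocation failed: %ld\n", PTR_ERR(pd));
                return 1;
        }
        free(pd);
        return 0;
}
#endif
```

Compiling with -DTOY_DEMO_MAIN builds a small demo that takes the failure path. The point of the design is that a single return value carries both the pointer and the error code, so callers need no extra output parameter.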

Next, the driver calls ib_create_cq(), which creates a completion queue (CQ). The driver requests that the function ipoib_ib_completion() be called when a completion event occurs and that the CQ be able to hold at least IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1 work completion structures. This size is required to handle the extreme case when the driver posts its maximum number of sends and receives and then does not get to run until they all have generated completions. Confusingly enough, CQs are the one IB resource not associated with a PD, so we don't have to pass our PD to this function.

Once the CQ is created, the driver calls ib_req_notify_cq() to request that the completion event function be called for the next work completion added to the CQ. The event function, ipoib_ib_completion(), processes completions until the CQ is empty. It then repeats the call to ib_req_notify_cq() so it is called again when more completions are available.
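
The sizing rule and the drain-then-rearm pattern can be modeled with a toy queue in user space. This is only a sketch under stated assumptions: every toy_ name and the TOY_ ring sizes are hypothetical, the "hardware" is an ordinary function call, and the real driver uses ib_poll_cq() and ib_req_notify_cq() on a real CQ instead.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical ring sizes standing in for IPOIB_TX_RING_SIZE and
 * IPOIB_RX_RING_SIZE; the CQ is sized as described in the text. */
#define TOY_TX_RING_SIZE 4
#define TOY_RX_RING_SIZE 4
#define TOY_CQ_SIZE (TOY_TX_RING_SIZE + TOY_RX_RING_SIZE + 1)

struct toy_wc {
        unsigned long wr_id;            /* ID of the finished work request */
};

struct toy_cq {
        struct toy_wc ring[TOY_CQ_SIZE];
        int head, count;
        int armed;                      /* one-shot notification requested? */
        void (*handler)(struct toy_cq *cq);
        int handled;                    /* completions processed so far */
};

/* "Hardware" side: queue a completion and, if the CQ is armed, fire the
 * event handler once. Returns -1 if the CQ would overrun. */
static int toy_cq_add(struct toy_cq *cq, unsigned long wr_id)
{
        if (cq->count == TOY_CQ_SIZE)
                return -1;
        cq->ring[(cq->head + cq->count) % TOY_CQ_SIZE].wr_id = wr_id;
        cq->count++;
        if (cq->armed) {
                cq->armed = 0;          /* notification is one-shot */
                cq->handler(cq);
        }
        return 0;
}

/* Remove one completion; returns 1 if one was found, 0 if the CQ is empty. */
static int toy_poll_cq(struct toy_cq *cq, struct toy_wc *wc)
{
        if (cq->count == 0)
                return 0;
        *wc = cq->ring[cq->head];
        cq->head = (cq->head + 1) % TOY_CQ_SIZE;
        cq->count--;
        return 1;
}

/* Event handler: drain the CQ until it is empty, then re-arm it. */
static void toy_completion(struct toy_cq *cq)
{
        struct toy_wc wc;

        while (toy_poll_cq(cq, &wc) > 0)
                cq->handled++;
        cq->armed = 1;
}

#ifdef TOY_DEMO_MAIN
int main(void)
{
        struct toy_cq cq = { .handler = toy_completion, .armed = 1 };

        toy_cq_add(&cq, 1);
        printf("handled %d completion(s)\n", cq.handled);
        return 0;
}
#endif
```

If the handler never gets to run while every posted send and receive completes, the queue in this model still has room for all of them, which is the extreme case the sizing is meant to cover.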

The driver then calls ib_get_dma_mr() to set up a memory region (MR) that can be used with DMA addresses obtained from the kernel's DMA mapping API. Translation tables are set up in the IB HCA to handle this, and a local key (L_Key) is returned that can be passed back to the HCA in order to refer to this MR.

Finally, the driver calls ib_create_qp() to create a queue pair (QP). This object is called a queue pair because it consists of a pair of work queues—one queue for send requests and one queue for receive requests. Creating a QP requires filling in the fairly large ib_qp_init_attr struct. The cap structure gives the sizes of the send and receive queues that are to be created. The sq_sig_type and rq_sig_type fields are set to IB_SIGNAL_ALL_WR so that all work requests generate a completion.

The qp_type field is set to IB_QPT_UD so that an unreliable datagram (UD) QP is created. There are four possible transports for an IB QP: reliable connected (RC), reliable datagram (RD), unreliable connected (UC) and unreliable datagram (UD). For the reliable transports, the IB hardware guarantees that every message either is delivered successfully or, if an unrecoverable failure such as an unplugged cable occurs, generates an error. For connected transports, all messages go to a single destination, which is set when the QP is set up, while datagram transports allow each message to be sent to a different destination.

Once the IPoIB driver has created its QP, it uses the QP to send the packets given to it by the network stack. Listing 2 shows what is required to post a request to the send queue of the QP.

First, the driver sets up the gather list for the send request. The lkey field is set to the L_Key of the MR that came from ib_get_dma_mr(). Because each packet the IPoIB driver sends sits in one contiguous chunk, the gather list has only a single entry, and the driver simply has to assign the address and length of the packet. The address in the gather list is a DMA address obtained from dma_map_single() rather than a virtual address. In general, software can use a longer gather list to have the HCA collect multiple buffers into a single message, which avoids having to copy the data into a single buffer first.
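
To make the gather idea concrete, here is a small user-space model of what the HCA does with a multi-entry gather list. It is a sketch only: struct toy_sge and toy_gather() are hypothetical names, the addresses are ordinary pointers rather than DMA addresses, the lkey is carried along but never checked, and the copy is done with memcpy() where real hardware would use DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical gather entry, loosely modeled on the address, length and
 * lkey fields described in the text. */
struct toy_sge {
        const void *addr;
        uint32_t length;
        uint32_t lkey;
};

/* Concatenate every gather entry, in order, into one outgoing message.
 * Returns the message length, or -1 if msg is too small to hold it. */
static long toy_gather(const struct toy_sge *sg_list, int num_sge,
                       void *msg, size_t msg_size)
{
        size_t off = 0;
        int i;

        for (i = 0; i < num_sge; i++) {
                if (sg_list[i].length > msg_size - off)
                        return -1;
                memcpy((char *)msg + off, sg_list[i].addr, sg_list[i].length);
                off += sg_list[i].length;
        }
        return (long)off;
}

#ifdef TOY_DEMO_MAIN
int main(void)
{
        const char hdr[] = "HDR:";
        const char body[] = "payload";
        struct toy_sge sg[2] = {
                { hdr, 4, 0x1234 },     /* header buffer */
                { body, 7, 0x1234 },    /* separate payload buffer */
        };
        char msg[16];
        long n = toy_gather(sg, 2, msg, sizeof(msg));

        printf("gathered %ld bytes\n", n);
        return 0;
}
#endif
```

In this model the sender keeps the header and payload in separate buffers and never builds a combined copy itself, which is the saving a longer gather list is meant to provide.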

The driver then fills in the rest of the fields of the send work request. The opcode is set to send, sg_list and num_sge are set for the gather list just filled in and the send flags are set to signaled so that the work request generates a completion when it finishes. The remote QP number and address handle are set, and the wr_id field is set to the driver's work request ID.

Once the work request is filled in, the driver calls ib_post_send(), which actually adds the request to the send queue. When the request is completed by the IB hardware, a work completion is added to the driver's CQ and eventually is handled by ipoib_ib_completion().

InfiniBand can do a lot, and the OpenIB Alliance is only getting started writing software to do it all. Now that Linux has basic support for IB, we will be implementing more upper-level protocols, including SDP and storage protocols. Another major area we are tackling is support for direct user-space access to IB—the kernel bypass feature we talked about earlier. There's plenty of interesting work to be done on IB, and the OpenIB Project is open to everyone, so come join the fun.

Resources for this article: /article/8131.

Roland Dreier is the maintainer and lead developer for Linux InfiniBand drivers through the OpenIB Project. Roland received his PhD in Mathematics from the University of California at Berkeley and has held a variety of positions in academic research and high tech. He has been employed by Topspin Communications since 2001.


