NVMe over Fabrics Support Coming to the Linux 4.8 Kernel

The Flash Memory Summit recently wrapped up in Santa Clara, California, and one Flash technology stole the show: NVMe over Fabrics (NVMeF). From the many presentations and company announcements, it was obvious that NVMeF was the topic that most interested the attendees.

With the first industry specifications announced in 2011, Non-Volatile Memory Express (NVMe) quickly rose to the forefront of Solid State Drive (SSD) technologies. Historically, SSDs were built on top of Serial ATA (SATA), Serial Attached SCSI (SAS) and Fibre Channel buses. These interfaces worked well for the maturing Flash memory technology, but with all the protocol overhead and bus speed limitations, it did not take long for the drives to run into performance bottlenecks. Today, modern SAS drives operate at 12 Gbit/s, while modern SATA drives operate at 6 Gbit/s. This is why the technology shifted its focus to PCI Express (PCIe). With the bus closer to the CPU and PCIe capable of increasingly higher speeds, SSDs seemed to fit right in. Using PCIe 3.0, modern drives can achieve speeds as high as 40 Gbit/s. NVMe was conceived to take advantage of these PCIe benefits. Support for NVMe drives was integrated into the Linux 3.3 mainline kernel (2012).

What really makes NVMe shine over the operating system's SCSI stack is its simpler and faster queueing mechanism, which is built from paired Submission Queues (SQs) and Completion Queues (CQs). Each queue is a circular buffer of a fixed size. The operating system uses a submission queue to submit one or more commands to the NVMe controller, and the controller reports the results in the paired completion queue. One or more of these queues also can be pinned to specific cores, which allows for more uninterrupted operation.
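To make the queueing idea more concrete, here is a minimal sketch in Python of one submission/completion queue pair built on fixed-size circular buffers. It is a toy model, not driver code: the class names (Ring, QueuePair), the method names and the tuple-based "commands" are all hypothetical, and real NVMe queues live in memory shared with the controller and are advanced through doorbell registers, which this sketch leaves out.

```python
class Ring:
    """Fixed-size circular buffer with head/tail indices.

    One slot is always left empty so that "full" and "empty"
    can be told apart from the indices alone.
    """

    def __init__(self, depth):
        if depth < 2:
            raise ValueError("depth must be at least 2")
        self.depth = depth
        self.slots = [None] * depth
        self.head = 0  # next entry the consumer will read
        self.tail = 0  # next free slot for the producer

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        return (self.tail + 1) % self.depth == self.head

    def push(self, entry):
        if self.is_full():
            raise BufferError("queue full")
        self.slots[self.tail] = entry
        self.tail = (self.tail + 1) % self.depth  # wrap around

    def pop(self):
        if self.is_empty():
            raise BufferError("queue empty")
        entry = self.slots[self.head]
        self.head = (self.head + 1) % self.depth  # wrap around
        return entry


class QueuePair:
    """One submission queue (SQ) paired with one completion queue (CQ)."""

    def __init__(self, depth):
        self.sq = Ring(depth)
        self.cq = Ring(depth)

    def submit(self, command_id, opcode):
        """Host side: place a command in the submission queue."""
        self.sq.push((command_id, opcode))

    def controller_process(self):
        """Toy controller: consume commands, post one completion each."""
        done = 0
        while not self.sq.is_empty() and not self.cq.is_full():
            command_id, _opcode = self.sq.pop()
            self.cq.push((command_id, "success"))
            done += 1
        return done

    def reap(self):
        """Host side: collect every completion posted so far."""
        completions = []
        while not self.cq.is_empty():
            completions.append(self.cq.pop())
        return completions


if __name__ == "__main__":
    qp = QueuePair(depth=4)
    qp.submit(1, "read")
    qp.submit(2, "write")
    qp.controller_process()
    print(qp.reap())
```

Because each pair has its own indices, a host could in principle keep one pair per CPU core and never contend with other cores over a shared queue, which is the property the pinning described above relies on.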

Almost immediately, the PCIe SSDs were marketed for enterprise-class computing with a much higher price tag. Although these drives are still more expensive than their SAS or SATA cousins, the cost per gigabyte of Flash memory continues to drop, enough to convince more companies to adopt the technology. However, there was still a problem. Unlike the SAS or SATA SSDs, NVMe drives did not scale very well. They were confined to the server they were plugged into.

In the world of SAS or SATA, you have the Storage Area Network (SAN). SANs are designed around SCSI standards. The primary goal of a SAN (or any other storage network) is to provide access to one or more storage volumes across one or more paths to one or more operating system hosts in a network. Today, the most commonly deployed SAN is based on iSCSI, which is SCSI over TCP/IP. Technically, NVMe drives can be configured within a SAN environment, although the protocol overhead introduces latencies that make it a less than ideal implementation. In 2014, the NVM Express committee was poised to rectify this with the NVMeF standard.

The goals behind NVMeF are simple: enable an NVMe transport bridge, which is built around the NVMe queuing architecture, and avoid any and all protocol translation overhead other than the supported NVMe commands (end to end). With such a design, network latencies noticeably drop (to less than 200 ns). This design relies on the use of PCIe switches. A second design, which has been gaining ground, is based on existing Ethernet fabrics using Remote Direct Memory Access (RDMA).

A Tale of Two Networks: a comparison between PCIe Fabrics and Other Storage Networks

Call it a coincidence, but the first release candidate for the 4.8 kernel recently introduced a lot of new code to support NVMeF. The patches were submitted as part of a joint effort by the hard-working developers over at Intel, Samsung and others. Three major components were patched into the release candidate of the kernel. The first is the general NVMe Target Support framework, which enables block devices to be exported from the Linux kernel using the NVMe protocol. Dependent upon this framework, there is now support for NVMe loopback devices and also NVMe over Fabrics RDMA Targets. If you recall, this last piece is one of the two more common NVMeF deployments. When a target is exported, it is assigned a "unique" NVMe Qualified Name (NQN). The concept is very similar to the iSCSI Qualified Name (IQN). This NQN is what enables other operating systems to import and use the remote NVMe device across a network potentially hosting multiple NVMe devices.
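As a rough illustration of what exporting a target involves, the Python sketch below builds a dry-run list of the configuration steps for one subsystem on one RDMA port. It assumes the configfs layout used by the kernel's nvmet driver as I understand it (a subsystem directory named after the NQN, a namespace pointing at a block device, and a port linked to the subsystem); the attribute names may differ between kernel versions. The function name export_plan, the NQN, the device path, the address and the service ID 4420 are all illustrative assumptions, and nothing here touches the filesystem.

```python
from pathlib import PurePosixPath

# Assumed mount point of the nvmet configfs tree.
NVMET_ROOT = PurePosixPath("/sys/kernel/config/nvmet")


def export_plan(nqn, device, port_id, transport, address, service_id, nsid=1):
    """Return the ordered steps to export one block device (dry run only).

    Each step is a tuple (operation, path, value):
      mkdir   -> create the configfs directory at path
      write   -> write value into the attribute file at path
      symlink -> create a link at path that points to value
    """
    subsys = NVMET_ROOT / "subsystems" / nqn
    namespace = subsys / "namespaces" / str(nsid)
    port = NVMET_ROOT / "ports" / str(port_id)
    return [
        # 1. Create the subsystem, named after its NQN.
        ("mkdir", str(subsys), None),
        ("write", str(subsys / "attr_allow_any_host"), "1"),
        # 2. Attach a block device as a namespace and enable it.
        ("mkdir", str(namespace), None),
        ("write", str(namespace / "device_path"), device),
        ("write", str(namespace / "enable"), "1"),
        # 3. Describe the fabric port the target listens on.
        ("mkdir", str(port), None),
        ("write", str(port / "addr_trtype"), transport),
        ("write", str(port / "addr_adrfam"), "ipv4"),
        ("write", str(port / "addr_traddr"), address),
        ("write", str(port / "addr_trsvcid"), service_id),
        # 4. Link the subsystem into the port to start exporting it.
        ("symlink", str(port / "subsystems" / nqn), str(subsys)),
    ]


if __name__ == "__main__":
    steps = export_plan(
        nqn="nqn.2016-08.com.example:demo-target",  # hypothetical NQN
        device="/dev/nvme0n1",                      # hypothetical device
        port_id=1,
        transport="rdma",
        address="192.168.0.10",                     # hypothetical address
        service_id="4420",
    )
    for operation, path, value in steps:
        print(operation, path, value if value is not None else "")
```

A host on the other side of the fabric would then refer to the same NQN when connecting, much as an iSCSI initiator refers to a target's IQN.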

Anyway, a lot of new and exciting things are in the works for Solid State Drives and the Linux kernel. We just need to keep our eyes open to see what comes next.

______________________

Petros Koutoupis is a software developer at IBM for its Cloud Object Storage division (formerly Cleversafe). He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for more than a decade.