Data in a Flash, Part III: NVMe over Fabrics Using TCP

A remote NVMe block device exported via an NVMe over Fabrics network using TCP.

Version 5.0 of the Linux kernel brought with it many wonderful features, one of which was the introduction of NVMe over Fabrics (NVMeoF) across native TCP. If you recall, in the previous part of this series ("Data in a Flash, Part II: Using NVMe Drives and Creating an NVMe over Fabrics Network"), I explained how to enable your NVMe network across RDMA (an InfiniBand protocol) through a method referred to as RDMA over Converged Ethernet (RoCE). As the name implies, RoCE allows RDMA traffic to be carried across a traditional Ethernet network. And although this works well, it introduces a bit of overhead (along with latencies). So when the 5.0 kernel introduced native TCP support for NVMe targets, it simplified the procedure needed to configure the same network shown in my last article, and it also made accessing the remote NVMe drive faster.

Software Requirements

To continue with this tutorial, you'll need to have a 5.0 Linux kernel or later installed, with the following modules built and inserted into the operating systems of both your initiator (the server importing the remote NVMe volume) and the target (the server exporting its local NVMe volume):


# NVME Support
CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
# CONFIG_NVME_MULTIPATH is not set
CONFIG_NVME_FABRICS=m
CONFIG_NVME_RDMA=m
# CONFIG_NVME_FC is not set
CONFIG_NVME_TCP=m
CONFIG_NVME_TARGET=m
CONFIG_NVME_TARGET_LOOP=m
CONFIG_NVME_TARGET_RDMA=m
# CONFIG_NVME_TARGET_FC is not set
CONFIG_NVME_TARGET_TCP=m
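
If you're unsure whether your running kernel was built with these options, most distributions ship the kernel configuration under /boot, so you can check it directly (a quick check, assuming your distribution installs a config file there):


$ grep -E "NVME_TCP|NVME_TARGET_TCP" /boot/config-$(uname -r)

Both CONFIG_NVME_TCP and CONFIG_NVME_TARGET_TCP should come back set to m (module) or y (built in).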

More specifically, you need the module to import the remote NVMe volume:


CONFIG_NVME_TCP=m

And the module to export a local NVMe volume:


CONFIG_NVME_TARGET_TCP=m

Before continuing, make sure your physical (or virtual) machine is up to date. And once you verify that to be the case, make sure you are able to see all locally connected NVMe devices (which you'll export across your network):


$ cat /proc/partitions |grep -e nvme -e major
major minor  #blocks  name
 259        0 3907018584 nvme2n1
 259        1 3907018584 nvme3n1
 259        2 3907018584 nvme0n1
 259        3 3907018584 nvme1n1

If you don't see any connected NVMe devices, make sure the kernel module is loaded:


petros@ubu-nvme1:~$ lsmod|grep nvme
nvme                   32768  0
nvme_core              61440  1 nvme

The following modules need to be loaded on the initiator:


$ sudo modprobe nvme
$ sudo modprobe nvme-tcp

And, the following modules need to be loaded on the target:


$ sudo modprobe nvmet
$ sudo modprobe nvmet-tcp
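
If you'd like these modules to load automatically at boot on both machines, you can list them in a modules-load.d configuration file (a minimal sketch, assuming a systemd-based distribution; the filenames below are arbitrary):


# On the initiator:
$ printf "nvme\nnvme-tcp\n" | sudo tee /etc/modules-load.d/nvme-tcp.conf
# On the target:
$ printf "nvmet\nnvmet-tcp\n" | sudo tee /etc/modules-load.d/nvmet-tcp.conf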

Next, you'll install the drive management utility called nvme-cli. This utility is defined and maintained by the very same NVM Express committee that defined the NVMe specification. The source code is hosted in the linux-nvme/nvme-cli repository on GitHub, and a recent build is needed. Clone the source code from the GitHub repository, then build and install it:


$ git clone https://github.com/linux-nvme/nvme-cli.git
$ cd nvme-cli
$ make
$ sudo make install
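
Once installed, you can quickly confirm that the utility is in your PATH and see which build you're running:


$ nvme version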

Accessing the Drive across a Network over TCP

The purpose of this section is to leverage high-speed SSD technology and expand it beyond the local server. An NVMe drive does not have to be limited to the server it is physically plugged into. In this example, and for the sake of convenience, I'm using two virtual machines to create this network. There is absolutely no performance advantage in doing this, and I wouldn't recommend it unless you just want to follow along with the exercise. Realistically, you should enable the following only on physical machines connected with high-speed network cards. Anyway, in the target virtual machine, I attached a couple of low-capacity virtual NVMe drives (2GB each):


$ sudo nvme list
Node           SN             Model                  Namespace
-------------- -------------- ---------------------- ---------
/dev/nvme0n1   VB1234-56789   ORCL-VBOX-NVME-VER12     1
/dev/nvme0n2   VB1234-56789   ORCL-VBOX-NVME-VER12     2

Usage                      Format           FW Rev
-------------------------- ---------------- --------
2.15  GB /   2.15  GB      512   B +  0 B   1.0
2.15  GB /   2.15  GB      512   B +  0 B   1.0

[Note: the tabular output above has been modified for readability.]

The following instructions rely heavily on the kernel's configfs virtual filesystem (mounted under /sys/kernel/config). In theory, you could export NVMe targets with the open-source utility, nvmet-cli, which does all of that complex heavy lifting for you. But, where is the fun in that?

Exporting a Target

Mount the kernel configuration filesystem (configfs). This is a requirement; all of the NVMe Target instructions below require the NVMe Target tree to be available in this filesystem:


$ sudo /bin/mount -t configfs none /sys/kernel/config/
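
Note that many modern distributions mount configfs here automatically at boot, so it's worth checking whether it's already in place before mounting it yourself:


$ mount | grep configfs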

Create an NVMe Target subsystem to host your devices (to export) and change into its directory:


$ sudo mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
$ cd /sys/kernel/config/nvmet/subsystems/nvmet-test

This example will simplify host connections by leaving the newly created subsystem accessible to any and every host attempting to connect to it. In a production environment, you definitely should lock this down to specific host machines by their NQN (a sketch of that more restrictive setup follows below):


$ echo 1 |sudo tee -a attr_allow_any_host > /dev/null

When a target is exported, it is done so with a "unique" NVMe Qualified Name (NQN). The concept is very similar to the iSCSI Qualified Name (IQN). This NQN is what enables other operating systems to import and use the remote NVMe device across a network potentially hosting multiple NVMe devices.
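
For reference, here is a minimal sketch of the more restrictive setup mentioned above, run from the same nvmet-test subsystem directory you changed into earlier: rather than allowing any host, you register the initiator's host NQN with the target and link it to the subsystem. The host NQN shown below is hypothetical; on the initiator, nvme-cli typically keeps the real one in /etc/nvme/hostnqn (or you can generate one with nvme gen-hostnqn), and the initiator must then connect using that same NQN:


$ echo 0 |sudo tee -a attr_allow_any_host > /dev/null
$ sudo mkdir /sys/kernel/config/nvmet/hosts/nqn.2014-08.org.nvmexpress:uuid:example-host
$ sudo ln -s /sys/kernel/config/nvmet/hosts/nqn.2014-08.org.nvmexpress:uuid:example-host \
       allowed_hosts/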

Define a subsystem namespace and change into its directory:


$ sudo mkdir namespaces/1
$ cd namespaces/1/

Set a local NVMe device to the newly created namespace:


$ echo -n /dev/nvme0n1 |sudo tee -a device_path > /dev/null

And enable the namespace:


$ echo 1|sudo tee -a enable > /dev/null
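
At this point, you can sanity-check what you've configured so far by walking the NVMe Target configfs tree (a quick look, assuming the tree utility is installed on the target):


$ sudo tree /sys/kernel/config/nvmet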

Now, you'll create an NVMe Target port to export the newly created subsystem and change into its directory path:


$ sudo mkdir /sys/kernel/config/nvmet/ports/1
$ cd /sys/kernel/config/nvmet/ports/1

You'll use the IP address of your preferred Ethernet interface (for example, eth0) as the transport address when exporting your subsystem:


$ echo 192.168.1.92 |sudo tee -a addr_traddr > /dev/null

Then, you'll set the remaining transport parameters: the transport type (tcp), the transport service ID (4420, the port number conventionally assigned to NVMe over Fabrics) and the address family (ipv4):


$ echo tcp|sudo tee -a addr_trtype > /dev/null
$ echo 4420|sudo tee -a addr_trsvcid > /dev/null
$ echo ipv4|sudo tee -a addr_adrfam > /dev/null

And create a softlink to point to the subsystem from your newly created port:


$ sudo ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test/
 ↪/sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test

You now should see the following message captured in dmesg:


$ dmesg |grep "nvmet_tcp"
[24457.458325] nvmet_tcp: enabling port 1 (192.168.1.92:4420)

Importing a Target

The initiator (the host machine importing the remote volume) currently has no NVMe devices:


$ nvme list
Node      SN           Model                    Namespace
--------- ------------ ------------------------ ---------

Usage          Format           FW Rev
-------------- ---------------- --------

[Note: the tabular output above has been modified for readability.]

Scan your target machine for any exported NVMe volumes:


$ sudo nvme discover -t tcp -a 192.168.1.92 -s 4420

Discovery Log Number of Records 1, Generation counter 1
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified, sq flow control disable supported
portid:  1
trsvcid: 4420
subnqn:  nvmet-test
traddr:  192.168.1.92
sectype: none

It must be your lucky day. It looks as if the target machine is exporting one or more volumes. You'll need to remember its subnqn field: nvmet-test. Now connect to the subnqn:


$ sudo nvme connect -t tcp -n nvmet-test -a 192.168.1.92 -s 4420

If you go back to list all NVMe devices, you now should see all those exported by that one subnqn:


$ sudo nvme list
Node             SN                   Model
---------------- -------------------- ------------------------
/dev/nvme1n1     8e0999a558e17818     Linux


Namespace Usage                   Format           FW Rev
--------- ----------------------- ---------------- --------
1         2.15  GB /   2.15  GB   512   B +  0 B    5.0.0-3

[Note: the tabular output above has been modified for readability.]

Verify that it also shows up like any other block device:


$ cat /proc/partitions |grep nvme
 259        1    2097152 nvme1n1
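
From here, the imported volume can be treated like any locally attached drive. For instance, you could format it, mount it and write to it (a quick sketch; note that mkfs destroys any data already on the device, and be sure to unmount before disconnecting):


$ sudo mkfs.ext4 /dev/nvme1n1
$ sudo mkdir -p /mnt/nvmeof
$ sudo mount /dev/nvme1n1 /mnt/nvmeof
$ df -h /mnt/nvmeof
$ sudo umount /mnt/nvmeof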

You can disconnect from the target device by typing:


$ sudo nvme disconnect -d /dev/nvme1n1
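
Alternatively, you can disconnect every namespace imported from that subsystem by referencing its NQN instead of the device node:


$ sudo nvme disconnect -n nvmet-test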

Summary

There you have it—a remote NVMe block device exported via an NVMe over Fabrics network using TCP. Now you can write to and read from it like any other locally attached high-performance block device. The fact that you now can map the block device over TCP without the additional overhead should and will accelerate adoption of the technology.

Petros Koutoupis, LJ Editor at Large, is currently a senior performance software engineer at Cray for its Lustre High Performance File System division. He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for well over a decade and has helped pioneer the many technologies unleashed in the wild today.
