The Network Block Device

A network block device (NBD) driver makes a remote resource look like a local device in Linux, allowing a cheap and safe real-time mirror to be constructed.
Setting up a Connection

Steps in Setting Up an NBD-Based File System

The five steps required to create a file system mounted on a remote network device are outlined in Figure 4. For example, the following sequence of commands creates an approximately 16MB file on a local file system on the server, then launches the server to serve it on port 1077. Steps 1 and 2 are

dd if=/dev/zero of=/mnt/remote bs=1024 count=16000
nbd-server 1077 /mnt/remote

On the client side, the driver module must be loaded into the kernel, and the client daemon started. The client daemon needs the server machine address (192.168.1.2), port number and the name of the special device file that will be the NBD. In the original driver, this is called /dev/nd0. Step 3 is

insmod nbd.o
nbd-client 192.168.1.2 1077 /dev/nd0
A file system can then be created on the NBD, and the file system mounted locally with Steps 4 and 5:
mke2fs /dev/nd0
mount -t ext2 /dev/nd0 /mnt
In our current drivers, multiple ports and addresses are allowed, so that redundant connections can be established. Here, the server offers several ports instead of one:
dd if=/dev/zero of=/mnt/remote bs=1024 count=16000
nbd-server 1077 1078 1079 1080 /mnt/remote
The current client can use all of these ports to reach the server. Here we direct two of them to a second IP address (192.168.2.2) on the server, so that we can route through a second network card on both machines and thus double the available bandwidth through our switched network.
insmod nbd.o
nbd-client 192.168.1.2 1077 1078 192.168.2.2 1079 1080 /dev/nda
In the current drivers, the NBD presents itself as a partitioned block device nda, although the “partitions” are not used in a standard way. Their device files nda1, nda2 and so on are used as kernel communication channels by the subordinate client daemons. They provide the redundancy and increased bandwidth in the device. The whole-device file nda is the only one that accepts the standard block-device operations.

Driver and Protocol Overview

On insertion of the kernel module, the driver registers with the kernel. As the client daemon connects for the first time to its server counterpart, the original driver hands the file descriptor of the socket to the kernel. The kernel traces the descriptor back to its internal socket structure and registers the memory address in its own data structures for subsequent use. Our current drivers keep the networking in user space and do not register the socket.

The client daemons and server daemons then perform a handshake routine. No other setup is required, but the handshake may establish an SSL channel in the current generation of drivers, which requires SSL certificates and requests to be set up beforehand.

Pavel's original driver code comprised two major threads within the kernel. The “client” thread belongs to the client daemon. The daemon's job is to initiate the network connection with the server daemon on the remote machine and hand the socket it has opened down to the kernel via an ioctl call. The client daemon then sits blocked user-side in that ioctl call while its thread of execution continues forevermore within the kernel, where it loops continuously, transferring data across the network socket. Terminating the daemon requires terminating the socket too, or the client daemon will remain stuck in the loop inside the kernel ioctl.
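In outline, the original client daemon's hand-off to the kernel looks something like the following minimal sketch. It assumes the NBD_SET_SOCK, NBD_DO_IT and NBD_CLEAR_SOCK ioctls declared in linux/nbd.h; error checking and the handshake with the server daemon are omitted.

/* Minimal sketch of the original client daemon's kernel hand-off.
 * Error checking and the handshake with the server are omitted. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/nbd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { .sin_family = AF_INET,
                              .sin_port   = htons(1077) };
    inet_pton(AF_INET, "192.168.1.2", &sa.sin_addr);
    connect(sock, (struct sockaddr *)&sa, sizeof sa);
    /* ... handshake with the server daemon goes here ... */

    int nbd = open("/dev/nd0", O_RDWR);
    ioctl(nbd, NBD_SET_SOCK, sock);  /* register the socket with the kernel */
    ioctl(nbd, NBD_DO_IT);           /* blocks: this thread now loops in-kernel */

    /* reached only once the socket has been torn down */
    ioctl(nbd, NBD_CLEAR_SOCK);
    close(nbd);
    close(sock);
    return 0;
}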

A “kernel” thread enters the driver sporadically as a result of pressures on the local machine. Imagine that echo hello >! /dev/nd0 is executed (the block-device names for the original driver are nd0-127, and they take major number 43). The echo process will enter the kernel through the block-device layers, culminating in a call to the registered block-device request handler for a write to the device. The kernel handler for NBD is the function nbd_request. Like all block-device request handlers, nbd_request performs a continuous loop while(req = CURRENT), CURRENT being the kernel macro that expands to the request at the head of the device's queue (here, the write request struct). After treating the request, the driver moves the pointer on with CURRENT = CURRENT->next and loops.

The kernel thread's task is to do the following:

  • Link the request req = CURRENT to the front of the pending transfer list.

  • Embed a unique identifier and emit a copy across the network to the server daemon at the other side of the network socket.

The unique identifier is the memory address of the request req. It is unique only while the request has not yet been serviced, but that is good enough. (When the driver used to crash through the mysterious corruption we were never able to pin down, the crash was often associated with duplicated entries and a consequently circular list, which may be a clue.)
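In outline, the handler loop that does this looks something like the following simplified sketch, in the style of the old block layer. The helper names nbd_queue_request and nbd_send_request are illustrative rather than the driver's actual function names.

/* Simplified sketch of the request handler; locking, error paths and
 * sector arithmetic are omitted, and the helper names are illustrative. */
static void nbd_request(void)
{
    struct request *req;

    while ((req = CURRENT) != NULL) {
        nbd_queue_request(req);    /* link req at the head of the pending-transfer list */
        nbd_send_request(req);     /* send a copy, tagged with req's address, down the socket */
        CURRENT = CURRENT->next;   /* move the pointer on and loop */
    }
}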

On the other side of the network, the server daemon receives a write request, writes “hello” to its local resource, and transmits an acknowledgement to the client containing the unique identifier of the request.

The client daemon thread on the local machine is in its loop, blocked inside the kernel on a read from the socket, waiting for data to appear. Its task is now to do the following:

  • Recognize the unique identifier in the acknowledgement, comparing it with the oldest (last) element req in the linked list of partially completed requests.

  • Unlink the request req from the list of incompletes, and tell the block layers to discard the structure via a call to end_request.

This protocol requires that the acknowledgement received be for the request pending on the tail of the driver's internal list, while new requests from the kernel are added to the head. TCP can guarantee this because of the sequential nature of the TCP stream. Even a single missed packet will break the current driver, but it will also mean the TCP socket is broken. The socket will return an error in this situation. That error message allows the driver to disengage gracefully.
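The matching reply side, executed by the client daemon's thread from inside the kernel, can be sketched as follows. The helper names are again illustrative, and the reply header is the one described with Table 3 below.

/* Sketch of the in-kernel reply loop; helper names are illustrative. */
for (;;) {
    struct nbd_reply reply;                          /* magic, error, 64-bit handle */

    if (nbd_receive(sock, &reply, sizeof reply) < 0)
        break;                                       /* broken socket: disengage gracefully */

    struct request *req = oldest_pending();          /* tail of the pending-transfer list */
    if (memcmp(reply.handle, &req, sizeof req) != 0)
        break;                                       /* ack is not for the oldest request */

    unlink_pending(req);                             /* remove it from the list of incompletes */
    nbd_end_request(req);                            /* tell the block layers it is finished */
}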

Figure 4.

Kernel networking vs. user-space networking in an NBD. User-space networking requires an extra copy and other overheads, but affords much greater flexibility. The overhead can be offset by transferring multiple requests at a time.

The client-side control flow in the original kernel driver is shown schematically on the left side of Figure 4. The black rectangle represents a request. It is linked into the device's request queue by a kernel thread and is then swept up in the client daemon's perpetual loop within the kernel. The client thread performs its networking within the kernel. In the drivers we have subsequently developed, we have come to favour user-side networking, in which the client thread deals user-side with a copy of the request transferred from within the kernel. It dives repeatedly into the kernel to copy across the data, then transmits it in standard network code. The overheads are much greater, but so is the flexibility. The overhead can be ameliorated by transferring multiple requests across at a time, and our current drivers do this. Normally, 10 to 20 requests of one block each will be transferred in each visit to the kernel. The cost of copying between kernel and user space cannot, however, be avoided.

Multiple client daemons contend for the kernel requests as the clients become free, transferring them across the network through possibly distinct routes and physical devices. The situation is depicted in Figure 5. Each client daemon handles one channel, but will mediate any request. The channels provide redundancy, resilience and bandwidth.
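The user-space pattern can be caricatured as follows. The GET_REQUESTS and PUT_REPLIES ioctls and the reqbatch structure are purely hypothetical stand-ins for whatever interface a particular driver exports; only the overall shape, one kernel visit per batch of requests followed by ordinary socket I/O, follows the description above.

/* Caricature of a user-space client daemon loop.  GET_REQUESTS,
 * PUT_REPLIES and struct reqbatch are hypothetical stand-ins, not a
 * real driver interface. */
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct reqbatch {
    int  count;                    /* typically 10 to 20 one-block requests */
    int  bytes;                    /* bytes of headers and data in payload[] */
    char payload[32 * 1024];
};

#define GET_REQUESTS _IOR('n', 100, struct reqbatch)   /* hypothetical */
#define PUT_REPLIES  _IOW('n', 101, struct reqbatch)   /* hypothetical */

void client_loop(int nbd_fd, int sock)
{
    struct reqbatch out, in;

    for (;;) {
        if (ioctl(nbd_fd, GET_REQUESTS, &out) < 0)   /* one copy out of the kernel */
            break;
        write(sock, out.payload, out.bytes);         /* ordinary user-space networking */
        in.bytes = read(sock, in.payload, sizeof in.payload);
        ioctl(nbd_fd, PUT_REPLIES, &in);             /* one copy back into the kernel */
    }
}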

Figure 5.

Multiple client daemons capture kernel requests in the current NBD drivers, providing redundancy and load balancing through demand multiplexing across several distinct network channels.

Table 3

The complete data protocol sequence “on the wire” is shown in Table 3. Note that the unique ID is 64 bits wide, so it can hold the request's memory address as the identifier even on a 64-bit architecture. Curiously, the requested data offset and length are 32-bit byte values in the original driver, although they are calculated from sector numbers (sectors are 512 bytes each), which might well have been used instead. This is a hidden 32-bit limitation in the original NBD. Our versions implement 64-bit limits even on a 32-bit file system or machine architecture, and the server daemon has been modified to multiplex requests beyond 32 bits among several distinct resource files or devices.
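For reference, the request and reply headers carrying this sequence can be sketched as C structures. The field names and magic constants follow the classic linux/nbd.h, but treat the exact widths as illustrative; as noted above, the effective offset and length limits differed between driver generations. All fields travel in network byte order.

/* Sketch of the on-wire headers; widths are illustrative (see text). */
#include <stdint.h>

#define NBD_REQUEST_MAGIC 0x25609513   /* classic protocol magic numbers */
#define NBD_REPLY_MAGIC   0x67446698

struct nbd_request {        /* client -> server */
    uint32_t magic;         /* NBD_REQUEST_MAGIC */
    uint32_t type;          /* read or write */
    char     handle[8];     /* 64-bit unique identifier (the request's address) */
    uint64_t from;          /* byte offset into the served resource */
    uint32_t len;           /* transfer length in bytes */
};                          /* a write request is followed by len bytes of data */

struct nbd_reply {          /* server -> client */
    uint32_t magic;         /* NBD_REPLY_MAGIC */
    uint32_t error;         /* 0 on success */
    char     handle[8];     /* identifier echoed back from the request */
};                          /* a successful read reply is followed by the data */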

______________________

Comments


throughput and availability

Anonymous

Sorry, got thrown off by those dd's. Reading the summary, it is clear: you actually mount /mnt/remote; it doesn't matter what it is. But then, it looks like you need to know the file system when you mount it on /dev/nd0 on the client!

throughput and availability

Anonymous

"so that we can route through a second network card on both machines and thus double the available bandwidth through our switched network."

Only if you are limited by the throughput of your NIC (say your network card is an old 10Mbit/sec one and your network can do 100Mbit or better); otherwise "switched" means switched: you wait on one another, and what you gain is just this synchronization overhead.

The other thing that may be missing: how do you "publish" a partition you already have? I did not get it: does the "remote" resource have to be dedicated only to the NBD, or can it still be available locally on the server? In other words, can I "publish" the /home directory? Does it have to be a standalone partition, or may I just publish some part of it, as with NFS or Samba?

Hi! I hope I can learn more

aineed

Hi! I hope I can learn more about operating systems.

The NBD document is really good

Anonymous

The NBD document is really good, but I have some doubts. Can you help me understand how NBD differs from an iSCSI device? Which will perform better, NBD or iSCSI?
