The Network Block Device

A network block device (NBD) driver makes a remote resource look like a local device in Linux, allowing a cheap and safe real-time mirror to be constructed.

Table 4

The goal of our industrial partners in the driver development was to obtain improvements in areas which are important to an industrial-strength setting (see Table 4).

For example, the original driver broke irretrievably when the network link broke, or when a server or client daemon died. Even if reconnected, an I/O error became visible to the higher layers of, for example, an encompassing RAID stack, which resulted in the NBD component being faulted off-line. We have worked to increase robustness by making externally observable failure harder to induce, letting the driver deal internally with any transient problems. The current drivers log transactions at the block level. A spoiled transaction is “rolled back” and resubmitted by the driver. Doubly completed transactions are not considered an error, the second completion being silently discarded.

A transaction spoils if it takes too long to complete (“request aging”), if its header or data becomes visibly corrupted or if the communication channel errors out between initiation and completion of the transaction. To keep their image of the link state up to date, the drivers send short keep-alives in quiet intervals. Data can still be lost by hitting the power switch, but not by breaking the network and subsequently reconnecting it, nor by taking down only one or two of the several communication channels in the current drivers. The drivers have designed-in redundancy. Redundancy here means the immediate failover of a channel to a reserve and provision for the multiple transmission and reception of the same request.

There are many ways in which connectivity can be lost, but differently routed network channels survive at least some of them. The daemon client threads (one per channel) demand-multiplex the communications among themselves according to demand and their own availability (kernel-based round-robin multiplexing has also been tried). Under demand-multiplexing, the fastest working client threads bid for and get most of the work, and a stopped or non-working thread defaults its share of the workload to the working threads. In conjunction with journaled transactions and request aging, this seems very effective. Channels automatically “failover” to a working alternative.

Speed is also an important selling point in industry. Our client-daemon threads are fully asynchronous for maximum speed. They are not synchronized via locks or semaphores inside the kernel (apart from the one semaphore atomicising accesses to the NBD major's request queue). This means kernel requests may be treated out of order at various stages of their lifetime, so the driver has had to be revised carefully to meet this requirement. In particular, the standard Linux kernel singly linked request list has been converted to a doubly linked list, appropriating an otherwise unused pointer field. The driver can then “walk the queue” more easily to search for a match in the out-of-order case. The kernel really should implement a doubly linked request-queue structure. The main brake on speed is probably the indiscriminate wakeup of all client threads on arrival of one kernel request, but when the requests come in faster than the daemons can send them, that overhead disappears.

Although failover of a network channel may be unobserved from outside the driver, it will degrade communications in some senses of the word. So, our current daemon implementations reconnect and renegotiate in an attempt to restore a failed channel. If the daemons themselves die, then they are automatically restarted by a guardian daemon, and the kernel driver will allow them to replace the missing socket connection once the authentication protocol is repeated successfully. Upon reconnection, any pending read/write requests unblock if they have not already been rolled back and resubmitted via another channel. If the device is part of a RAID stack, the outage is never noticed.

Connections between remote and local machines, when multiplexed over several ports as in our drivers, can be physically routed across two different network interfaces, doubling the available bandwidth. Routing across n NICs multiplies available bandwidth by n until the CPU is saturated. Since streaming through NBD on a single 3c905 NIC loads a Pentium 200MHz CPU to about 15% on a 100BT network, a factor of about four to five in increased bandwidth over a single NIC may be available (we have not been able to test for the saturation level, lacking sufficient PCI slots to do so).

Another important industrial requirement is absolute security of the communication channels. We have chosen to pass all communication through an external SSL connection, as it hives off the security aspect to the SSL implementation which we trust. That decision required us to move the networking code entirely out of the kernel and into the client daemons, which are linked with the openSSL code. Moving the code user-side slows the protocol, and it is slowed still further by the SSL layers. The fastest (non-trivial) SSL algorithm drops throughput by 50%, the slowest by 80%. An alternative is to pass the communications link in-kernel through a cipe tunnel, or to site either the server or client daemon on an encrypted file system, but we have not experimented with these possibilities.

SSL offers us the mechanism with the best understood security implications and we have plumped for it. As fallback, a primitive authentication protocol is provided in the driver itself—tokens are exchanged on first connection, and reconnects require the tokens to be presented again. The daemons wrap this exchange and the whole session in the openSSL authentication mechanisms, so it is hardly ever required. Compressing the data as it passes across the link would also provide greater bandwidth (as well as perhaps security), but so far we have not implemented it. It can be applied in the user-side daemon codes.

Whatever limits may or may not be present in the rest of the code, the server daemons are constrained on 32-bit architectures by the 32-bit implementation of EXT2 and other file systems. They can seek up to only a 31-bit (32-bit) offset, which limits the size of the served resource to 2GB (4GB). Our server daemons make RAID linear or striping mode available, however. In these modes, they serve from two or more physical resources, each of which may be up to 2GB in size, to make a total resource size available that breaks the 32-bit barrier by any amount required. Striping seems particularly effective in terms of increased bandwidth, for unknown reasons. The daemons can also serve directly from a raw block device. In those circumstances, it is not known to us if there is a 32-bit limitation in the kernel code—presumably not.

Listing 1

A /proc/nbdinfo entry provides us with details of the driver state, the number of requests handled, errored and so on. The page is read-only (see Listing 1).

Adjustments to default driver parameters, such as the interval between keep-alives and the number of blocks read-ahead, are made via module options on loading the driver. There is currently no mechanism to change parameters at other times. A writable \proc entry or a serial control device is probably desirable. Some degree of control is vested in the client guardian daemon and subordinate client daemons, which send special ioctls to the device on receipt of particular signals. The USR2 signal, for example, triggers an ioctl that errors out all remaining requests, ejects client-daemon threads from the kernel and puts the driver in a state in which the module code can be safely removed from the running kernel. It is, of course, a mechanism intended to aid debugging. The disadvantage of this kind of approach is that it requires a client daemon to be running before control can be exercised. In the future, we intend to implement the /proc-based writable interface. The use of fake partition information in the device is also an appealing route towards obtaining better reports and increased control.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

throughput and availability

Anonymous's picture

Sorry, got thrown off by those dd 's. Reading the summary, it is clear: you actually mount /mnt/remote, it doesn't matter whatever it is. But then, loks like you need to know the filesystem, when you mount it on /nd0 on the client!

throughput and availability

Anonymous's picture

"so that we can route through a second network card on both machines and thus double the available bandwidth through our switched network."

Only if you are limited by the throughput of your NIC (say your network card is an old 10MBit/sec and your network can do 100MBit or better; otherwise "switched" means switched: you wait on one another, and what you gain is just this synchronization overhead.

The other thing that may be missing: how do you "publish" a partition you already have? I did not get it: the "remote" resource has to be dedicated only to the NBD, or can it be still available locally on the server? In other words, say, can I "publish" the /home directory ? Does it have to be a standalone partition then? or may I just publish some part of it, like through NFS or samba?

hi!!!I hope i can learn more

aineed's picture

hi!!!I hope i can learn more about operating system..

The NBD document is really

Anonymous's picture

The NBD document is really good.
But I have some doubts. Can you help me in understanding ,how NBD is different from iSCSI device?
Which will be better for performance NBD or iSCSI?