In a world that appears to be governed by Murphy's Law, anything can go wrong. Accidents ranging from machine crashes, media failures, operator errors and random data corruption to such catastrophes as floods and earthquakes can result in lost data, either temporarily or permanently. The risk of data accidents cannot be eliminated, so one must plan to minimize it. Several well-known techniques help with this.
Users can tolerate varying intervals of loss of data availability. An e-business may not tolerate more than a few minutes of down time, but a user may tolerate lack of access to stored clip art for a week. Data becomes unavailable when any critical component fails, such as power, processor, memory, or disk. If a copy of the data is available, it may be possible to restore access to data from that copy. If the copy is off-line, such as on tape, it may take several minutes or hours to restore it to disk. If the copy is on disk, access can be failed over in seconds or minutes. If no data copy is available, and it is not possible to reconstruct the data, we have permanent data loss.
The key to protecting data is to have more than one copy. As data keeps changing, the copy must change accordingly. Disk mirroring is a technique that keeps the data copy always up to date. Mirroring works well if both disks are equally fast. If one disk is connected over a network, however, the application slows down because it must wait until the data updates are completed on both disks.
Keeping a copy at a distance somewhere off-site helps data survive accidents that cause large-scale damage. This technique is called replication, discussed next.
Replication is a technique of maintaining identical physical copies, or replicas, of a master set of data at two or more geographically separate locations. The most common data replication techniques are point-in-time copies and real-time copies. Point-in-time copies involve capturing snapshots of any critical data and storing them safely at a remote location. Snapshots can be taken on tape, and in the event of a disaster, data is restored from the tape. This technique has several drawbacks, though. The recovered data typically is 24 hours to a few days old. Also, both snapshot capture and recovery take a long time, causing longer application outages.
Real-time copy propagates updates to the copy immediately or soon after they are applied to the original set of data. These techniques fall into two categories.
Synchronous Replication: Every write is duplicated over several disks or volumes, and the original write is blocked until all updates have been completed successfully. Therefore, the performance impact on the application grows with network latency and distance. It thus can be used only over small distances and is practical only for fast and reliable local networks. This practice commonly is employed by enterprises that cannot afford any data loss, such as banks and stock exchanges.
Asynchronous Replication: Snapshots and synchronous replication fall on opposite ends of the replication time versus recovery time spectrum. Asynchronous replication compromises some timeliness of data for higher performance and minimal application impact. It decouples application writes from replication writes. Application writes return soon after the replicator has logged them. The main advantage is the imperceptible impact on the application performance; hence, replication can take place over larger distances. A major challenge in asynchronous replication, though, is preserving write sequencing, known as the write order fidelity problem.
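The trade-off between the two modes can be made concrete with a toy latency model. The sketch below is only an illustration: the function names and every timing figure are assumptions chosen for the example, not measurements of any product, and it assumes the local write and the remote round trip happen one after the other.

```python
def sync_write_latency_ms(local_ms, round_trip_ms):
    """Synchronous mode: the write returns only after the remote copy acknowledges it."""
    return local_ms + round_trip_ms


def async_write_latency_ms(local_ms, log_ms):
    """Asynchronous mode: the write returns once the replicator has logged it locally."""
    return local_ms + log_ms


# Assumed, illustrative figures (milliseconds); not taken from any benchmark.
LOCAL_MS = 5.0
LOG_MS = 0.1

for link, round_trip_ms in [("local network", 0.5), ("long-distance link", 80.0)]:
    sync_ms = sync_write_latency_ms(LOCAL_MS, round_trip_ms)
    async_ms = async_write_latency_ms(LOCAL_MS, LOG_MS)
    print(f"{link}: synchronous {sync_ms:.1f} ms, asynchronous {async_ms:.1f} ms")
```

In this model the synchronous figure grows with the round-trip time of the link, while the asynchronous figure does not depend on it at all, which is why asynchronous replication suits larger distances.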
Pratima (meaning reflection or image in Sanskrit) provides block-level, real-time replication of one or more block devices on a client computer. The devices are replicated to a server computer. A local device (say /dev/sda4) is placed under control of Pratima, which then offers access through its own block device (say /dev/srr0). In addition to hard disk partitions, any underlying block device, such as a logical volume manager (LVM) device, can be replicated.
Pratima provides methods for initial synchronization, fast on-line resynchronization and automatic reconnection. The product also supports chaining for higher flexibility and reliability.
Pratima software components run on both client and server computers. A Pratima device driver captures updates on the client computer, and a daemon on the server computer receives replication data over the network and writes it to the replica devices.
The client module is a stacked device driver interposed below the filesystem and above the storage device driver. The driver exports a block interface and can be accessed through system calls, including open, close, read, write and stat. Additionally, it supports ioctls for such control operations as enable, disable, clean, kill, reconnect and status.
Figure 1: Client Side Data Flow for Asynchronous Replication Mode
The server-side listener module is a user-space daemon that passively waits for client-side packets to arrive. These packets correspond to the different system calls and ioctls the client interface supports. For example, an enable ioctl corresponds to an enable packet; upon receiving one, the server enables the remote volume for replication. Similarly, when a write packet arrives, the server writes its data to the remote volume and returns the status of the operation to the client.
Figure 2: Server Side Data Flow
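The packet handling on the server can be pictured as a small dispatch routine. The following is a minimal sketch, not Pratima's actual code: the ReplicaServer class, the dictionary-based packet layout, the 512-byte block size and the in-memory volume are all assumptions made for illustration.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class ReplicaServer:
    """Hypothetical stand-in for the server daemon; the replica is an in-memory buffer."""

    def __init__(self, num_blocks):
        self.volume = bytearray(num_blocks * BLOCK_SIZE)
        self.enabled = False

    def handle(self, packet):
        """Dispatch one client packet and return a status reply."""
        kind = packet["type"]
        if kind == "enable":
            # Enable the remote volume for replication.
            self.enabled = True
            return {"ok": True}
        if kind == "write":
            if not self.enabled:
                return {"ok": False, "error": "replication not enabled"}
            block, data = packet["block"], packet["data"]
            offset = block * BLOCK_SIZE
            if block < 0 or len(data) != BLOCK_SIZE or offset + BLOCK_SIZE > len(self.volume):
                return {"ok": False, "error": "bad write request"}
            # Write the block to the replica and report success to the client.
            self.volume[offset:offset + BLOCK_SIZE] = data
            return {"ok": True}
        return {"ok": False, "error": "unknown packet type"}


server = ReplicaServer(num_blocks=8)
print(server.handle({"type": "enable"}))
print(server.handle({"type": "write", "block": 2, "data": b"x" * BLOCK_SIZE}))
```

A real daemon would read these packets from a network socket; the dictionary form is used here only to keep the dispatch logic visible.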
A filesystem may be mounted on a replicator device and initially synchronized with the remote volume. Once the local and remote volumes are in sync, reads and writes are directed to the replicator device driver. Reads are transparent: the driver passes them straight to the underlying device driver. Incoming writes, on the other hand, are bifurcated. The block number is recorded in memory, and the write is passed to the underlying device driver; if the local write succeeds, the data is queued and sent to the remote server.
The design for Pratima had to address several interesting issues, which are described below.
1. Write Order Fidelity
Write order fidelity means the writes on the replicated device must be applied in exactly the same order as on the original device. If this ordering is not preserved, the replica may not be usable. FIFO queues containing private data buffers had to be used to provide write order fidelity.
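The idea of a FIFO queue holding private data buffers can be sketched as follows. This is an illustrative user-space model, not the kernel driver: the WriteQueue class and its method names are assumptions, and a dictionary stands in for the replica device.

```python
from collections import deque


class WriteQueue:
    """FIFO of (block number, private copy of data) pairs awaiting replication."""

    def __init__(self):
        self._fifo = deque()

    def enqueue(self, block, buf):
        # bytes(buf) takes a private copy, so the application may reuse its
        # buffer immediately without changing what is sent to the replica.
        self._fifo.append((block, bytes(buf)))

    def drain(self, replica):
        """Apply queued writes to the replica in arrival order; return what was applied."""
        applied = []
        while self._fifo:
            block, data = self._fifo.popleft()
            replica[block] = data
            applied.append((block, data))
        return applied


queue = WriteQueue()
buf = bytearray(b"old")
queue.enqueue(7, buf)
buf[:] = b"new"          # the application overwrites its own buffer
queue.enqueue(7, buf)

replica = {}
print(queue.drain(replica))
print(replica[7])
```

Because the two writes to block 7 are replayed in the order they were issued, the replica ends up with the later contents, just as the original device does.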
2. Block Number Logging for Fast Resynchronization
What if the network or server fails for a while, but the client computer is functional? The Pratima driver queues some number of block writes, but if the buffers cannot be flushed out to the replica, the queue fills up. Now, it is not desirable to block the application until the server becomes accessible. The driver gives up at this point, allowing the replica to go out of sync.
Bringing the replica back in sync can be painful and generally requires stopping the application. Block number logging can be used to speed up resynchronization. All the block numbers of blocks to be written are logged to disk. Then, resynchronization is accomplished quickly by replicating only the logged blocks. However, logging block numbers consumes local device bandwidth.
My solution is based on the reasonable assumption that the client machine never undergoes transient failures. This solution uses an in-memory list called the block write table (BWT), in which only the block numbers for all in-flight writes are stored. Thus, if a network outage causes the queue to overflow and lose write data, we can read these blocks from the local volume and replicate them as soon as possible.
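A block write table of this kind might be modeled as shown below. The sketch makes several assumptions that are not stated in the article: the Replicator class and its method names are hypothetical, dictionaries stand in for the local and replica volumes, the queue is discarded as soon as it overflows, and the table is cleared once every queued write has been delivered.

```python
from collections import deque


class Replicator:
    """Toy model of a bounded write queue backed by a block write table (BWT)."""

    def __init__(self, local, queue_limit):
        self.local = local              # dict: block number -> data (the local volume)
        self.queue = deque()            # pending (block, data) writes, in order
        self.queue_limit = queue_limit
        self.bwt = set()                # block numbers of writes not yet on the replica
        self.overflowed = False

    def write(self, block, data):
        self.local[block] = data
        self.bwt.add(block)             # only the block number is remembered here
        if not self.overflowed and len(self.queue) < self.queue_limit:
            self.queue.append((block, data))
        else:
            # Queue is full: stop queuing data rather than block the application.
            self.overflowed = True
            self.queue.clear()

    def flush(self, replica):
        """Normal path: deliver queued writes in order, then forget their blocks."""
        while self.queue:
            block, data = self.queue.popleft()
            replica[block] = data
        if not self.overflowed:
            self.bwt.clear()

    def resync(self, replica):
        """After an outage: re-read every logged block from the local volume."""
        for block in sorted(self.bwt):
            replica[block] = self.local[block]
        self.bwt.clear()
        self.queue.clear()
        self.overflowed = False


local, replica = {}, {}
rep = Replicator(local, queue_limit=2)
for block in (10, 11, 12):              # the third write overflows the queue
    rep.write(block, b"data%d" % block)
rep.resync(replica)
print(replica == local)
```

Only block numbers survive the overflow, so the table stays small; the data itself is recovered from the local volume during resynchronization.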
3. Recovery and Fail Over
If the client machine crashes, we lose the queue and the block write table. The client has to undergo a complete synchronization to make all replicas consistent.
If the server machine or the network suffers a transient failure, we then can use the block write table (BWT) on the client side for resynchronization. However, if the server or network outage is long enough to overflow the BWT, the situation cannot be saved. Complete synchronization is required before replication can be restarted.
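These recovery rules amount to a small decision table, sketched below. The function name, its parameters and the returned strings are hypothetical labels chosen for illustration.

```python
def recovery_action(client_crashed: bool, bwt_overflowed: bool) -> str:
    """Pick the recovery path described above for a given failure."""
    if client_crashed:
        # The in-memory queue and block write table are gone.
        return "complete synchronization"
    if bwt_overflowed:
        # The outage outlasted the table, so the changed blocks are unknown.
        return "complete synchronization"
    # Transient server or network failure: replay only the logged blocks.
    return "resynchronize from BWT"


for crashed, overflowed in [(False, False), (False, True), (True, False)]:
    print(crashed, overflowed, "->", recovery_action(crashed, overflowed))
```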
I now describe how Pratima should be used for replicating your valuable data. Before replication begins, the source and target data objects must be identical. Making source and target objects identical is called initial synchronization. Where large amounts of data are replicated--hundreds or thousands of gigabytes--initial synchronization is a significant problem. Steady-state data update rate may be relatively low--a few tens of kilobytes per second or less--but a large amount of data may be lost.
There are many ways to handle this synchronization, as outlined below:
1. Read every block from local device and write it to a remote replica. This can be done intelligently by checking which blocks need to be written and writing only those.
2. Back up the entire volume to tape, ship the tape to the remote site and restore it on the remote server.
3. Back up the entire volume and restore it locally, using Pratima to replicate these restoration writes.
4. If the system administrator is planning to make a filesystem using the mkfs utility on a raw local volume, then when using a Pratima Device, the writes of the mkfs also are replicated; hence, no special synchronization is necessary.
5. The synchronization utility provided in the package provides two modes of initial synchronization: dumb and intelligent.
Dumb resynchronization involves blindly copying all the block data on the local device and transferring it to the remote device over the network. The time necessary for completion is directly proportional to the size of the local volume.
Intelligent full resynchronization involves checking which blocks of data actually need to be written. This is done by reading a chunk of data, calculating its 32-bit CRC checksum and asking the server whether the chunk needs to be replicated. This technique is faster when the local volume and the replica differ only slightly.
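The checksum comparison behind the intelligent mode can be sketched in a few lines. This is a simplified, single-process illustration: the function name and the 4096-byte chunk size are assumptions, both volumes are in-memory buffers of equal length, and the server's side of the query is simulated by computing the replica's CRC directly.

```python
import zlib

CHUNK = 4096  # assumed chunk size for this sketch


def intelligent_sync(local: bytes, replica: bytearray, chunk: int = CHUNK) -> int:
    """Copy only the chunks whose CRC-32 differs; return how many chunks were sent."""
    if len(local) != len(replica):
        raise ValueError("volumes must be the same size")
    sent = 0
    for offset in range(0, len(local), chunk):
        piece = local[offset:offset + chunk]
        # In a real setup the server would compute this CRC over its own copy
        # and answer the client's query; here both sides run in one process.
        remote_crc = zlib.crc32(bytes(replica[offset:offset + chunk]))
        if zlib.crc32(piece) != remote_crc:
            replica[offset:offset + chunk] = piece
            sent += 1
    return sent


source = bytearray(3 * CHUNK)
source[CHUNK + 10] = 0xFF               # change one byte in the middle chunk
target = bytearray(3 * CHUNK)
print(intelligent_sync(bytes(source), target), "chunk(s) transferred")
```

Only a checksum crosses the network for unchanged chunks, which is where the saving comes from. Note that a 32-bit CRC can, in rare cases, match for two different chunks, so a scheme like this trades a small risk of a missed difference for speed.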
There is a single configuration file on the client machine for all devices that are to be replicated. Entries in the file set up associations between source devices on the client (say /dev/sda4), the Pratima devices (say /dev/srr0) and the targets on the server machines. Tuning parameters also are defined in the configuration file.
When the association between local and remote volumes is set up through an enable utility, you can mount a file system on the Pratima device (/dev/srr). For example,
# mount /dev/srr3 /mnt/my-home
replicates writes coming to /mnt/my-home to the remote device associated with /dev/srr3.
Pratima comes with a graphical user interface for administering and monitoring all replicator devices. It provides up-to-date information and statistics for a selected device, such as state, local and remote volumes, queue size, block table size and I/O stats.
Figure 3 shows the Pratima administration window. Notice the status of all the devices shown in the QuickView panel.
Figure 3: Screenshot of the Graphical User Interface
Write performance of three different types of client block devices was measured with and without Pratima replication for asynchronous mode. The following test program generated sequential writes and measured the total time taken:
Root #> time dd if=/dev/input of=/dev/output bs=1024k count=1000
The running times of the command were recorded for different test cases, averaged and converted to MBps. The results are charted in Figure 4:
Figure 4: Write Throughput of Pratima in Asynchronous Mode
It is apparent that replication in asynchronous mode causes only a small loss in write throughput.
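The conversion from recorded run times to throughput can be sketched as follows. The helper name is hypothetical, the run times are made-up sample values rather than the measurements behind Figure 4, and one megabyte is assumed to mean 1024 x 1024 bytes, matching the bs=1024k block size of the dd command.

```python
from statistics import mean


def throughput_mb_per_s(run_times_s, block_bytes=1024 * 1024, count=1000):
    """Average the run times of a dd invocation and convert to megabytes per second."""
    total_mb = block_bytes * count / (1024 * 1024)   # dd wrote bs * count bytes
    return total_mb / mean(run_times_s)


# Hypothetical run times in seconds for the same test case.
sample_runs = [20.0, 30.0]
print(f"{throughput_mb_per_s(sample_runs):.1f} MBps")
```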
A network block device (NBD) makes a remote resource look like a local device, allowing a cheap, safe and real-time mirror to be constructed. One can implement a remote replicator simply by making an NBD device part of a mirrored volume to get real-time remote mirroring.
Comparison of NBD and Pratima
| Network Block Device (NBD) | Pratima Asynchronous Remote Volume Replicator for Linux |
|---|---|
| Supports only synchronous mode | Supports both synchronous and asynchronous modes |
| Must be used in conjunction with a Logical Volume Manager | Can work with any block device, such as raw volumes, RAID and devices created by a Logical Volume Manager |
| Provides no methods to synchronize the data at the secondary location | Provides intelligent and dumb synchronization modes, along with many other initial synchronization mechanisms |
| No failover or resync mechanisms | Recovery and failover are very simple, with fast on-line resync |
Several enhancements were considered but were not implemented in the product; they are described briefly here. Pratima currently supports only point-to-point replication, but it could be enhanced to support fan-out replicas. The local device must be unmounted during full synchronization. This can be avoided by fencing areas of the device during synchronization. The current configuration file format is simple but inflexible. It could be enhanced for better readability and flexibility.
Pratima is an open-source project available freely under the GNU General Public License (GPL). It involves kernel-mode components for high performance. It can be controlled from both the command line and a graphical interface. It contains complete documentation in the form of man pages. Considering that Pratima was created as a graduate student project, it is fairly complete, although it has not been torture tested or run on high-performance servers.
Sandeep Ranade (www.geocities.com/sandeep_d_ranade) is a 22-year-old computer engineer and software developer working in filesystems and storage technologies for Calsoft Private Limited.