In a world that appears to be governed by Murphy's Laws, anything can go wrong. Accidents ranging from machine crashes, media failures, operator errors and random data corruption to such catastrophes as floods and earthquakes can result in lost data, either temporarily or permanently. The risk of data accidents cannot be eliminated, so one must plan to minimize it. Several well-known techniques exist for this purpose.
Users can tolerate varying intervals of loss of data availability. An e-business may not tolerate more than a few minutes of down time, but a user may tolerate lack of access to stored clip art for a week. Data becomes unavailable when any critical component fails, such as power, processor, memory, or disk. If a copy of the data is available, it may be possible to restore access to data from that copy. If the copy is off-line, such as on tape, it may take several minutes or hours to restore it to disk. If the copy is on disk, access can be failed over in seconds or minutes. If no data copy is available, and it is not possible to reconstruct the data, we have permanent data loss.
The key to protecting data is to have more than one copy. As data keeps changing, the copy must change accordingly. Disk mirroring is a technique that keeps the data copy always up to date. Mirroring works well if both disks are equally fast. If one disk is connected over a network, however, the application slows down because it must wait until the data updates are completed on both disks.
Keeping a copy at a distance somewhere off-site helps data survive accidents that cause large-scale damage. This technique is called replication, discussed next.
Replication is a technique of maintaining identical physical copies or replicas of a master set of data at two or more geographically separate locations. The most common data replication techniques are point-in-time copies and real-time copies. Point-in-time copies involve capturing snapshots of any critical data and storing them safely at a remote location. Snapshots can be taken on tape, and in the event of a disaster, data is restored from the tape. This technique has several drawbacks, though: the recovered data typically is 24 hours to a few days old. Also, both capturing a snapshot and recovering from one take a long time, causing longer application outages.
Real-time copy propagates updates to the copy immediately or soon after they are applied to the original set of data. These techniques fall into two categories.
Synchronous Replication: duplicates every write over several disks or volumes and blocks the original write until all updates have been completed successfully. Therefore, the performance impact on the application grows with network latency and distance and shrinks as network speed increases. It thus can be used only over small distances and is practical only for fast and reliable local networks. This practice commonly is employed by enterprises that cannot afford any data loss, such as banks and stock exchanges.
Asynchronous Replication: Snapshots and synchronous replication fall on opposite ends of the replication time versus recovery time spectrum. Asynchronous replication compromises some timeliness of data for higher performance and minimal application impact. It decouples application writes from replication writes. Application writes return soon after the replicator has logged them. The main advantage is the imperceptible impact on the application performance; hence, replication can take place over larger distances. A major challenge in asynchronous replication, though, is preserving write sequencing, also known as the write order fidelity problem.
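The difference between the two real-time modes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `Volume`, `sync_write` and `AsyncReplicator` are hypothetical names, the "network" is a direct method call, and the asynchronous log is drained explicitly rather than by a background thread.

```python
from collections import deque


class Volume:
    """Hypothetical in-memory stand-in for a block device."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data


def sync_write(local, remote, block_no, data):
    """Synchronous mode: return only after both copies are updated."""
    local.write(block_no, data)
    remote.write(block_no, data)  # a real system would wait on the network here


class AsyncReplicator:
    """Asynchronous mode: log the write, return, ship it to the replica later."""

    def __init__(self, local, remote):
        self.local = local
        self.remote = remote
        self.log = deque()  # FIFO log of pending replication writes

    def write(self, block_no, data):
        self.local.write(block_no, data)
        self.log.append((block_no, data))  # the application write returns here

    def drain(self):
        """Send logged writes to the replica in the order they were issued."""
        while self.log:
            block_no, data = self.log.popleft()
            self.remote.write(block_no, data)


local, remote = Volume(), Volume()
rep = AsyncReplicator(local, remote)
rep.write(1, b"a")
rep.write(2, b"b")
lag_before_drain = len(local.blocks) - len(remote.blocks)  # replica is behind
rep.drain()                                                # replica catches up
```

In the synchronous function the caller pays for the remote write on every call; in the asynchronous class the replica simply lags until the log is drained, which is the timeliness traded away for performance.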
Pratima (meaning reflection or image in Sanskrit) provides block-level, real-time replication of one or more block devices on a client computer. The devices are replicated to a server computer. A local device (say /dev/sda4) is placed under control of Pratima, which then offers access through its own block device (say /dev/srr0). In addition to hard disk partitions, any underlying block device, such as a logical volume manager (LVM) device, can be replicated.
Pratima provides methods for initial synchronization, fast on-line resynchronization and automatic reconnection. The product also supports chaining for higher flexibility and reliability.
Pratima software components run on both client and server computers. A Pratima device driver captures updates on the client computer, and a dæmon on the server computer receives replication data over the network and writes it down to replica devices.
The client module is a stacked device driver interposed below the filesystem and above the storage device driver. The driver exports a block interface and can be accessed through system calls, including open, close, read, write and stat. Additionally, it supports ioctls for such control operations as enable, disable, clean, kill, reconnect and status.
Figure 1: Client Side Data Flow for Asynchronous Replication Mode
The server side listener module basically is a user-space daemon that passively waits for client side packets to arrive. These packets correspond to the different system calls and ioctls the client interface supports. For example, an enable ioctl corresponds to an enable packet. Upon receiving one, the server enables the remote volume for replication. Similarly, if a write packet arrives, it writes it down to the remote volume and returns the success status of this operation to the client.
Figure 2: Server Side Data Flow
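The server-side listener's packet handling might be modeled roughly as follows. The packet layout (a dictionary with `type`, `block` and `data` keys), the `ReplicaServer` class and the reply format are all assumptions made for illustration; the actual wire format is not described here.

```python
class ReplicaServer:
    """Hypothetical model of the server-side listener's packet dispatch."""

    def __init__(self):
        self.enabled = False
        self.volume = {}  # remote volume: block number -> data

    def handle(self, packet):
        kind = packet["type"]
        if kind == "enable":
            # An enable packet turns on replication for the remote volume.
            self.enabled = True
            return {"status": "ok"}
        if kind == "write":
            if not self.enabled:
                return {"status": "error", "reason": "replication not enabled"}
            # A write packet is applied to the remote volume and acknowledged.
            self.volume[packet["block"]] = packet["data"]
            return {"status": "ok"}
        return {"status": "error", "reason": "unknown packet type"}


server = ReplicaServer()
early = server.handle({"type": "write", "block": 4, "data": b"x"})
enabled = server.handle({"type": "enable"})
written = server.handle({"type": "write", "block": 4, "data": b"x"})
```

Here the first write is rejected because the volume has not been enabled yet; refusing such writes is a design choice of this sketch, not a documented behavior.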
A filesystem may be mounted on a replicator device and initially synchronized with the remote volume. Once the local and remote volumes are in sync, reads and writes are directed to the replicator device driver. It treats reads as transparent and passes them on to the underlying device driver. Incoming writes, on the other hand, are bifurcated: each one must reach both the local device and the remote server. The block number of the write is recorded in memory, and the write is passed to the underlying driver; if that local write succeeds, the data is queued and sent over to the remote server.
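As a rough user-space model of this data path (the real component is a kernel driver), the following Python sketch passes reads through and splits writes. `LocalDevice` and `ReplicatorDevice` are hypothetical names, and discarding the recorded block number after a failed local write is an assumption of the sketch.

```python
from collections import deque


class LocalDevice:
    """Hypothetical underlying block device with a fixed number of blocks."""

    def __init__(self, size):
        self.size = size
        self.blocks = {}

    def read(self, block_no):
        return self.blocks.get(block_no)

    def write(self, block_no, data):
        if not 0 <= block_no < self.size:
            return False  # simulate a failed local write
        self.blocks[block_no] = data
        return True


class ReplicatorDevice:
    """Stacked device: reads pass through, writes are split two ways."""

    def __init__(self, underlying):
        self.underlying = underlying
        self.recorded = set()    # block numbers noted in memory
        self.outbound = deque()  # writes waiting to go to the remote server

    def read(self, block_no):
        return self.underlying.read(block_no)  # transparent pass-through

    def write(self, block_no, data):
        self.recorded.add(block_no)
        ok = self.underlying.write(block_no, data)
        if ok:
            self.outbound.append((block_no, data))
        else:
            # Assumption: a failed local write leaves nothing to replicate.
            self.recorded.discard(block_no)
        return ok


dev = ReplicatorDevice(LocalDevice(size=8))
first = dev.write(3, b"x")    # succeeds locally, so it is queued
second = dev.write(99, b"y")  # fails locally, so it is never queued
```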
The design for Pratima had to address several interesting issues, which are described below.
1. Write Order Fidelity
Write order fidelity means the writes on the replicated device must be applied in exactly the same order as on the original device. If this ordering is not preserved, the replica may not be usable. FIFO queues containing private data buffers had to be used to provide write order fidelity.
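A small Python example shows why ordering matters and what a private buffer buys. It assumes that "private data buffers" means the queued data is copied at the time of the write; `apply_writes` and the block numbers are hypothetical.

```python
from collections import deque


def apply_writes(writes):
    """Replay (block number, data) writes in the given order onto an empty volume."""
    volume = {}
    for block_no, data in writes:
        volume[block_no] = data
    return volume


fifo = deque()
buf = bytearray(b"old")

# Queue a private copy of the buffer, so later reuse of `buf` by the
# caller cannot change what is sent to the replica.
fifo.append((5, bytes(buf)))
buf[:] = b"new"
fifo.append((9, b"x"))
fifo.append((5, bytes(buf)))

issued = list(fifo)                     # order seen by the original device
original = apply_writes(issued)
replica_in_order = apply_writes(fifo)   # FIFO order: matches the original
replica_reordered = apply_writes(reversed(issued))
```

Replayed in FIFO order, the replica ends with the newer contents in block 5, as the original does. Replayed in reverse, it ends with the stale contents, so the two volumes diverge even though the same set of writes was applied.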
2. Block Number Logging for Fast Resynchronization
What if the network or server fails for a while, but the client computer is functional? The Pratima driver queues some number of block writes, but if the buffers cannot be flushed out to the replica, the queue fills up. Now, it is not desirable to block the application until the server becomes accessible. The driver gives up at this point, allowing the replica to go out of sync.
Bringing the replica back in sync can be painful and generally requires stopping the application. Block number logging can be used to speed up resynchronization. All the block numbers of blocks to be written are logged to disk. Then, resynchronization is accomplished quickly by replicating only the logged blocks. However, logging block numbers consumes local device bandwidth.
My solution is based on the reasonable assumption that the client machine never undergoes transient failures. This solution uses an in-memory list called the block write table (BWT), in which only the block numbers of all in-flight writes are stored. Thus, if a network outage causes the queue to overflow and lose write data, we can read these blocks from the local volume and replicate them as soon as possible.
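The idea can be sketched in Python as follows. This is a simplified model, not the driver's code: `ReplicatingClient`, its method names, the per-block counter and the decision to discard the whole queue on overflow are assumptions made to keep the example short.

```python
from collections import Counter, deque


class ReplicatingClient:
    """Hypothetical model of the client side; names are illustrative only."""

    def __init__(self, queue_capacity):
        self.local = {}        # local volume: block number -> data
        self.queue = deque()   # pending (block number, data) writes
        self.queue_capacity = queue_capacity
        self.bwt = Counter()   # block number -> count of in-flight writes
        self.overflowed = False

    def write(self, block_no, data):
        self.local[block_no] = data
        self.bwt[block_no] += 1    # remember the block number, not the data
        if self.overflowed:
            return                 # during an outage only the BWT is updated
        if len(self.queue) >= self.queue_capacity:
            self.queue.clear()     # queued data is given up; the BWT survives
            self.overflowed = True
            return
        self.queue.append((block_no, data))

    def flush(self, remote):
        """Normal operation: ship queued writes in FIFO order."""
        while self.queue:
            block_no, data = self.queue.popleft()
            remote[block_no] = data
            self.bwt[block_no] -= 1
            if self.bwt[block_no] == 0:
                del self.bwt[block_no]

    def resync(self, remote):
        """After an outage: re-read each block named in the BWT and resend it."""
        for block_no in sorted(self.bwt):
            remote[block_no] = self.local[block_no]
        self.bwt.clear()
        self.overflowed = False


remote = {}
client = ReplicatingClient(queue_capacity=2)
client.write(1, b"a")
client.flush(remote)                 # network is up
for n in (2, 3, 4):                  # network is down; the third write overflows
    client.write(n, b"v%d" % n)
blocks_to_resend = sorted(client.bwt)
client.resync(remote)                # network is back
```

Only the blocks named in the table are read back from the local volume, so resynchronization cost depends on how many distinct blocks changed during the outage rather than on the size of the volume.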
3. Recovery and Fail Over
If the client machine crashes, we lose the queue and the block write table. The client has to undergo a complete synchronization to make all replicas consistent.
If the server machine or the network suffers a transient failure, we then can use the block write table (BWT) on the client side for resynchronization. However, if the server or network outage is long enough to overflow the BWT, the situation cannot be saved. Complete synchronization is required before replication can be restarted.
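These recovery rules reduce to a small decision function. The function name and the returned strings below are hypothetical and serve only to summarize the cases described above.

```python
def recovery_action(client_crashed, bwt_overflowed):
    """Pick the recovery path for a failure, following the rules in the text."""
    if client_crashed:
        # The queue and the in-memory BWT are both gone.
        return "complete synchronization"
    if bwt_overflowed:
        # The outage outlasted the BWT, so the changed blocks are unknown.
        return "complete synchronization"
    # Transient server or network failure with an intact BWT.
    return "resynchronize from BWT"


after_short_outage = recovery_action(client_crashed=False, bwt_overflowed=False)
after_long_outage = recovery_action(client_crashed=False, bwt_overflowed=True)
after_client_crash = recovery_action(client_crashed=True, bwt_overflowed=False)
```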