In a world that appears to be governed by Murphy's Laws, anything can go wrong. Accidents ranging from machine crashes, media failures, operator errors and random data corruption to such catastrophes as floods and earthquakes can result in lost data, both temporarily and permanently. The risk of data accidents cannot be eliminated, so one must plan to minimize it. There are certain known techniques for this.
Users can tolerate varying intervals of loss of data availability. An e-business may not tolerate more than a few minutes of down time, but a user may tolerate lack of access to stored clip art for a week. Data becomes unavailable when any critical component fails, such as power, processor, memory, or disk. If a copy of the data is available, it may be possible to restore access to data from that copy. If the copy is off-line, such as on tape, it may take several minutes or hours to restore it to disk. If the copy is on disk, access can be failed over in seconds or minutes. If no data copy is available, and it is not possible to reconstruct the data, we have permanent data loss.
The key to protecting data is to have more than one copy. As data keeps changing, the copy must change accordingly. Disk mirroring is a technique that keeps the data copy always up to date. Mirroring works well if both disks are equally fast. If one disk is connected over a network, however, the application slows down because it must wait until the data updates are completed on both disks.
Keeping a copy at a distance somewhere off-site helps data survive accidents that cause large-scale damage. This technique is called replication, discussed next.
Replication is a technique of maintaining identical physical copies or replicas of a master set of data at two or more geographically separate locations. The most common data replication techniques are point-in-time copies and real-time copies. Point-in-time copies involve capturing snapshots of critical data and storing them safely at a remote location. Snapshots can be taken on tape, and in the event of a disaster, data is restored from the tape. This technique has several drawbacks, though: the recovered data typically is 24 hours to a few days old. Also, both capturing a snapshot and recovering from one take a long time, causing longer application outages.
Real-time copy propagates updates to the copy immediately or soon after they are applied to the original set of data. These techniques fall into two categories.
Synchronous Replication: duplicates every write over several disks or volumes and blocks the original write until all updates have been completed successfully. Therefore, the performance impact on the application grows with network latency and distance. It thus can be used only over small distances and is practical only for fast and reliable local networks. This practice commonly is employed by enterprises that cannot afford any data loss, such as banks and stock exchanges.
Asynchronous Replication: Snapshots and synchronous replication fall on opposite ends of the replication time versus recovery time spectrum. Asynchronous replication compromises some timeliness of data for higher performance and minimal application impact. It decouples application writes from replication writes. Application writes return soon after the replicator has logged them. The main advantage is the imperceptible impact on the application performance; hence, replication can take place over larger distances. A major challenge in asynchronous replication, though, is write sequencing, also known as the write order fidelity problem.
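The difference between the two real-time modes can be illustrated with a small user-space sketch. It is not Pratima code: the Replica class, its delay parameter (standing in for network round-trip time), and the function and class names are all hypothetical, and plain dictionaries stand in for volumes.

```python
import queue
import threading
import time


class Replica:
    """Hypothetical remote volume; `delay` simulates the network round trip."""

    def __init__(self, delay):
        self.delay = delay
        self.blocks = {}

    def write(self, block_no, data):
        time.sleep(self.delay)
        self.blocks[block_no] = data


def sync_write(local, replica, block_no, data):
    """Synchronous mode: the caller waits until the replica has the data."""
    local[block_no] = data
    replica.write(block_no, data)


class AsyncReplicator:
    """Asynchronous mode: log the write, return, replicate in the background."""

    def __init__(self, replica):
        self.replica = replica
        self.log = queue.Queue()  # FIFO, so replica writes keep their order
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, local, block_no, data):
        local[block_no] = data
        self.log.put((block_no, data))  # returns as soon as the write is logged

    def _drain(self):
        while True:
            block_no, data = self.log.get()
            self.replica.write(block_no, data)
            self.log.task_done()

    def flush(self):
        """Wait until every logged write has reached the replica."""
        self.log.join()
```

With a nonzero delay, sync_write returns only after the simulated round trip, while AsyncReplicator.write returns immediately and the replica catches up later. Until flush completes, the replica may lag behind the local volume, which is the timeliness that asynchronous replication gives up.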
Pratima (meaning reflection or image in Sanskrit) provides block-level, real-time replication of one or more block devices on a client computer. The devices are replicated to a server computer. A local device (say /dev/sda4) is placed under control of Pratima, which then offers access through its own block device (say /dev/srr0). In addition to hard disk partitions, any underlying block device, such as a logical volume manager (LVM) device, can be replicated.
Pratima provides methods for initial synchronization, fast on-line resynchronization and automatic reconnection. The product also supports chaining for higher flexibility and reliability.
Pratima software components run on both client and server computers. A Pratima device driver captures updates on the client computer, and a dæmon on the server computer receives replication data over the network and writes it down to replica devices.
The client module is a stacked device driver interposed below the filesystem and above the storage device driver. The driver exports a block interface and can be accessed through system calls, including open, close, read, write and stat. Additionally, it supports ioctls for such control operations as enable, disable, clean, kill, reconnect and status.
Figure 1: Client Side Data Flow for Asynchronous Replication Mode
The server side listener module basically is a user-space daemon that passively waits for client side packets to arrive. These packets correspond to the different system calls and ioctls the client interface supports. For example, an enable ioctl corresponds to an enable packet. Upon receiving one, the server enables the remote volume for replication. Similarly, if a write packet arrives, it writes it down to the remote volume and returns the success status of this operation to the client.
Figure 2: Server Side Data Flow
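The packet handling on the server can be pictured as a dispatch table from packet type to handler. The sketch below is only an illustration under assumptions: the dictionary packet layout, the field names (type, block, data), and the ReplicaServer class are hypothetical, and a bytearray stands in for the remote volume. The actual wire format is not described here.

```python
class ReplicaServer:
    """Hypothetical listener core: one handler per packet type."""

    def __init__(self, num_blocks, block_size=512):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.volume = bytearray(num_blocks * block_size)  # stand-in replica
        self.enabled = False
        self.handlers = {
            "enable": self._enable,
            "disable": self._disable,
            "write": self._write,
        }

    def handle(self, packet):
        """Dispatch one client packet and return a status reply."""
        handler = self.handlers.get(packet.get("type"))
        if handler is None:
            return {"ok": False, "error": "unknown packet type"}
        return handler(packet)

    def _enable(self, packet):
        self.enabled = True
        return {"ok": True}

    def _disable(self, packet):
        self.enabled = False
        return {"ok": True}

    def _write(self, packet):
        if not self.enabled:
            return {"ok": False, "error": "replication not enabled"}
        block_no, data = packet["block"], packet["data"]
        if not 0 <= block_no < self.num_blocks:
            return {"ok": False, "error": "block out of range"}
        if len(data) != self.block_size:
            return {"ok": False, "error": "bad block size"}
        start = block_no * self.block_size
        self.volume[start:start + self.block_size] = data
        return {"ok": True}  # success status goes back to the client
```

A real listener would read these packets from a network socket; the table-driven shape keeps each ioctl-style operation in its own small handler.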
A filesystem may be mounted on a replicator device and initially synchronized with the remote volume. Once the local and remote volumes are in sync, reads and writes are directed to the replicator device driver. It treats reads as transparent and passes them on to the underlying device driver. On the other hand, all incoming writes are bifurcated: one copy goes to the local device and one to the remote server. The block number is recorded in memory, and the write then is passed to the underlying driver; if that local write succeeds, the data is queued and sent over to the remote server.
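The read and write paths just described can be sketched in user space as follows. This is an illustration only, not the driver itself: MemoryDisk, ReplicatorDevice, and their attributes are hypothetical names, and the real code is a kernel block driver rather than Python objects.

```python
from collections import deque


class MemoryDisk:
    """Hypothetical stand-in for the underlying block device."""

    def __init__(self):
        self.blocks = {}
        self.fail_writes = False  # set True to simulate a local write error

    def read(self, block_no):
        return self.blocks.get(block_no)

    def write(self, block_no, data):
        if self.fail_writes:
            return False
        self.blocks[block_no] = bytes(data)
        return True


class ReplicatorDevice:
    """Sits between the filesystem and the underlying device."""

    def __init__(self, lower):
        self.lower = lower
        self.recorded_blocks = set()  # block numbers noted in memory
        self.send_queue = deque()     # writes waiting to go to the server

    def read(self, block_no):
        # Reads are transparent: hand them straight to the lower device.
        return self.lower.read(block_no)

    def write(self, block_no, data):
        # Record the block number first. If the local write then fails, the
        # number stays recorded, which at worst costs one extra block copy
        # during a later resynchronization.
        self.recorded_blocks.add(block_no)
        if not self.lower.write(block_no, data):
            return False  # local write failed: nothing to replicate
        self.send_queue.append((block_no, bytes(data)))
        return True
```

Only successful local writes reach send_queue, so the server never receives data that the local volume does not hold.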
The design for Pratima had to address several interesting issues, which are described below.
1. Write Order Fidelity
Write order fidelity means the writes on the replicated device must be applied in exactly the same order as on the original device. If this ordering is not preserved, the replica may not be usable. FIFO queues containing private data buffers had to be used to provide write order fidelity.
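A small sketch shows why both the FIFO order and the private buffers matter. The names here (enqueue, apply_writes) are hypothetical, and dictionaries stand in for volumes; the point is only that two writes to the same block must reach the replica in the order they were issued, and that each queued write needs its own copy of the data.

```python
from collections import deque


def enqueue(fifo, block_no, buf):
    """Queue a write with a private copy of the caller's buffer."""
    fifo.append((block_no, bytes(buf)))


def apply_writes(writes):
    """Apply (block_no, data) pairs in the given order to an empty volume."""
    volume = {}
    for block_no, data in writes:
        volume[block_no] = data
    return volume


fifo = deque()
buf = bytearray(b"v1")
enqueue(fifo, 7, buf)   # first write to block 7
buf[:] = b"v2"          # the application reuses its buffer
enqueue(fifo, 7, buf)   # second write to block 7

in_order = apply_writes(fifo)              # replica matches the original
reordered = apply_writes(reversed(fifo))   # replica ends up with stale data

print(in_order[7], reordered[7])  # b'v2' b'v1'
```

Without the bytes(buf) copy, both queue entries would refer to the same buffer and the first write's contents would be lost once the application reused it.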
2. Block Number Logging for Fast Resynchronization
What if the network or server fails for a while, but the client computer is functional? The Pratima driver queues some number of block writes, but if the buffers cannot be flushed out to the replica, the queue fills up. Now, it is not desirable to block the application until the server becomes accessible. The driver gives up at this point, allowing the replica to go out of sync.
Bringing the replica back in sync can be painful and generally requires stopping the application. Block number logging can be used to speed up resynchronization. All the block numbers of blocks to be written are logged to disk. Then, resynchronization is accomplished quickly by replicating only the logged blocks. However, logging block numbers consumes local device bandwidth.
My solution is based on the reasonable assumption that the client machine never undergoes transient failures. This solution uses an in-memory list called the block write table (BWT), in which only the block numbers for all in-flight writes are stored. Thus, if a network outage causes the queue to overflow and lose write data, we can read these blocks from the local volume and replicate them as soon as possible.
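The idea can be sketched as follows, with several assumptions labeled up front: the class and method names are hypothetical, dictionaries stand in for the local and replica volumes, and the overflow policy (discard all queued data while keeping the block numbers) is one plausible reading of the behavior described above, not a confirmed detail of the driver.

```python
from collections import deque


class BwtClient:
    """Hypothetical client side: bounded write queue plus block write table."""

    def __init__(self, local, queue_limit):
        self.local = local            # local volume: block_no -> bytes
        self.queue = deque()          # queued (block_no, data) for the server
        self.queue_limit = queue_limit
        self.bwt = set()              # block numbers of in-flight writes
        self.out_of_sync = False

    def write(self, block_no, data):
        self.local[block_no] = data
        self.bwt.add(block_no)        # only the number, never the data
        if self.out_of_sync:
            return
        if len(self.queue) >= self.queue_limit:
            # Assumed policy: give up on queued data, keep block numbers.
            self.queue.clear()
            self.out_of_sync = True
        else:
            self.queue.append((block_no, data))

    def drain(self, replica):
        """Network is up: ship queued writes in FIFO order."""
        while self.queue:
            block_no, data = self.queue.popleft()
            replica[block_no] = data
        if not self.out_of_sync:
            self.bwt.clear()          # nothing is in flight any more

    def resync(self, replica):
        """After an outage: copy only the blocks named in the BWT."""
        for block_no in sorted(self.bwt):
            replica[block_no] = self.local[block_no]
        self.bwt.clear()
        self.out_of_sync = False
```

Because resync reads the current contents of each listed block from the local volume, it copies only the blocks that changed during the outage instead of the whole device, and it needs no on-disk log.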
3. Recovery and Fail Over
If the client machine crashes, we lose the queue and the block write table. The client has to undergo a complete synchronization to make all replicas consistent.
If the server machine or the network suffers a transient failure, we then can use the block write table (BWT) on the client side for resynchronization. However, if the server or network outage is long enough to overflow the BWT, the situation cannot be saved. Complete synchronization is required before replication can be restarted.