Economical Fault-Tolerant Networks
In the past decade or so, we have seen the evolution of computer networks as a merger of the communication and computing worlds. This revolution has made the dream of global connectivity possible, and with it has brought an explosion in the number of people utilizing computer networks for numerous vital tasks. The sheer volume and importance of tasks being ported to network-based applications and services has made reliability an undeniable need. On the other hand, networks fail all the time for a variety of reasons, including but not limited to power failures, communication link breakdowns, hardware failures and software crashes. Overcoming these failures and maintaining network services to users is addressed by implementing fault tolerance.
One common method for implementing fault tolerance is redundant data replication. Employment of an exact copy in place of a failed network component ensures availability of services on the network. The process of fault detection and replacement must be quick and automatic in order to make the fault invisible at the client end.
Our solution bases itself on a cluster of identical Linux servers, providing various network services. We did not employ any additional links other than the standard network connectivity. Various components within the solution include an algorithm to select a successor to a failed component, and both clock and data synchronization procedures among the processes. The services and data are replicated on all computers within the cluster. In the event of failure of a server, it is to be replaced by a replica, which carries a redundant copy of all data and service configurations offered by the crashed machine. After the detection of a failure, an election process is enacted among the identical servers and a replacement for the crashed server comes forward. Under normal working conditions of the network, the current server regularly pushes clock time, data and key configuration files to the rest of the cluster in order to maintain them as peers.
The standard techniques for improving the reliability of the network can be divided into two broad groups.
Dedicated hardware: this is expensive, and needs special maintenance. Examples of such a method include RAID (redundant array of inexpensive disks) and non-IP solutions requiring secondary communication links between machines.
Secondary backup machines: these are machines identical to the primary server and are maintained as mirrors of the primary. Their major drawback is not their high cost, but that under conditions of network failure, the switchover to the backup machine is not transparent to the clients. In fact, the clients need to be configured in advance with secondary (or slave) server parameters. The clients for most of the services will usually try the primary server first; then, on timeout, attempt to use the secondary server. This introduces unnecessary overhead and delay. Another important thing to note is that the backup machine is 100 percent redundant, used only when the primary server is under fault.
The problem is in achieving a software solution to ensure reliability, without the need of additional hardware and to keep the switchover transparent to the clients. Furthermore, we would not like to waste resources by completely dedicating one or more machines to a backup role.
Our solution finds its roots in the already-established backup system of primary and secondary servers. Instead of using two computers, we employ a set of computers. One of these computers is selected to act as a master machine to coordinate the network services, while the rest assume the role of slave machines: general-purpose server-class machines that take part in the election. The master server sets up a virtual IP address and starts all services required for normal operation of the network. The slaves monitor the status of the master for failures. If a failure occurs, a new master will be chosen and services to the network restored; the change is transparent to the clients. The slaves are not intended to be dedicated solely to this purpose; rather, they may be employed to perform other tasks (compute servers, workstations, etc.). Only on being chosen master does a machine also take on the task of establishing the virtual server.
The most crucial task in the event of failure is the selection of a new master machine. It is not possible to predict which machines in the chain would be unavailable due to earlier failures or being switched off. Therefore, a simple pre-established hierarchy of switchover, like the one in primary/secondary servers, is not practical. Moreover, a single coordinator for selecting a master cannot be used, because failure of this coordinator would result in total failure of the network.
With these facts in mind, we used a distributed election algorithm to select the appropriate successor to the dead master. Election algorithms are widely used to select a coordinator process in parallel and distributed computing. Initially, we tried the “Bully Election” algorithm. (See Resources.) This suffers from a handicap in that when a failed master is fixed and brought back into the network, it bullies the already-established master into handing over services. This creates an unnecessary switchover and results in loss of newer and updated data, besides causing network delay. The data loss occurs because updates may have been made in the new master machine, and when the original crashed master revives, it does not have the updated data.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- New Products
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Web & UI Developer (JavaScript & j Query)
- Designing Electronics with Linux
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- Reply to comment | Linux Journal
14 min 59 sec ago - Nice article, thanks for the
10 hours 55 min ago - I once had a better way I
16 hours 41 min ago - Not only you I too assumed
16 hours 58 min ago - another very interesting
18 hours 51 min ago - Reply to comment | Linux Journal
20 hours 45 min ago - Reply to comment | Linux Journal
1 day 3 hours ago - Reply to comment | Linux Journal
1 day 3 hours ago - Favorite (and easily brute-forced) pw's
1 day 5 hours ago - Have you tried Boxen? It's a
1 day 11 hours ago




Comments
Re: Economical Fault-Tolerant Networks
Raza Bhai please murad bhai ko bhi kuch samjha deen....He is such a nela:::so please guide him and lead him:::we will be very thanks full to u:::thanks::From Lahore Office ...<|_|_| | Hafiz