High Availability Cluster Checklist
One of the greatest benefits of a high-availability cluster, and ironically one that is often overlooked, is the ability to cleanly migrate services off a cluster member so you can perform routine maintenance without disrupting service to client systems. For example, this allows you to upgrade your software to the latest release or add memory to your system while keeping your site operational. Virtually all high-availability cluster offerings accommodate planned maintenance.
If you believe that a particular operating system is crash-proof, give me a call and I'll sell you the Brooklyn Bridge to go along with that OS. Let's face it, system crashes are a fact of life; it is merely a matter of minimizing their frequency. In response to a system crash, the other cluster members will conclude that a server has become nonresponsive and begin taking over the services formerly provided by the failed node.
In the event of a system crash, virtually all fail-over cluster implementations will correctly take over the services of a failed node. So far so good; it looks like just about any fail-over cluster product will suit you. Not so fast: the following points separate the credible offerings from the not-so-credible.
Typical high-availability cluster implementations consist of a set of cluster members, each monitoring the others' health over a variety of “cluster interconnects”. Historically, many proprietary cluster vendors have depended on custom hardware for their cluster interconnects. While this provides a solid cluster implementation, by nature it tends to be very expensive and locks you into a single vendor. To provide a cost-effective alternative, other cluster implementations monitor system health over commonly available network interconnects (typically Ethernet) and serial port connections. In these configurations, the cluster members periodically exchange messages and, based on the response (or lack thereof), conclude whether the other members are up or down. This exchange of system health-monitoring messages is commonly referred to as a “heartbeat”.
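The heartbeat idea can be sketched in a few lines of Python. This is only an illustration of timeout-based detection, not the logic of any particular cluster product; the class name `HeartbeatMonitor`, the three-second timeout and the injectable clock are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks when each peer was last heard from (hypothetical helper)."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed threshold, not taken from any product
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # peer name -> time of the last heartbeat

    def record_heartbeat(self, peer):
        """Call whenever a heartbeat message arrives from `peer`."""
        self.last_seen[peer] = self.clock()

    def is_up(self, peer):
        """A peer counts as up only if it was heard from within the timeout."""
        seen = self.last_seen.get(peer)
        return seen is not None and (self.clock() - seen) <= self.timeout_s
```

Note what the monitor cannot tell you: when `is_up` returns `False`, the peer may have crashed, or the link carrying its heartbeats may have failed while the peer is still running.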
A common problem with “heartbeat”-based clusters is the communication partition. A communication partition occurs when cluster members (or sets of members) are up but unable to communicate with one another. Take, for example, the diagram in Figure 2 depicting a two-node cluster with an Ethernet connection and a serial connection between the nodes, over which heartbeat messages are exchanged.
Let us suppose you had set up your high-availability cluster and gone off to Las Vegas for the weekend, lulled into complacency with your company's new on-line ordering system deployed in this configuration. Further imagine the cleaning person accidentally knocking out the Ethernet connection with a broom. Now the cluster software running on each node must decide how to respond to this scenario in the interest of preserving high availability. Since the members can't communicate, they have to make the call in isolation. Here are some policy options commonly used by cluster products:
Pessimistic assumption: Node A knows that it's serving the database but is unaware of node B's state, so node A continues to serve the database. Node B can't communicate with node A and assumes that node A is down. Node B then starts serving the database as well, so two cluster members are serving the same database, which leads to database corruption and possibly a system crash. (As weak as this sounds, this policy is employed in some offerings!)
Optimistic assumption: After a site-wide power outage, node A and node B both boot up at the same time. Neither node can ascertain the state of the other node and, just to be safe, each assumes that the other node is up, so neither starts serving the database (to avoid data corruption). This results in a scenario where neither cluster member is serving the database. So much for spending money on a redundant cluster server! Actually, you're better off having your database unavailable than having it corrupted.
There are other failure scenarios that manifest themselves as a communication failure. For example:
An Ethernet adapter fails
The systems are connected to a common hub or switch that fails
The Ethernet cable fails
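Whichever of these failures silences the heartbeat, the outcome depends on the policy each node applies in isolation. The toy model below plays out the pessimistic and optimistic assumptions described earlier for a two-node cluster. The function name `serving_nodes` and the policy strings are invented for this illustration and do not correspond to any real cluster product.

```python
def serving_nodes(policy, currently_serving, nodes=("A", "B")):
    """Return the set of nodes serving the database after heartbeats are lost.

    policy: "pessimistic" (assume the silent peer is down) or
            "optimistic" (assume the silent peer is still up).
    currently_serving: nodes that were serving before contact was lost.
    """
    if policy not in ("pessimistic", "optimistic"):
        raise ValueError(f"unknown policy: {policy}")

    result = set(currently_serving)
    for node in nodes:
        if node in result:
            continue  # a node that is already serving keeps serving
        if policy == "pessimistic":
            result.add(node)  # peer presumed dead, so take over its service
        # optimistic: peer presumed alive, so this node stays idle
    return result


# Broom incident: A was serving, B hears nothing and takes over as well.
print(serving_nodes("pessimistic", {"A"}))
# Power outage: both nodes boot, neither was serving, neither starts.
print(serving_nodes("optimistic", set()))
```

More than one node in the result means two members are writing to the same database; an empty result means the service is unavailable.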
To avoid these forms of communication partition, a common clustering practice is to employ multiple communication interconnects. For example, you may have the systems monitor each other's health by heartbeating over multiple Ethernet links or over a combination of Ethernet and serial connections. Similarly, you may have each of the network connections go through separate hubs/switches or be point-to-point links.
Most cluster implementations allow you to configure multiple communication interconnects, so that no single failed adapter, cable, hub or switch can cause a communication partition. (If they do not, you should probably quickly move on to another vendor.)
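As a rough sketch of why redundant interconnects help, the function below classifies a peer from the time it was last heard on each interconnect. The function name, the status strings and the three-second timeout are assumptions made for illustration; real products use their own rules and thresholds.

```python
def peer_status(last_heard, now, timeout_s=3.0):
    """Classify a peer from per-interconnect heartbeat timestamps.

    last_heard: dict mapping interconnect name -> time of the last heartbeat
    now:        current time, in the same units as the timestamps
    """
    alive = [name for name, t in last_heard.items() if now - t <= timeout_s]
    if not alive:
        return "unreachable"  # silent on every interconnect
    if len(alive) == len(last_heard):
        return "up"           # heartbeats arriving on every interconnect
    return "degraded"         # peer is up, but at least one link has failed


# The Ethernet cable is knocked out; the serial link still carries heartbeats.
print(peer_status({"ethernet": 10.0, "serial": 19.5}, now=20.0))
```

With a single interconnect, the broom incident would be indistinguishable from a crashed peer. With two, the surviving link shows that the peer is still running, so no takeover is needed.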