A High-Availability Cluster for Linux
A major design factor is resynchronization (mirroring back) of the files once a failed node has recovered. A reliable procedure must be employed so that data which has changed on the failover node during the failure period is mirrored back to the original node and not lost, because the original node overwrites or deletes it in the restoration procedure. The resynchronization procedure should be implemented so that a node cannot perform any mirroring while another node has taken over its services. Also, before the services can be restarted on the original node, all files associated with it must be completely mirrored back to this original node. This must be done while the services are off-line on both nodes to prevent the services from writing to the files being restored. Failure to prevent this could result in data corruption and loss.
The main problem when using this solution was with IMAP4 and pop3 mail spools. If an e-mail message is received and delivered on serv2, and serv2 fails before mirroring can take place, serv1 will take over the mail services. Subsequent mail messages would arrive in serv1's mail spool. When serv2 recovers, any e-mail received just before failure will be overwritten by the new mail received on serv1. The best way to solve this is to configure Sendmail to queue a copy of its mail for delivery to the takeover node. In the event that the takeover node is off-line, mail would remain in the Sendmail queue. Once the failed node recovered, e-mail messages would be successfully delivered. This method requires no mirroring of the mail spools and queues. However, it would be necessary to have two Sendmail configurations available on both nodes: one configuration for normal operation and the other for node takeover operation. This will prevent mail from bouncing between the two servers.
I am not a Sendmail expert. If you know how to configure dual-queuing Sendmail delivery, please let me know. This part is still a work in progress. As a temporary measure, I create backup files on resynchronization of the mail spool with manual checking on node recovery, which is quite time consuming. I also prevent such difficulties by mirroring the mail spool as frequently as possible. This has an unfortunate temporary side effect of making my hard disks work overtime. Similar problems would be encountered when clustering a database service. However, a few large UNIX database vendors are now providing parallel versions of their products, which enable concurrent operation across several nodes in a cluster.
A node could fail for various reasons ranging from an operating system crash, which would result in a hang or reboot, to a hardware failure, which could result in the node going into standby mode. If the system is in standby mode, it will not automatically recover. The administrator must manually remove a standby lock file and start run-level 5 on the failed node to confirm to the rest of the cluster that the problem has been resolved. If the OS hangs, this would have the same effect as a standby run level; however, if the reset button is pressed or the system reboots, the node will try to rejoin the cluster, as no standby lock file will exist. When a node attempts to rejoin the cluster, the other node will detect the recovery and stop all cluster services while the resynchronization of the disks takes place. Once this has completed, the cluster services will be restarted and the cluster will once again be in full operation.
My choice of Linux distribution is Red Hat 5.1 on the Intel platform. There are, however, no reasons why this could not be adapted for another Linux distribution. The implementation is purely in user space. No special drivers are required. Some basic prerequisites are necessary in order to effectively deploy this system:
Two similarly equipped servers, especially in terms of data storage space, are needed.
Three network interface cards per server are recommended, although two might work at the expense of some modifications and extra LAN traffic.
Sufficient network bandwidth is needed between the cluster nodes.
My system consists of two Dell PowerEdge 2300 Servers, each complete with:
three 3C905B 100BaseTX Ethernet cards
two 9GB Ultra SCSI 2 hard disks
one Pentium II 350 MHz CPU
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Devuan Beta Release
- May 2016 Issue of Linux Journal
- EnterpriseDB's EDB Postgres Advanced Server and EDB Postgres Enterprise Manager
- The US Government and Open-Source Software
- The Humble Hacker?
- The Death of RoboVM
- BitTorrent Inc.'s Sync
- Open-Source Project Secretly Funded by CIA
- New Container Image Standard Promises More Portable Apps
- AdaCore's SPARK Pro
In modern computer systems, privacy and security are mandatory. However, connections from the outside over public networks automatically imply risks. One easily available solution to avoid eavesdroppers’ attempts is SSH. But, its wide adoption during the past 21 years has made it a target for attackers, so hardening your system properly is a must.
Additionally, in highly regulated markets, you must comply with specific operational requirements, proving that you conform to standards and even that you have included new mandatory authentication methods, such as two-factor authentication. In this ebook, I discuss SSH and how to configure and manage it to guarantee that your network is safe, your data is secure and that you comply with relevant regulations.Get the Guide