A High-Availability Cluster for Linux
The resynchronization (mirroring back) procedure was implemented using rsync, which uses a lock file to disallow any mirroring to another node when a node failure is sensed. The lock file is checked for existence by sync-app before any files are mirrored. This prevents node A mirroring to node B, while node B is mirroring the same files to node A.
If preferred, clusterd could be used with a shared and/or distributed storage device by removing the resynchronization function and by not using sync-app, although I have not tried this.
To test server failure, I had to simulate the failure of every interface on the cluster. In each case, the cluster took the expected action and shut down the correct server. In the case of the inter-node/heartbeat network failing, the nodes simply carried on normal operation and notified the administrator of the failure. On a point-to-point network of this nature, it is almost impossible to determine which NIC is at fault. I simulated various network switch failures and power supply failures. The results were all as expected. After a node was put into standby (single-user) mode, I had to manually remove a standby lock file in order to fully bring up the node again. If a node recovered and entered a network runlevel while the standby lock file still existed, the remote node immediately put the node back into standby mode to prevent an IP and MAC address clash on the LAN.
Mirroring was tested over a period of several months, and I found that the nodes could typically compare 6GB of unchanged data in approximately 50,000 files in under 45 seconds.
After catastrophic node failure (I pulled the power plug from the UPS), recovery time for the node was around 10 to 15 minutes for fsck disk checking, and a disk resynchronization time of around three minutes (9GB of data). This represented a cluster services downtime of around three minutes to the LAN clients.
Failover delay from when a node failed until the remote node fully took over was typically 60 to 80 seconds. The effect on users depended on the service: Sendmail, IMAP4, http and FTP simply refused connection for users for the duration, whereas Samba sometimes momentarily locked up a Windows PC application when files were open at the point of failure. radius and dhcpd caused no client lock-outs, probably because of their UDP implementation.
On the whole, the cluster provides us with much better system availability. It is a vast improvement over the single server, as we can now afford to do server maintenance and upgrades during working hours. We have not yet had any catastrophic failures with the new Dell servers, but the test results show a minimal downtime of less than two minutes while a node takes over. We have saved large amounts of capital by implementing a simple high-availability cluster without the need for expensive specialist hardware such as dual ported RAID.
This clustering solution is certainly not as advanced as some of the commercial clusters or as thorough as some of the upcoming open source Linux-HA project proposals; however, it does sufficiently meet our needs.
The system has been in full-time production operation since September 1998. We have over 30 LAN clients using the cluster as their primary “server”. The system has proven to be reliable. The company sees the server as a business-critical system, and we have achieved the objectives of high availability.
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Petros Koutoupis' RapidDisk
- ServersCheck's Thermal Imaging Camera Sensor
- The Italian Army Switches to LibreOffice
- Linux Mint 18
- Oracle vs. Google: Round 2
- The FBI and the Mozilla Foundation Lock Horns over Known Security Hole
- Firefox 46.0 Released
- Privacy and the New Math
Until recently, IBM’s Power Platform was looked upon as being the system that hosted IBM’s flavor of UNIX and proprietary operating system called IBM i. These servers often are found in medium-size businesses running ERP, CRM and financials for on-premise customers. By enabling the Power platform to run the Linux OS, IBM now has positioned Power to be the platform of choice for those already running Linux that are facing scalability issues, especially customers looking at analytics, big data or cloud computing.
￼Running Linux on IBM’s Power hardware offers some obvious benefits, including improved processing speed and memory bandwidth, inherent security, and simpler deployment and management. But if you look beyond the impressive architecture, you’ll also find an open ecosystem that has given rise to a strong, innovative community, as well as an inventory of system and network management applications that really help leverage the benefits offered by running Linux on Power.Get the Guide