A High-Availability Cluster for Linux
The resynchronization (mirroring back) procedure was implemented using rsync and is guarded by a lock file, which disallows any mirroring to another node when a node failure is sensed. Before any files are mirrored, sync-app checks whether the lock file exists. This prevents node A from mirroring to node B while node B is mirroring the same files to node A.
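The lock-file check described above can be sketched roughly as follows. This is only an illustration, not the actual sync-app code: the function names, the lock path and the rsync options (`-a`, `--delete`) are assumptions chosen for the example, and the `run` parameter exists only so the logic can be exercised without invoking rsync.

```python
import os
import subprocess


def build_rsync_command(src, dest):
    # Assumed options: -a (archive mode) and --delete (remove files that
    # no longer exist on the source). The article does not list the real ones.
    return ["rsync", "-a", "--delete", src, dest]


def mirror_if_unlocked(lock_path, src, dest, run=subprocess.run):
    """Mirror src to dest unless the lock file exists.

    Returns True if mirroring was attempted, False if it was skipped
    because the lock file was present.
    """
    if os.path.exists(lock_path):
        # A node failure was sensed (or a resynchronization is under way),
        # so do not mirror in the opposite direction.
        return False
    run(build_rsync_command(src, dest), check=True)
    return True
```

Note that checking for the file and then starting rsync is not atomic. In a sketch like this, the lock file would have to be created before the recovering node begins mirroring back for the guard to be effective.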
If preferred, clusterd could be used with a shared and/or distributed storage device by removing the resynchronization function and by not using sync-app, although I have not tried this.
To test server failure, I had to simulate the failure of every interface on the cluster. In each case, the cluster took the expected action and shut down the correct server. In the case of the inter-node/heartbeat network failing, the nodes simply carried on normal operation and notified the administrator of the failure. On a point-to-point network of this nature, it is almost impossible to determine which NIC is at fault. I simulated various network switch failures and power supply failures. The results were all as expected. After a node was put into standby (single-user) mode, I had to manually remove a standby lock file in order to fully bring up the node again. If a node recovered and entered a network runlevel while the standby lock file still existed, the remote node immediately put the node back into standby mode to prevent an IP and MAC address clash on the LAN.
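The standby lock behaviour can be summarized as a small decision function. This is a minimal sketch under stated assumptions, not clusterd's real logic: the function name is hypothetical, and the set of "network" runlevels (3 and 5, following the common Red Hat convention) is assumed rather than taken from the article.

```python
# Assumed: runlevels in which the node brings up its LAN interfaces.
NETWORK_RUNLEVELS = {3, 5}


def should_force_standby(peer_runlevel, standby_lock_exists):
    """Decide whether the remote node should push its peer back to standby.

    The peer is forced back only if it has entered a network runlevel
    while its standby lock file has not yet been removed by the
    administrator; otherwise it could clash with the IP and MAC
    addresses already taken over by the surviving node.
    """
    return bool(standby_lock_exists) and peer_runlevel in NETWORK_RUNLEVELS
```

For example, `should_force_standby(3, True)` evaluates to `True`, whereas a node still in single-user mode (`should_force_standby(1, True)`) or one whose lock file has been removed (`should_force_standby(3, False)`) is left alone.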
Mirroring was tested over a period of several months, and I found that the nodes could typically compare 6GB of unchanged data in approximately 50,000 files in under 45 seconds.
After catastrophic node failure (I pulled the power plug from the UPS), the node needed around 10 to 15 minutes for fsck disk checking, followed by around three minutes of disk resynchronization (9GB of data). This represented a cluster services downtime of around three minutes to the LAN clients.
Failover delay, measured from when a node failed until the remote node had fully taken over, was typically 60 to 80 seconds. The effect on users depended on the service: Sendmail, IMAP4, HTTP and FTP simply refused connections for the duration, whereas Samba sometimes momentarily locked up a Windows PC application when files were open at the point of failure. radius and dhcpd caused no client lock-outs, probably because they are UDP-based.
On the whole, the cluster provides us with much better system availability. It is a vast improvement over the single server, as we can now afford to do server maintenance and upgrades during working hours. We have not yet had any catastrophic failures with the new Dell servers, but the test results show a minimal downtime of less than two minutes while a node takes over. We have saved large amounts of capital by implementing a simple high-availability cluster without the need for expensive specialist hardware such as dual-ported RAID.
This clustering solution is certainly not as advanced as some of the commercial clusters or as thorough as some of the upcoming open source Linux-HA project proposals; however, it does sufficiently meet our needs.
The system has been in full-time production operation since September 1998. We have over 30 LAN clients using the cluster as their primary “server”. The system has proven to be reliable. The company sees the server as a business-critical system, and we have achieved the objectives of high availability.