Storage Cluster: A Challenge to LJ Staff and Readers
For a few years I have been trying to create a "distributed cluster storage system" (see below) on standard Linux hardware. I have been unsuccessful. I have looked into buying one and they do exist, but are so expensive I can't afford one. They also are designed for much larger enterprises and have tons of features I don't want or need. I am hoping the Linux community can help me create this low cost "distributed cluster storage system" which I think other small businesses could use. Please help me solve this so we can publish the solution to the open source community.
I am open to any reasonable solution (including buying one) that I can afford (under $3000). I already have some hardware for this project which includes all the nodes and 2 data servers which are: 2 @ Supermicro systems with dual Xeon 3.0 GHz CPUs, 8GB RAM, 4 @ 750GB Seagate SATA HDs.
I have tried to use all of these technologies at one point or another in various combinations to create my solution but have not succeeded. drbd, nfs, gfs, ocfs, aoe, iscsi, heartbeat, ldirectord, round robin DNS, pvfs, cman, clvm, glusterfs, and several fencing solutions.
Description of my "distributed cluster storage system":
- Data server: 2 units (appliances/servers) that each have a 4+ drives in a RAID5 disk set (3 active, 1 hot spare). These 2 units can be active/passive or active/active I don't care. These two units should mirror each other in in real-time. If 1 unit fails for any reason the other picks up the load and carries on without any delay or hang time on the clients. If a unit fails, when it comes back up, I want the data to be re-synced automatically. Then the unit should "come back on-line" (assuming its normal state is active) after it is synced. It would be even more ideal if the data servers could be 1-N instead of just 1-2.
- Data clients: Each cluster node machine (the clients) in the server farm (CentOS 5.4 OS) will mount 1 or more data partitions (in read/write mode) provided by the data server(s). Multiple clients will mount the same partition at the same time in r/w, so network file locking is needed. If the a server goes down (multiple HD failure, network issue, power supply, etc) the other server takes over 100% of the traffic and the client machines never know.
- 2 data servers mirroring each other in real time.
- Auto fail over to the working server if one fails (without the clients needing to be restarted, or even being interrupted).
- Auto re-sync of the data if a failed unit comes back on-line, when the sync is done the unit goes active again (assuming its normal state is active).
- Multiple machines mounting the same partition in read/write mode (some kind of network file system).
- Linux CentOS will be used on the cluster nodes.
What follows is a (partial) description of what I've tried and why it's failed to live up to the requirements:
For the most part I got all the technology listed working as advertised. The problems are mostly related to 1 server failing which can cause a 3 second+ hang in the network file system. The problem is that when a server goes down any solution that uses heartbeat or round robin DNS will hang the network file system for 3 seconds or more. While this is not a problem for many technologies like http, ftp, and ssh (which is what heartbeat is designed for) this poses a big problem for a file system. If you are loading a web page and it takes 3 seconds you may not even notice, or you would hit your reload button and it would work, however a file system is not anywhere near that tolerant of delays, lots of things break when a mount point is unavailable for 3 seconds.
So with DRBD in active/passive mode and heartbeat I set up a "Highly Available NFS Server" with these instructions.
With Linux's implementation of NFS the NFS mount will hang and require manual intervention to recover, even when server #2 takes over the virtual IP. Sometimes you have to reboot the client machines to recover the NFS mounts. This means that it is not "highly available" anymore.
With iscsi I could not find a good way to replicate the LUNs to multiple machines. As I understand the technology it is a 1 to 1 relationship not 1 to many. And again it would use heartbeat or round robin DNS for redundancy which would hang if one of the servers was down.
Moving on to GFS, I find that almost all of the fencing solutions available for GFS make GFS completely unusable. If you have a power switch fence, your client machines will be cold rebooted when the problem might be a flaky network cable, or an overloaded network switch. Cold rebooting any server is very dangerous (if you ask any experienced sysadmin). A simple example, if your boot loader was damaged for some reason after the machine was up it could run for years without a problem, if you reboot, the boot loader will bite you. If you have a managed network switch you could lose all communications with a server because the network file system was down. Many servers will need network communication for other things that are not reliant on the network file system. Again, very high risk in my opinion for what could/should be solved another way. The one solution I do like is the AOE fencing method. This simply removes the MAC of your unreachable client from the "allowable MACs" for the AOE software server. This should not effect anything on the client machine except for the network file system.
I did get a drbd, AOE, heartbeat, GFS, and an AOE fence combination working, but again when a server goes down there is a hang time on the network file system of at least 3 seconds.
Finally there is glusterfs. This seemed like the ideal solution, and I am still trying to work with the glusterfs community on getting this to work. The 2 problems with this solution are:
- When a server goes down there is still a delay/timeout that the mount point is unavailable.
- When a failed server comes back up it has no way to "re-sync" with the working server.
The reason for the second item is that this is a client side replication solution. This means that each client is responsible for writing its file to each server. So basically the servers are unaware of each other. The advantages of this client side replication is the scalability. According to the glusterfs community they are talking hundreds of servers and petabytes of storage.
A final note on all of this, I run one simple test on all my network file systems that I know will bring consistent results. What I have found is that any network file system will be at best 10-25% the speed of a local file system. (also note this is over 1gb copper ethernet not fiber channel) When running (this creates a 1gb test file):
dd of=/network/mount/point/test.out if=/dev/zero bs=1024 count=1M
I get about 11MB on NFS, 25MB on glusterfs, and 170+MB on the local file system. So you have to be ok with the performance hit, which is acceptable for what I am doing.
p.s. I just learned about Lustre, now under the Sun/Oracle brand. I will be testing it out soon.Chad Columbus is a freelance IT consultant and application developer. Feel free to contact him if you have a need for his services.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- Google's SwiftShader Released
- SUSE LLC's SUSE Manager
- My +1 Sword of Productivity
- Interview with Patrick Volkerding
- Managing Linux Using Puppet
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- Non-Linux FOSS: Caffeine!
- Tech Tip: Really Simple HTTP Server with Python
- SuperTuxKart 0.9.2 Released
- Parsing an RSS News Feed with a Bash Script
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide