Storage Cluster: A Challenge to LJ Staff and Readers

For a few years I have been trying to create a "distributed cluster storage system" (see below) on standard Linux hardware. I have been unsuccessful. I have looked into buying one, and they do exist, but they are so expensive I can't afford one; they are also designed for much larger enterprises and have tons of features I don't want or need. I am hoping the Linux community can help me create this low-cost "distributed cluster storage system", which I think other small businesses could use as well. Please help me solve this so we can publish the solution for the open source community.

I am open to any reasonable solution (including buying one) that I can afford (under $3,000). I already have some hardware for this project, including all the cluster nodes and two data servers: two Supermicro systems, each with dual 3.0GHz Xeon CPUs, 8GB of RAM, and four 750GB Seagate SATA hard drives.

I have tried all of these technologies at one point or another, in various combinations, to create my solution, but have not succeeded: DRBD, NFS, GFS, OCFS, AoE, iSCSI, Heartbeat, ldirectord, round-robin DNS, PVFS, CMAN, CLVM, GlusterFS, and several fencing solutions.

Description of my "distributed cluster storage system":

  • Data server: 2 units (appliances/servers), each with 4+ drives in a RAID5 disk set (3 active, 1 hot spare). These 2 units can be active/passive or active/active; I don't care. The two units should mirror each other in real time. If one unit fails for any reason, the other picks up the load and carries on without any delay or hang time on the clients. When a failed unit comes back up, I want the data to be re-synced automatically; then the unit should come back on-line (assuming its normal state is active) once it is synced. It would be even more ideal if the data servers could scale from 1 to N instead of just 1 to 2.
  • Data clients: Each cluster node machine (the clients) in the server farm (running CentOS 5.4) will mount one or more data partitions (in read/write mode) provided by the data server(s). Multiple clients will mount the same partition at the same time in read/write mode, so network file locking is needed. If a server goes down (multiple HD failures, network issue, power supply, etc.), the other server takes over 100% of the traffic and the client machines never know.

To recap:

  • 2 data servers mirroring each other in real time.
  • Auto failover to the working server if one fails (without the clients needing to be restarted, or even being interrupted).
  • Auto re-sync of the data when a failed unit comes back on-line; when the sync is done, the unit goes active again (assuming its normal state is active).
  • Multiple machines mounting the same partition in read/write mode (some kind of network file system).
  • Linux CentOS will be used on the cluster nodes.

What follows is a (partial) description of what I've tried and why it's failed to live up to the requirements:

For the most part, I got all the technology listed working as advertised. The problems mostly come down to one thing: when a server fails, any solution that relies on Heartbeat or round-robin DNS will hang the network file system for 3 seconds or more. While this is not a problem for services like HTTP, FTP, and SSH (which is what Heartbeat is designed for), it poses a big problem for a file system. If a web page takes 3 extra seconds to load, you may not even notice, or you hit your reload button and it works; a file system, however, is nowhere near that tolerant of delays, and lots of things break when a mount point is unavailable for 3 seconds.

So, with DRBD in active/passive mode and Heartbeat, I set up a "Highly Available NFS Server" following these instructions.
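
To give a sense of what that setup involves, here is a minimal sketch of the DRBD resource and the Heartbeat v1 haresources entry; the host names, backing partition, IP addresses, and mount point are placeholders, not my exact configuration:

# /etc/drbd.conf (excerpt)
resource r0 {
  protocol C;                      # synchronous replication between the two data servers
  on server1 {
    device    /dev/drbd0;
    disk      /dev/sda3;           # backing partition on the RAID5 set
    address   192.168.1.1:7788;
    meta-disk internal;
  }
  on server2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.1.2:7788;
    meta-disk internal;
  }
}

# /etc/ha.d/haresources -- server1 is the preferred primary
server1 IPaddr::192.168.1.100 drbddisk::r0 Filesystem::/dev/drbd0::/data::ext3 nfs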

With Linux's NFS implementation, the NFS mount will hang and require manual intervention to recover, even after server #2 takes over the virtual IP. Sometimes you have to reboot the client machines to recover the NFS mounts. This means it is not "highly available" anymore.

With iSCSI, I could not find a good way to replicate the LUNs to multiple machines. As I understand the technology, it is a 1-to-1 relationship, not 1-to-many. And again, it would rely on Heartbeat or round-robin DNS for redundancy, which would hang if one of the servers went down.

Moving on to GFS, I found that almost all of the fencing solutions available for GFS make it unusable in practice. If you have a power-switch fence, your client machines will be cold rebooted when the real problem might be a flaky network cable or an overloaded network switch. Ask any experienced sysadmin: cold rebooting a server is very dangerous. A simple example: if your boot loader was damaged at some point after the machine came up, it could run for years without a problem, but the moment you reboot, the boot loader will bite you. If you have a managed-switch fence, you could lose all communications with a server just because the network file system was down, even though many servers need network communication for other things that do not rely on the network file system. Again, in my opinion that is a very high risk for something that could, and should, be solved another way. The one solution I do like is the AoE fencing method: it simply removes the MAC of your unreachable client from the "allowable MACs" list on the AoE server. This should not affect anything on the client machine except the network file system.

I did get a DRBD, AoE, Heartbeat, GFS, and AoE-fence combination working, but again, when a server goes down there is a hang of at least 3 seconds on the network file system.

Finally, there is GlusterFS. This seemed like the ideal solution, and I am still working with the GlusterFS community to get it to work. The two problems with this solution are:

  • When a server goes down, there is still a delay/timeout during which the mount point is unavailable.
  • When a failed server comes back up it has no way to "re-sync" with the working server.

The reason for the second item is that this is a client-side replication solution: each client is responsible for writing its files to each server, so the servers are basically unaware of each other. The advantage of client-side replication is scalability; according to the GlusterFS community, it can scale to hundreds of servers and petabytes of storage.

A final note on all of this: I run one simple test on every network file system I try, because I know it gives consistent results. What I have found is that any network file system will be, at best, 10-25% the speed of a local file system (also note this is over 1Gb copper Ethernet, not Fibre Channel). When running the following (it creates a 1GB test file):

dd of=/network/mount/point/test.out if=/dev/zero bs=1024 count=1M

I get about 11MB/s on NFS, 25MB/s on GlusterFS, and 170+MB/s on the local file system. So you have to be OK with the performance hit, which is acceptable for what I am doing.
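
One caveat on the test itself: a plain dd like the one above is timed against the page cache as well as the network. If you want to be sure you are measuring the network file system rather than local RAM, a variant along these lines (conv=fsync is a standard GNU dd option that flushes the output file before dd reports its timing) gives more honest numbers:

dd of=/network/mount/point/test.out if=/dev/zero bs=1M count=1024 conv=fsync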

P.S. I just learned about Lustre, now under the Sun/Oracle brand. I will be testing it out soon.

Chad Columbus is a freelance IT consultant and application developer. Feel free to contact him if you have a need for his services.
______________________

Comments

I have tested it with iSCSI and 3 sec is more than acceptable

Mario Giammarco's picture

Hello,
I am building an HA SAN too. I am using iSCSI over DRBD, and I am trying both active/active (using iSCSI multipath) and active/passive (failover on the IP). The Linux Fibre Channel and iSCSI stack is very strong and tolerates several seconds of downtime.

I can:

- type "ls -R" at the shell;
- detach the cable of my Fibre Channel PCI adapter, wait 10 seconds, and reattach it;
- see that ls starts showing files again.

So I suggest you export all your storage as an iSCSI target.

The real problems are in fencing/STONITH/quorum, in my opinion.

Update

Chad Columbus's picture

OK, I am about to embark on another test. I have bought 3 more servers and will be trying out ZFS on OpenSolaris to see if I can get this to work.

I will keep you updated.

clustrous jewels from the deep.....

leebert's picture

I don't know if this expands the complexity or gives you the exact performance profile you need (fast recovery from a downed node) but have you looked at Samba ctdb? Samba-CTDB enables Samba to provide the anycasting frontend as well as arbitration of failover across the cluster.

To get both the RAID 0 speed & the RAID 1 reliability (RAID 0+1, 1+0?) perhaps a stacked ceph(btrfs) with a gnbd/drbd pairing? Or would it be better as a RAID5 setup?

Perhaps that's a lot of layers, but btrfs (like Solaris ZFS) is very, very resistant to corruption (it does end-to-end CRCs), is fast, and is being included in many distros now.

The stack:

ceph-btrfs writes to GNBD device
GNBD client node writes to GNBD server node.
GNBD server writes to DRBD-primary.
DRBD begins to write to itself and to DRBD-secondary
(two DRBD primaries?)
(CLVM maps the space)

The layout:

...drbd/ ..| gnbd     |  file    |  clustered | "anycast"                   
...clvm  ..|..........|..system..|..fs        | LAN cluster mux

disknode1--,
............\
disknode2 -- node A ----- btrfs ---+-- ceph ----- samba-ctdb
................................../
disknode1 -- node B ----- btrfs--/
............./
disknode2---/

Node dropout scenario:

1. Before DRBD completes the write to DRBD-secondary (thus, before it returns, since writes are synchronous), the DRBD-primary node loses power.

The GNBD server dies with the power loss.
GNBD client node drops connection to the GNBD server.

2. Heartbeat notices the death of DRBD-primary, switches the DRBD-secondary to DRBD-primary, re-exports /dev/drbd0 via GNBD, and re-creates the virtual IP which the GNBD client was connecting to.

3. The GNBD client writing on behalf of ceph/btrfs reconnects.

Now, what happens to the write originally going to the DRBD volume? Will the GNBD client retry the write? Are there situations where the write could be dropped altogether?

As for iSCSI, it could be exported from Samba directly and might be able to be multipathed?

http://ctdb.samba.org/iscsi.html

http://www.querzone.de/wiki/Wiki.jsp?page=ISCSI#section-ISCSI-ShuttingDo...

Heartbeat makes this solution fail

Chad Columbus's picture

Because of this line:
2. Heartbeat notices the death of DRBD-primary, switches the DRBD-secondary to DRBD-primary, re-exports /dev/drbd0 via GNBD, and re-creates the virtual IP which the GNBD client was connecting to.

You will hang all your clients.
Heartbeat takes 3-5 seconds to notice the failure and switch IPs, which is way too long for NFS, iSCSI, or GNBD to tolerate.

Plus, any running applications would have to wait that long, and let's face it, for real-time applications like Asterisk VoIP that delay would cause everything to break.

drbd dual primaries?

leebert's picture

I see your point. 3 seconds of heartbeat latency is rather long in the context of direct-access I/O clients. Even if Samba and ceph/btrfs didn't hang, GNBD might.

More reading....

-- Error masking: A node with the failed drive can mask errors from its upper I/O layers (gnbd, etc.) & run in a so-called "diskless" state. The docs say drbd will continue until it can gracefully migrate the I/O to solely the remaining peer(s) with impaired performance. It seems to me that drbd's own internal mirroring interconnect is preventing the disk error from percolating upward.

DRBD also has a dual-primary mode where there is no master:slave mirroring. Perhaps there are internal quora preventing split-brain situations from arising? Would this eliminate the dependency on the secondary being brought online via Heartbeat (no failover required)? In that case I don't know whether the loss of one of the paired nodes would cause an actual interruption to the higher I/O layer (GNBD).
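
If I am reading the DRBD 8 documentation right, dual-primary is only a couple of extra directives in the resource definition, roughly like this (the resource name is a placeholder, and you still need a cluster-aware file system such as GFS or OCFS2 on top of it):

resource r0 {
  net {
    allow-two-primaries;        # both nodes may be Primary at the same time
  }
  startup {
    become-primary-on both;     # promote both nodes when DRBD starts
  }
  # ... the usual "on <host> { ... }" sections go here
}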

Three seconds seems rather long in terms of actual stateful I/O requirements. It seems problematic for clusters to be vulnerable to so much heartbeat latency for continuity of service.

It seems to me what's needed in a fast-replication LAN/SAN environment is a higher-resolution heartbeat facility that resolves to the millisecond.

clustrous jewels from the deep.....

leebert's picture

Oh, sorry, forgot the samba ctdb link...

http://www.sambaxp.org/files/SambaXP2009-DATA/Henning_Henkel.pdf

(see page 14)

The Samba team might be the people you want to consult on failover latency in this rig; they seem to have a certain amount of experience kicking the tires. And as another poster commented, NFSv4 does seem to have modern services equivalent to those found in Samba....

StarWind Software offers storage cluster...

Ichiro Arai's picture

...for two nodes, just like DataCore does, but at a much more affordable price, and it is also easier to use. Companies like LeftHand offer multi-node RAID5/6 network clusters, and that technology is way superior to the two-head cluster DataCore has. SANmelody was cool in 2005 and was OK in 2007, but it's 2010 now and you cannot ride the same old pig again and again. Here are some links:

http://www.starwindsoftware.com/starwind

and

http://h18006.www1.hp.com/storage/highlights/lefthandsans.html

Arigato!

-ichiro

Late to the party

Anonymous's picture

I'm trying to find a similar solution and stumbled upon this months-old discussion.

Does anybody think it's possible to achieve the requirements by using Openfiler? I haven't tried it, nor do I know the stuff well enough to think it through, so I'm throwing it out for discussion. Pardon me if it's stupid, but I'm too much of a noob to realize it :)

Essentially, use Openfiler to create iSCSI block devices, which are then used to create MD RAID 1 volumes on the application servers. If either of the data servers fails, the MD RAID would simply switch to the other iSCSI device.
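
Something like the following is what I have in mind on each application server, assuming the two iSCSI exports show up as /dev/sdb and /dev/sdc after login (device names are only examples):

# mirror the two iSCSI-backed disks; either data server can then fail
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext3 /dev/md0
mount /dev/md0 /data

# when a failed data server comes back, re-add its disk and let md resync
mdadm /dev/md0 --add /dev/sdb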

NFSv4

bmrk's picture

NFSv4 supports server replication, path-to-file mapping, and byte-range file locking.

"For better redundancy, NFSv4 supports file system replication and migration on the server side. Using a special file system location attribute, clients can query the server regarding the location of a file system. If the server file system is replicated for load balancing or other such reasons, the client can receive all the locations of the requested file system. Using its own policies, the client is then able to mount and access the appropriate location for the file system it requested. Similarly, if a file system is migrated, upon receiving an error while accessing the old location, the client queries the new location of the file system and makes the necessary change to accommodate the relocation.

A final highlight of NFSv4 is the ability to have the server delegate certain responsibilities to the client in caching situations, which is necessary for providing true data integrity. Previous versions of NFS did not honor UNIX write semantics safely. With NFSv4, a server may provide a client read or write delegation for a certain file. If a client receives a read delegation for a file, then all writes to that file for the length of the delegation are not allowed for any other client. Additionally, if a client receives a write delegation for a file, then no other client may write or read to that file for the length of the delegation. Delegations may be recalled by the server when a client requests conflicting access to a file delegated to another client. In this case, the server notifies the delegated client using a callback path existing between client and server, and recalls the delegation. Delegations allow the client to locally service operations using the NFS cache without immediate interaction with the server, thus reducing server load and network traffic."

from:
http://www.ibm.com/developerworks/linux/library/l-net26.html

Locking, therein lies the problem

Chad Columbus's picture

The NFSv4 locking scheme is a serious problem.
If other clients can't read or write to a file, that is very bad.
For that reason, NFS is not a good solution to this problem.

It does support concurrent

bmrk's picture

It does support concurrent access through client leases (delegations, aka oplocks, just as CIFS does).

http://tools.ietf.org/html/rfc3530#section-1.4.5
http://wiki.linux-nfs.org/wiki/index.php/Cluster_Coherent_NFSv4_and_Dele...

What we really need is good documentation.

Virtualization based fault-tolerance ?

Duke Atreides's picture

Hi chad,

This is a very interesting thread. I'm far from being a Linux or storage expert, but building high-quality products with low resources is something I can relate to easily.

The main problem as I see it is the downtime of the servers and the resync of the data.
A quick take on this might be to have two identical virtual machines as storage servers running on two different physical servers under a hypervisor that supports fault tolerance, interlocking the CPU and I/O of the secondary machine with that of the primary, making them identical to one another.
Xen supports this (either with Marathon software or using extensions like Kemari) and so does VMware with vSphere (do note that FT on vSphere has problems with Storage vMotion, so that might be a problem).

I haven't gotten into the details of all the different buzzwords I've mentioned; a quick Google query will solve that.

Hope I got what you wanted to accomplish right and maybe helped a little,

Best Regards,

Duke.

Openfiler

Anonymous's picture

You can install Openfiler on your data servers (http://www.openfiler.com/), or you can buy the HP LeftHand virtual SAN appliance.

Openfiler is free and simple to install and use. You can easily set up failover and load balancing in Openfiler using the web interface.

Via web interface?

Anonymous's picture

You can't set up an HA cluster in Openfiler's web interface. You have to get your hands dirty in the shell, and it's a pretty drawn out process for someone not familiar with Linux.

various

RainTown's picture

Hi (Chad)

DRBD + Heartbeat works; I'm using that now (on production systems). Take care on the client side: use fsid=X on the server side, make sure to set STATD_HOSTNAME=SHARED_IP (at least on RHEL clones), and symlink /var/lib/nfs into the shared file system. It is also better to use UDP for mounts, for faster failover.
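
To make that concrete, a minimal sketch of what I mean (the export path, client network, and shared IP below are placeholders):

# /etc/exports on both servers -- identical fsid so NFS file handles survive a failover
/data 192.168.1.0/24(rw,sync,no_root_squash,fsid=1)

# /etc/sysconfig/nfs (RHEL/CentOS) -- statd identifies itself by the shared IP
STATD_HOSTNAME=192.168.1.100

# client mount over UDP for faster failover
mount -t nfs -o udp,hard,intr 192.168.1.100:/data /data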

However ...

I don't really get the "in 5 seconds httpd dies" issue you write about; I've never seen that. On slow media, a slow network, or with large files, a single I/O might take many seconds. I know you say this is a red herring, but I smell something fishy :-)

Second, failover is (hopefully) triggered by a failure. Setting the bar at "without ANY delay or hang time" is IMO unrealistic given the money you are not throwing at the problem. Can you describe, in a pseudo-code sort of way, how you think this *might* work? I guess your 2 servers are on physically different network switches, to remove SPOFs there, maybe with LACP or similar, so those switches need to know what IP/MAC is what, as do the clients, and there is some latency in that information propagating.

Lastly, my usual hobby horse on these topics is that complexity is the enemy of maintainability. Make sure the lowest common denominator of your sysadmin team understands your solution as you do - in my experience that means KISS. If not, you'll find your mega-resilient, 99.9999% solution does not survive the first time someone other than you has to work with it.

RT

Another layer?

bradfa's picture

Is it reasonable to add another "layer" to your setup?

For example, you have the two servers full of disks and use a network protocol like iSCSI to export that storage to a diskless server that sees the two exports as a RAID 1. Then this diskless server is what exports the aggregate disk space out to your 40 "client" machines.

It adds another thing to fail (ie: the new "layer" server) but without disks in it, it's all solid state and should fail less (IMO).

In this way, when one of the servers full of disks goes down, you fix it and bring it back up, and then the new "layer" server does the resync just like a normal RAID. You might need to invest in some faster networking; I'm not sure whether a resync would leave you dead in the water over GigE.
Or does this add too much network overhead?

It would simplify the issue, as you'd be using multiple different technologies to solve multiple different problems, one at each layer (rather than trying to solve one problem with multiple technologies).

This may work

Chad Columbus's picture

The problem is you have introduced a single point of failure.
This is not an acceptable risk in a production system.
Do you have an idea on how to make the diskless server redundant?
Otherwise if a $0.50 network cable goes bad, you lose your whole cluster.

Sorry for the noise

bendailey's picture

Is there any other way to "subscribe" to comments than posting a comment?

Yet Another Approach

Alex Armstrong's picture

While reading the comments, two things occurred to me:

A - It might be possible to reduce the dead time in Heartbeat, so the servers check each other's availability at less-than-one-second intervals (deadtime 15 -> deadtime 0.01, in seconds). This would depend both on Heartbeat's abilities and on your cabling infrastructure (Ethernet/serial, etc.). And unless we got down to tenths of a second, it might not make enough difference.

B - On the other end, maybe we can increase the latency tolerance of the client file systems. If we increase the time between actual disk reads/writes, the servers might go down and come back with no one the wiser. One way might be the timeout options in NFS (timeo=n, retrans=n). Another way could be the Linux Virtual File System - having it keep the files around longer, though I don't know how this would be done. This approach means your data might get stale if you have multiple clients updating it.

Unfortunately, I don't have a cluster, so testing is difficult. I offer my thoughts in the hope we can come up with something that works.
Alex

How about Sun Cluster?

Peter Teoh's picture

In the past, the closed-source version was called "Sun Cluster". But OpenSolaris has open-sourced it, and it is now called "Open HA Cluster".

http://hub.opensolaris.org/bin/view/Community+Group+ha-clusters/

Many real world installation scenario here (for example):

http://wikis.sun.com/display/OpenSolaris/Open+HA+Cluster+Summit+May+2009

A different approach

Sid Bartle's picture

Have you looked at DataCore's SANmelody for the storage cluster? I use this software's big brother, SANsymphony, and I would highly recommend it.
Out of the box you get iSCSI/Ethernet support, synchronous mirroring (HA), thin provisioning, snapshots, and async IP mirroring (DR); coupled with virtualised clients (VMware) and multipathed LUNs, this solution has proven itself to be very robust.

At the very least you can get a free 30 day trial and give it a whirl.

http://www.datacore.com/products/prod_SANmelody_buy.asp

Price

Chad Columbus's picture

To get high availability you would need at least a "C" license, and at a cost of about $8K per server I could buy a hardware solution instead.

Pricing from here (from 2006):
http://www.tomshardware.com/reviews/iscsi-open,1217-7.html

I am waiting for an e-mail back from the company with current pricing.

Reply on DataCore pricing

Anonymous's picture

That pricing is over 4 years out of date!
They now have a range of offerings with prices starting at much less.

Can't use them

Chad Columbus's picture

Well, they don't publish their pricing and they have not responded to my e-mails, so I guess I can't evaluate them due to customer service reasons.

Oh, and by the way, it requires Microsoft Windows Server!

Right tool for the job

Sid Bartle's picture

Chad, if you're averse to the OS, then no, this product will not be for you. For those of you who need a cheap solution that works, check it out for yourself. Pricing is about $2K per server (0.5TB). I have tried to replace the functionality this product provides (keep in mind we use FC), and any geo-clustering product was up around the "you've got to be joking" price mark.

We currently use 2 SANsymphony boxes to provide ~40TB of geo-clustered (active/active) storage servicing about 60 physical production boxes (Sun, Linux, ESX, Windows) and 200+ VMs. Don't the UNIX admins hate it when they find out all their precious Oracle DBs are being I/O-serviced by a Windows box. Me, I don't care as long as the solution performs.

Coupled with VMware this product is amazing: I have ESX clusters (6-8 hosts per cluster) with half the cluster at each site, and I have instant site recovery without ever setting up SRM.

Just remember the point of this thread is that after 2+ years you could not find a stable solution using Open source.

OK, let flaming start...:D

Windows

Chad Columbus's picture

I am not opposed to Windows. I use Windows for my desktop.
I have never been able to get a Windows server to be stable or secure, though. I am not saying Windows cannot be those things; I am saying I am not a Windows admin. Plus, as far as I know, the Windows Server license also carries a fee, which I think is around $3,000, so it is now $5,000 per box for the two licenses. So we are at $10,000 for 2 servers.

Could IP Anycast be of use here?

Michael Van Wesenbeeck's picture

IP Anycast implies that multiple machines can have/be the same IP address.
I haven't seen it mentioned in this discussion, maybe because it's not useful in this case (it doesn't help in resyncing a server that was down for example).
Anyway, I find this discussion really interesting, so I am posting here to get the follow-ups.

Good luck finding something that works. I hope to hear about it here when you do :)

Regards,

Michael

Can you tell me more about IP Anycast?

Chad Columbus's picture

I have been thinking about this, and in my mind the solution is for every node to request every file from every server; whichever server answers first wins. This would increase network overhead, but it would eliminate the 3-5 second delay of a downed server. So if IP anycast can be used to accomplish this, I think we could have something.

more about IP anycast.

Michael Van Wesenbeeck's picture

Well, it's mentioned in the November 2009 issue of Linux Journal.
You can read the article on the website:
http://www.linuxjournal.com/magazine/ipv4-anycast-linux-and-quagga
They cover IGP anycast in the article, as that is usually confined to a single network (the DNS root servers use BGP anycast). I think that fits your situation, as only the slave servers need to talk to the anycast IP that the data servers "share".

FTA:
What if that server fails? If the host fails, it will stop sending out routing advertisements. The routing protocol will notice and remove that route. Traffic then will flow along the next best path. Now, the fact that the host is up does not necessarily mean that the service is up. For that, you need some sort of service monitoring in place and the capability to remove a host from the anycast scheme on the fly.

Since the gist of your problem (that's still unsolved) is how to fail over near-instantaneously, it all depends on how fast you can detect a host as failing and how fast you can remove a failed or failing host from the anycast scheme.

I've heard of the network file systems you mentioned, but I don't have any first-hand experience with them. How can you detect whether the file system on one of the data servers is available or not? How does Heartbeat detect that?
If one of the data servers goes down, I don't know how long it then takes for the router to notice and remove the route to that host (so I don't know if my suggestion here makes any sense).
It appears we need to break the 3-5 second delay down into pieces. What takes the longest: noticing the service is unavailable, or handing over the IP? If it's handing over the IP, then IP anycast can help, I guess.
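
To give an idea of what that article describes, the basic recipe is to put the shared service IP on the loopback of each data server and let an IGP advertise it; with Quagga it looks roughly like this (the addresses and OSPF area are made up for the example):

# on each data server: the anycast address lives on the loopback
ip addr add 10.10.10.10/32 dev lo

# /etc/quagga/ospfd.conf (excerpt) -- advertise the anycast /32 and the LAN
router ospf
 network 10.10.10.10/32 area 0.0.0.0
 network 192.168.1.0/24 area 0.0.0.0

# a monitor script withdraws the route when the storage service dies, e.g.:
ip addr del 10.10.10.10/32 dev lo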

Regards,

Michael

3-5 second breakdown

Chad Columbus's picture

More or less, the breakdown is that Heartbeat checks every X seconds (1 in my case) to see if the servers are up. Then, if Y consecutive failures (2 in my case) are detected, it takes that server out of service. The surviving server takes 1-3 seconds to take over the IP(s). That is what makes up the 3-5 seconds.
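
For reference, these are the Heartbeat timing knobs involved; the values below just illustrate how my check interval and failure threshold map onto ha.cf settings, they are not a recommendation:

# /etc/ha.d/ha.cf (excerpt)
keepalive 1        # send a heartbeat every second (my X)
deadtime  2        # declare the peer dead after 2 seconds of silence (X * Y)
warntime  1        # log a warning after 1 missed second
initdead  30       # allow extra time at boot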

3-5 second fall over delays

Anonymous's picture

I don't know Heartbeat, but I suggest adding a smarter, faster switch between the router and the servers. Let the switch software do the failover instead of the servers' Heartbeat or the router.

Alternatively, you might add additional NICs to each of the servers, i.e., the primary and secondary servers both have 2 NICs, each serving your IPs. That's more failure points, but if it reduces failover delay, would it be worth it for real-time applications?

service check

Michael Van Wesenbeeck's picture

Anycast would give a situation where you have 2 masters, so the IP-takeover seconds are eliminated (not that I've tested any of this, just guessing).
What remains then is cutting down on the time spent noticing that a server/service is down. It's been a while since I've played with Heartbeat, but you might have to replace it with something faster. Do you currently use Heartbeat to monitor whether the host is up, or whether the file system is available?
I don't know enough about the file systems mentioned to know if you can create a monitoring script that hooks into them.
You could test connectivity loss using iptables and see what time window you have before services using the file system break.
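
For example, something along these lines on a client, with the data server's address as a placeholder:

# simulate losing the data server for 5 seconds, then restore connectivity
iptables -A OUTPUT -d 192.168.1.1 -j DROP
sleep 5
iptables -D OUTPUT -d 192.168.1.1 -j DROP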

Cluster

Paweł Brodacki's picture

"The first application that I know can't handle the 3-5 delay is httpd.
That by itself is a deal breaker for me. When there is a delay of that length, I have to restart httpd (once the delay is over) on all the machines. This means that an outage is not really 3-5 seconds, it means it is as long as it takes to notice httpd is down, restart httpd on all nodes (about 40), and for httpd to start-up and begin serving pages."

Hm... Looks like you might solve that using Red Hat Cluster (https://www.redhat.com/docs/manuals/csgfs/) -- it works on CentOS, of course. Have the httpd daemon be a monitored service and have it restarted if the service monitor notices that the service is not up. Monitoring could be fairly easy too -- just pull the page off the HTTP server and compare the MD5 sum to a reference one. Your httpds will still die, but they will get resurrected automatically. Not a perfect solution, maybe, but a step closer to one.
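
The monitoring script itself could be as small as something like this (the URL and reference checksum are placeholders; a real cluster resource agent would be more careful):

#!/bin/sh
# restart httpd if the test page is missing or has changed
REFERENCE=d41d8cd98f00b204e9800998ecf8427e   # md5sum of the known-good page
CURRENT=$(curl -s http://localhost/health.html | md5sum | awk '{print $1}')
if [ "$CURRENT" != "$REFERENCE" ]; then
    service httpd restart
fi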

Please See my "red herring" post

Chad Columbus's picture

Please See my "red herring" post

Storage Cluster challenge

Anonymous's picture

Am I too far off to suggest that the problem is not the file systems used but a network or router configuration issue? The problem is more about the lag time for a redundant server to respond than about the server's file system. Sorry, I think this is a routing problem rather than a server problem, and you might need to upgrade your routers and/or switches to improve response time.

You should be able to configure at least 2 of your servers as mirror images of each other, both responding on your IP addresses, with one primary server and the other fully mirroring transactions but immediately taking over should the primary fail. Again, IMHO, this is a router or switch configuration issue, and fixing it should greatly reduce the lag time.

That is not how heartbeat works

Chad Columbus's picture

Heartbeat is the software that controls the IP.
It monitors the 2 servers, and if one goes down it switches all traffic to the other. This is done by testing the machines every X seconds (in my case 1), but to avoid false positives and flapping, Heartbeat requires 2 or more failures. So it takes 2 seconds of downtime for Heartbeat to trigger the changeover and about 1 second for the second server to take over the IP. That is where the delay comes from.

If you know of a way to eliminate Heartbeat and use something else, let me know.

Openfiler

Mister A's picture

Can't openfiler do the trick?

Not to my knowledge

Chad Columbus's picture

What configurations have you tried with openfiler that you think would work for this solution?

nexenta

Mister A's picture

Perhaps Nexenta is another candidate?
It uses the OpenSolaris kernel with an Ubuntu user space, if I'm not mistaken.

openfiler

Mister A's picture

Well, I have only used Openfiler for a standalone storage box, but you can build a high-availability storage cluster with it (at least, that is what they advertise on their website).

A bit of a red herring

Chad Columbus's picture

Ok everyone, first off thank you all for your comments and feedback.
I am exploring several new options.

I want to address all the "How to restart HTTP" questions/suggestions.
I have HTTP monitoring and automatic restart in place.
I was just using HTTP as an example, and it appears that many of you have tried to help me solve an HTTP problem. (That is the red herring)

Here is my new example: Asterisk.
When you are on a phone call with the server acting as the VoIP provider and the server suddenly loses its network file system for 3 seconds, your call just got dropped. Now imagine you are on a very important call, or a conference call with 10 other people who also just got dropped.

The point is that the solution I am looking for needs to simultaneously talk to both servers so that if one fails the other answers and there is zero delay/downtime.

I hope that clears up part of what I am trying to do.

Custom system

MikeFM's picture

I couldn't find a system that fit my needs, so I created something with FUSE. It is pretty simple. Each system has a list of peer file systems to mirror (I use NFS, but it doesn't matter) and a file system it should mirror them to. When it wants to read a block, it checks to see if a newer version exists and, if so, updates the local copy and reads it. Otherwise it just pulls the local copy. If it can't access the remote file system, it just pulls from the local copy. When it writes, it just writes and updates a write log. Every so often it checks its peers' write logs and syncs all changes. If it goes down and restarts, it syncs changes.

For the most part my needs don't include the same files being written by multiple servers at the same time so it works out pretty well. If I needed that I'd probably have to implement locking of some kind.

I like it because there is no single system whose failure can take it down, and it's easy to configure. I give it a large ramdisk to use as cache too, so it doesn't have to access the physical drive very often.

I wish HDD manufacturers would sell drives that could accept standard RAM modules for caching. I'd stick a few gigs in each drive to make things go fast. Or maybe even a RAID controller with that kind of cache.

Restarting httpd?

MyNameIsDanny's picture

Chad,

Why not use a small program that launches httpd and, if it ever exits, restarts it? That would be MUCH quicker than noticing that it had died. We wrote a small program that does just that. It has options for how many times to restart, whether it should restart if the program received a signal, etc. Would that help in your case?
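
A bare-bones shell version of the same idea might look like this (assuming an httpd build that stays in the foreground when given -DFOREGROUND; adjust for your build):

#!/bin/sh
# keep httpd running; restart it the moment it exits for any reason
while true; do
    /usr/sbin/httpd -DFOREGROUND    # -DFOREGROUND keeps Apache from daemonizing (check your build)
    echo "httpd exited with status $?, restarting" >&2
    sleep 1
done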

Danny

See my "red herring post"

Chad Columbus's picture

See my "red herring post"

system and target

voyance's picture

The next step is to mount the exported iSCSI device using the iSCSI initiator. Test it one node at a time: log in with the initiator, create a file system and a test file, unmount the file system, log out of the target, and then mount it on the second node.
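
With open-iscsi, that node-by-node test looks roughly like this (the portal address and target IQN are placeholders for whatever your target reports):

# discover and log in to the target
iscsiadm -m discovery -t sendtargets -p 192.168.1.1
iscsiadm -m node -T iqn.2010-01.com.example:storage.disk1 -p 192.168.1.1 --login

# the new disk appears as e.g. /dev/sdb; create a file system and a test file
mkfs.ext3 /dev/sdb
mount /dev/sdb /mnt
touch /mnt/testfile
umount /mnt

# log out before repeating the same steps on the second node
iscsiadm -m node -T iqn.2010-01.com.example:storage.disk1 -p 192.168.1.1 --logout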

Maybe I missunderstand

sledge's picture

What about exporting the file systems and then mirroring them on the client? It would be as if you had RAID 1, and a single failure would be ignored. I don't have a cookbook recipe for you, just the thought.

What tech?

Chad Columbus's picture

Exporting a file system how? Mirroring them on the client how? How would a failure be ignored? In RAID 1, if a drive fails it is not ignored. Please expand on your thoughts; I am open to any feasible solution.

Export via iSCSI, for

sledge's picture

Export via iSCSI, for example. Mirror using software (LVM). RAID 1 syncs things; if one "drive" dies, the client stays up. Maybe I am completely off base here, but if you have something on the client that looks like a block device and it is mirrored, the loss of one member of the mirror degrades fault tolerance but not necessarily performance.
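
Roughly like this on a client, assuming the two exported LUNs show up as /dev/sdb and /dev/sdc after the iSCSI login (device names and sizes are only examples):

# put both iSCSI disks into one volume group and carve out a mirrored LV
pvcreate /dev/sdb /dev/sdc
vgcreate vg_mirror /dev/sdb /dev/sdc
lvcreate -m 1 --mirrorlog core -L 500G -n lv_data vg_mirror   # in-memory mirror log, so two PVs are enough
mkfs.ext3 /dev/vg_mirror/lv_data
mount /dev/vg_mirror/lv_data /data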

Single node

Chad Columbus's picture

If this were a single node, you would be right: it would be possible to export a drive off each server and mirror them on the client.
The problem is that there are 40 nodes, and without a network file system that provides file locking across multiple clients and servers, there will be tons of issues.

NFS client settings?

Anonymous's picture

Hi,

It's an interesting question you ask here, but I think we need more information on how the clients are configured. I strongly suspect that NFS can be made to work here, especially with the HAProxy stuff. But it comes down to how your clients mount the file system(s) from the cluster.

Are you mounting via autofs? Entries in /etc/fstab? What are those entries? Are you mounting hard or soft? If you mount with the 'soft' option, the clients should just wait while the 3-5 second failover happens. This should show up as a pause on the client side, but not actually kill them off or hang them.
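
For comparison, an fstab entry along these lines is where I would start (the server name, export path, and timeout values are only examples):

# /etc/fstab -- soft mount with short retries, so a failover shows up as a pause
# (or, at worst, an I/O error) rather than a permanent hang
nfsserver:/data  /data  nfs  soft,intr,timeo=10,retrans=3,udp  0 0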

Can you post logs of what the client dmesg shows when a hang happens?

Good luck! It's interesting, even if only for wanting to do something like this on my home systems.

cheers,
John
