
Efficiently Updating Web Sites on Clusters

Using the page-flipping technique as inspiration for solving the problem of updating a web site on a cluster.

Management said the requirements for the web site infrastructure were simple. We could expect thousands of visitors to our site at any time over the next few months, as the marketing effort took hold. At any given time we would need to support up to a few hundred concurrent downloads of our desktop product demo. A noticeable performance drop in our web site was not acceptable.

At the time, our web site ran on a 2.2.x Linux distribution on a dual-processor Dell 2450. Its performance was rock solid, but we were uneasy relying on a single machine for our entire business. We determined that the answer was to replace the single server with a cluster. For less than the cost of the Dell, we built a cluster of four 1U one-processor machines. The cluster's performance was excellent. By stripping down Apache, we could support at least 400 downloads over HTTP and still have a responsive site. This left one problem: we needed to be able to update the site often without affecting performance.

Typical strategies for doing these frequent updates were not satisfactory. Either the site would be down for more than a few moments when the update occurred, or the site would be in an inconsistent state during the update. Worst of all, the site could be left in an inconsistent state if the update failed part-way through the process. To overcome these drawbacks I applied a little cross-discipline creativity. By applying the page-flipping technique from the graphics world, I was able to achieve a quick and non-intrusive method of updating the clustered web site.

The Setup

The clustered web site consists of a director machine and a number of worker machines. The specifics of clustering are beyond this article, but loosely the director machine accepts the request from the client browser and then routes the request to a worker machine. The worker machine then responds directly to the client browser. The multiple worker machines provide scalability and a high level of reliability.

The cluster resides off site at a colocation facility, and a staging server is kept on site. This staging server contains the entire functioning web site for testing. Once the changes are acceptable, the site can be mirrored out to the worker machines in the cluster.

I used a number of techniques to ease the administration of the cluster. The worker machines in the cluster are all accessible from the web account on the staging server using a no-passphrase key pair over SSH. Various scripts were written using sudo so the web user can stop, start and reload Apache and Tomcat. Thus a single user logged into the web account on the staging server could control all the worker machines in the cluster with a small number of command-line tools.

The update-web script is part of this collection of tools and is used to update the web sites in the cluster. With this script, a user can update the entire cluster with a single command:

$ update-web acme.com

where acme.com is the name of the site being updated. The names of the worker machines are kept in a common file used by all the scripts.
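
The fan-out behind a command like this can be sketched roughly as follows. This is not the article's update-web script, only a minimal illustration of running one command on every worker named in a shared file. The names WORKERS_FILE, REMOTE_SHELL and run_on_workers, and the default file location, are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch, not the article's script.
# WORKERS_FILE: assumed path of the shared list of worker host names,
#               one per line; blank lines and '#' comments are skipped.
# REMOTE_SHELL: command used to reach a worker (ssh with a key pair here).
WORKERS_FILE="${WORKERS_FILE:-$HOME/etc/workers}"
REMOTE_SHELL="${REMOTE_SHELL:-ssh}"

# run_on_workers CMD [ARG...]: run CMD on every listed worker in turn.
# Returns non-zero if the command failed on at least one worker.
run_on_workers() {
    failed=0
    while read -r host; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # stdin comes from /dev/null so the remote command cannot
        # consume the remainder of the worker list.
        if ! $REMOTE_SHELL "$host" "$@" < /dev/null; then
            echo "FAILED: $host" >&2
            failed=1
        fi
    done < "$WORKERS_FILE"
    return "$failed"
}
```

With this in place, something like `run_on_workers uptime` would report on each worker in turn. Continuing past a failed host and reporting it at the end is a design choice for the sketch; a real update tool might prefer to stop at the first failure.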

Double Buffering, Page Flipping and Web Site Configuration

Double buffering and page flipping are similar concepts. The idea is that you draw an image or widgets to an area of memory that is not currently being displayed to the user. Once you are finished drawing, the complete image is displayed to the user all at once. With double buffering, the in-memory image is copied to the display area in a single bit-block transfer (blit). With page flipping, the area of memory written to is a special location in video memory, hidden from the user. When the drawing is done, the video hardware effectively changes a pointer and starts using the new in-memory image as the active display memory.

For our situation I chose to apply the page-flipping technique. We did this by creating two copies of the web site on each worker machine. A symbolic link then acts as the pointer that can be flipped. The directory structure is as follows:

  • /websites/acme.com.1/: a copy of the web site

  • /websites/acme.com.2/: another copy of the web site

  • /websites/acme.com/: a symbolic link to the active copy of the web site, e.g., acme.com -> acme.com.1

The httpd.conf files for this site refer only to the symbolic link path. Therefore, we can quickly change which copy of the site is being used simply by changing the symbolic link.
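
One way to change the link without a window in which it is missing is to create the new link under a temporary name and rename it over the old one. The sketch below is an assumption about how this could be done, not the article's own code. The function name flip_link is hypothetical, and `mv -T` is a GNU coreutils option, so this form assumes a GNU userland.

```shell
#!/bin/sh
# Hypothetical sketch, not the article's script.
# flip_link LINK TARGET: repoint the symbolic link LINK at TARGET.
flip_link() {
    link=$1
    target=$2
    tmp="$link.tmp.$$"
    # Create the replacement link next to the old one first.
    ln -s "$target" "$tmp" || return 1
    # mv -T (GNU coreutils) renames the new link over the old one
    # instead of moving it into the directory the old link points to,
    # so LINK always names either the old copy or the new copy.
    mv -T "$tmp" "$link" || { rm -f "$tmp"; return 1; }
}
```

For the layout above, calling `flip_link` on the acme.com link with the target `acme.com.2` would make the second copy live. The simpler `ln -sfn` also works, but it removes the old link before creating the new one, which leaves a brief gap.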

To assist in this process some shell scripts are placed in the ~/bin directory of the web account on each worker machine. They are:

  • getActiveSitePath, a script that returns the full path to the active copy of the given web site (see Example 1).

  • getInactiveSitepath, a script that returns the full path to the inactive copy of the given web site (see Example 2).

  • websiteflipper, a script that flips the symbolic link from the active path to the inactive path, making the inactive copy the active one (see Example 3).

These three scripts allow us to determine easily the inactive site, update it in whatever manner we choose and then make it the active site.
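
To make the idea concrete, here is a minimal sketch of how the active and inactive copies could be told apart by reading the symbolic link. It is not a reproduction of the article's Examples 1 to 3. The function names, the SITES_DIR variable and its default value are all assumptions, and the copies are assumed to be named SITE.1 and SITE.2 as in the layout above.

```shell
#!/bin/sh
# Hypothetical sketch, not the article's scripts.
# SITES_DIR: assumed directory holding the two copies and the link.
SITES_DIR="${SITES_DIR:-/var/www/sites}"

# active_site_path SITE: print the full path of the copy the link
# currently points at. Fails if the link does not exist.
active_site_path() {
    target=$(readlink "$SITES_DIR/$1") || return 1
    target=${target%/}                      # tolerate a trailing slash
    case "$target" in
        /*) echo "$target" ;;               # absolute link target
        *)  echo "$SITES_DIR/$target" ;;    # relative, e.g. acme.com.1
    esac
}

# inactive_site_path SITE: print the full path of the other copy.
inactive_site_path() {
    active=$(active_site_path "$1") || return 1
    case "$active" in
        *.1) echo "$SITES_DIR/$1.2" ;;
        *.2) echo "$SITES_DIR/$1.1" ;;
        *)   echo "unexpected link target: $active" >&2
             return 1 ;;
    esac
}
```

An update would then copy the new site into the path printed by `inactive_site_path`, and only after that copy has finished successfully repoint the link. A transfer that fails part-way therefore never touches the copy that is being served.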

______________________

Comments


Re: Efficiently Updating Web Sites on Clusters

Anonymous's picture
  • Master node participates in the cluster, and contains the binaries and docroot for the entire cluster on local drive space (raid etc.)
  • Master exports this directory to the other cluster members
  • Other cluster members mount up this target and launch from it
  • Changes are centralized, with shutdown/startup only being required when binaries or the web server conf file change

This is fairly simple and works well. Once this functionality is achieved, you can design/buy/implement as much redundancy as you wish. If loads really pick up, weight the balancing so that the master services fewer to no requests. Simple.

Re: Efficiently Updating Web Sites on Clusters

Anonymous's picture

Well, it is simple, but there are a few immediate drawbacks. The master node is a single point of failure. Sure, you can cluster it, have RAID, etc. But all the workers are relying on it and on the network that the disk share is going over.

It's also a single point of attack. Change the site on the master and the whole site is affected. The cluster in the article is a bit harder to corrupt.

Further the disks on the worker machines are just sitting there. The idea of the cluster is to spread out the load. The design in the article keeps that attribute. You'll have an easier/cheaper time scaling if you can just add standard nodes and not have to worry about overloading your network.

I would argue that in production the setup in the article is at least as simple to admin as the master node design. The initial config might be a bit more complex for the design in the article, but this is outweighed by the improved robustness and resource utilization.

all IMHO,

dave

Re: Efficiently Updating Web Sites on Clusters

Anonymous's picture

the cluster described in the article is not harder to corrupt, because the author is managing it with passwordless ssh

plus, the master described in the article does not participate in the cluster, which wastes a machine for handling requests

in both architectures the network must be available, so network issues are a wash in comparison

really, the author is being way too clever in solving the problem

it's a neat hack, but unnecessary

Re: Efficiently Updating Web Sites on Clusters

Anonymous's picture

The foundation laid out in this article is a very important one. It lends itself to a scenario such as having the first interface of each web machine configured to handle web requests and a second interface on each worker machine tied to an internal network for receiving the rsync+ssh updates, again with a passwordless ssh non-root account. Leaving the staging server on just that internal network and restricting access to it by having just specific users WinSCP files into the staging area gives you a physical means of separation for inserting new files and changes into the cluster. Since the staging server isn't handling outside web requests, it can't be attacked in the same way that a worker machine can, and even if it were to go down the cluster still runs, unlike the far more network-intensive and less scalable single-point-of-failure NFS tactic. With NFS you would have to configure a squid proxy to begin to see the benefits of storing content locally with rsync+ssh.

The one thing I would add to the article is the concept of having a pair of staging servers. For staging servers that work as described in this article, whereby you replicate an entire iteration of the site to an inactive location on each worker, you can effectively have two staging servers on active hot standby, so to speak. They really don't do anything when they are idle, so either one can be used to deploy at any given time. Usually only one person would be deploying at any given time, so an additional copy can easily be made to the opposite staging server at that time. However, another variation is where you are just updating one or two files, directly to the active area of the worker machines. In this case, each staging server updates all the workers, as well as the staging server adjacent to itself.

To protect against a staging server coming back online and polluting the cluster with old files, it should attempt to rsync to the opposite staging server at boot time and pull in all files that have a newer timestamp. Once that is done it can proceed to update the workers, at a set interval if desired. Use clocksd+clockspeed to ensure the times of the staging server machines are in sync, and you can keep the workers in sync too for accurate timestamps in the web server logs.

Re: Efficiently Updating Web Sites on Clusters

Anonymous's picture

That works fine as long as the master is up, but a goal of most clustered systems, in addition to load balancing, is to not rely on any single server being available.

Re: Efficiently Updating Web Sites on Clusters

MrChook's picture

The main drawback of this approach is shared by all of the methods outlined here. If you have a slow link, it can take quite a while to transfer the entire site to each worker machine. Even on a fast link, the transfer is wasteful.

Did you consider updating just one of the workers from the staging server then using that worker to update the others?

This would obviously reduce the network bottleneck but maybe it introduces something else I haven't considered.