Apache 2.0

Reuven discusses the significance of the 2.0 release for web developers, administrators and the Open Source community.

As I write this, Apache 2.0 has been out in stable form for nearly a month, and from everything I can tell, it's ready for prime time. While there are other open-source HTTP servers, Apache is the best known and best supported. Apache is used on 60% of the web sites in the world, comes with virtually every Linux distribution and is even part of several commercial application servers. Both Zope and Jakarta-Tomcat have their own built-in HTTP servers, but almost no one exposes these servers directly to the Web. Rather, they use Apache as a front end because of its combination of performance and flexibility. This month, we take a closer look at Apache 2.0 [see also “Apache 2.0: the Internals of the New, Improved 'A PatCHy'”, available at www.linuxjournal.com/article/4559].

Architecture

If you are familiar with Apache 1.x, then very few things in Apache 2.0 will surprise you. For starters, Apache continues to be highly modularized, allowing you to include only those modules that you deem necessary in your server. But whereas Apache 1.3 had a core module that included the basic HTTP implementation, Apache 2.0 has delegated even more supported protocols to modules. This has a number of advantages, including the fact that we can now add protocols to Apache (and remove them) as necessary. In other words, Apache has now become a general-purpose internet server, rather than just an HTTP server. How many projects will take advantage of this functionality remains to be seen.

Apache was never meant to be the fastest server on the planet. Rather, it was designed to be extensible via a system of modules. Each module provides a different piece of functionality; administrators interested in squeezing the last ounce of power from their systems don't have to include irrelevant modules. For example, if we know that our server will never run any CGI programs, then we can easily remove mod_cgi, gaining some CPU cycles and memory in the process.

Apache 2.0 continues in the long-standing Apache tradition of handling each HTTP transaction in a number of named phases. A module may examine or modify the transaction during any one of these phases by attaching its own handler to the appropriate hook. For example, mod_speling (which corrects capitalization and spelling mistakes in URLs—the name is purposely misspelled) attaches its handler to the “fixup” phase hook, executing immediately before the server generates a response.

In Apache 1.x, only one handler could fire for a given hook. In Apache 2.0, each handler not only registers itself for a given hook, but indicates when it would like to execute relative to other modules; mod_speling, for example, registers its handler to run last (APR_HOOK_LAST). If another module were to register a handler with the fixup hook, that handler would execute before mod_speling's. The fact that multiple handlers can fire for a given hook opens a world of possibilities that were previously too difficult to achieve.
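The ordering idea can be sketched in a few lines of Python. This is a toy model and not Apache's C API: the Hook class, the two handler functions and the request dictionary are all hypothetical, and the numeric ordering values are assumptions chosen only so that "last" sorts after "middle".

```python
# Toy model of ordered hook registration; not Apache's real C API.
# The ordering values are illustrative assumptions.
HOOK_FIRST, HOOK_MIDDLE, HOOK_LAST = 0, 10, 20


class Hook:
    """A named phase to which several modules can attach handlers."""

    def __init__(self, name):
        self.name = name
        self._handlers = []  # list of (order, sequence, function)

    def register(self, func, order=HOOK_MIDDLE):
        # The sequence number keeps registration order stable among
        # handlers that ask for the same position.
        self._handlers.append((order, len(self._handlers), func))

    def run(self, request):
        # Call every registered handler, lowest order value first.
        for _, _, func in sorted(self._handlers, key=lambda h: h[:2]):
            func(request)
        return request


def fix_spelling(request):
    # Hypothetical stand-in for mod_speling: repair one known typo.
    request["uri"] = request["uri"].replace("/indx.html", "/index.html")
    request["trace"].append("speling")


def lowercase_uri(request):
    # Hypothetical second module attached to the same hook.
    request["uri"] = request["uri"].lower()
    request["trace"].append("lowercase")


fixup = Hook("fixup")
fixup.register(fix_spelling, HOOK_LAST)     # asks to run last
fixup.register(lowercase_uri, HOOK_MIDDLE)  # registered later, runs earlier

result = fixup.run({"uri": "/Docs/INDX.html", "trace": []})
print(result["uri"], result["trace"])
```

Although fix_spelling is registered first, it runs second because it asked for the last position, so it sees the URI only after the other handler has already lowercased it.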

On a similar note, Apache now makes it possible for one module to filter, or modify, the output of another module. This is currently possible with mod_backhand, but that module depends on a number of tricks and dark corners in the Apache API. Apache 2.0 is designed to allow modules to act as input or output filters. This means that if you want to add a standard set of headers or footers to your HTML pages, you can now do this across the board, including for dynamically generated pages created by CGI programs, server-side includes and mod_perl handlers.
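As a rough illustration of output filtering, here is a minimal Python sketch. It does not use Apache's filter API; cgi_handler, add_header, add_footer and run_with_filters are hypothetical names, and the point is only that the filters never need to know how the content was generated.

```python
# Toy model of an output filter chain; not Apache's real filter API.
def cgi_handler():
    # Hypothetical content generator standing in for a CGI program.
    return "<p>Hello from a dynamic page</p>"


def add_header(body):
    # Output filter: put a standard header in front of the content.
    return "<div>Site header</div>\n" + body


def add_footer(body):
    # Output filter: append a standard footer to the content.
    return body + "\n<div>Site footer</div>"


def run_with_filters(handler, filters):
    """Generate content, then pass it through each filter in turn."""
    body = handler()
    for output_filter in filters:
        body = output_filter(body)
    return body


page = run_with_filters(cgi_handler, [add_header, add_footer])
print(page)
```

Swapping cgi_handler for any other function that returns a page leaves the filters untouched, which is what makes site-wide headers and footers practical.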

The Apache configuration system now uses GNU autoconf rather than the Apache-specific system that was in use for versions 1.x. In addition, many of the C-language abstractions (such as hash tables and strings) that were included in previous versions of Apache have been gathered into a library named the Apache Portable Runtime (APR). The APR is included with Apache and is configured and compiled into the server automatically when you build it.

Finally, Apache now comes with mod_ssl, which provides SSL and TLS encryption. Not only did Apache 1.x fail to come with such a module, but the two available options (Apache-SSL and mod_ssl) were incompatible with each other and required patching the Apache source code before installation. The fact that mod_ssl will now be a standard part of every Apache installation is a huge relief for web site administrators.

MPMs

UNIX systems have long had the ability to run multiple processes simultaneously. I typically run Emacs, a GNOME terminal and Galeon on my Linux box; while a casual glance might only reveal these three processes, there are actually dozens more (sendmail, gnome-panel, Apache, syslogd and the like) that are executing without my direct knowledge. For a complete list of what is running on my computer, I can use the command ps aux.

The good news is that the process model is simple to understand, ensures stability on the system and is portable across many operating systems. Unfortunately, processes are relatively heavy and slow to create. Linux users are especially spoiled on this front because creating a new process on Linux is a surprisingly lightweight operation. But even on Linux, spawning a new process can sometimes be a bit extreme.

For this reason, an alternative model of threads has grown over the years. Using threads, a single process can be executing in multiple places at the same time. Threads offer many of the benefits of processes without the overhead. But there is a cost: programming with threads can be extremely tricky because it's always possible that a particular piece of code is executing in two different threads. You can always write (or rewrite) code to be threadsafe, but this is often a difficult task.
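To make the thread-safety cost concrete, here is a small Python sketch of shared data touched by several threads at once. HitCounter and serve are hypothetical names; the lock is what keeps two threads from interleaving the read-modify-write of the counter.

```python
import threading


class HitCounter:
    """A counter that several threads can update safely."""

    def __init__(self):
        self.hits = 0
        self._lock = threading.Lock()

    def record(self):
        # Incrementing is a read-modify-write sequence; holding the lock
        # ensures only one thread performs it at a time.
        with self._lock:
            self.hits += 1


def serve(counter, requests):
    # Hypothetical worker: pretend to handle a batch of requests.
    for _ in range(requests):
        counter.record()


counter = HitCounter()
threads = [
    threading.Thread(target=serve, args=(counter, 10_000)) for _ in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter.hits)
```

Without the lock, updates from different threads may be lost, and the final total is no longer guaranteed to equal the number of requests handled.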

Because threads were difficult to handle, and because Apache was originally designed to work only on UNIX machines, Apache 1.x worked exclusively at the process level: if you wanted to handle ten simultaneous HTTP requests, then you needed ten Apache processes running. Because it takes time to create a new process, Apache 1.x took an idea from NCSA HTTPd, preforking processes before they are actually needed. This means that Apache can be a bit slow to start up, but that handling incoming connections does not take much time. Apache also allows administrators to indicate how many “spare servers” should always exist, adding and removing Apache processes as necessary.
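The spare-server bookkeeping can be sketched as a single decision function. This is a deliberately simplified model, not Apache's actual algorithm (which adjusts the pool gradually); adjust_pool and its parameter names are hypothetical, with min_spare and max_spare standing in for the administrator's configured limits.

```python
def adjust_pool(idle, min_spare, max_spare):
    """Return how many processes to start (positive) or stop (negative).

    idle is the number of preforked processes currently waiting for a
    connection; the goal is to keep it between min_spare and max_spare.
    """
    if idle < min_spare:
        return min_spare - idle  # too few spares: fork more
    if idle > max_spare:
        return max_spare - idle  # too many spares: retire the excess
    return 0                     # within bounds: leave the pool alone


print(adjust_pool(idle=2, min_spare=5, max_spare=10))
print(adjust_pool(idle=14, min_spare=5, max_spare=10))
print(adjust_pool(idle=7, min_spare=5, max_spare=10))
```

A parent process would call something like this periodically, so that a burst of traffic finds processes already waiting rather than paying the cost of creating them on demand.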

Preforked Apache servers are solid, well understood and robust. But on many systems, using processes is inferior to using threads. In particular, Windows uses threads far more than processes, which means that by sticking with processes, Apache was limited in its ability to penetrate the Windows market.

Apache 2.0 solves these problems with MPMs (multiprocessing modules). Each MPM is an Apache module that handles the details of processes and threads. On Windows, OS/2 and BeOS, this means that you can finally run Apache using a threading mechanism that is native to your operating system. On UNIX and Linux systems, you can experiment with a number of different models, choosing one that is appropriate for your needs.

The prefork MPM, which runs in exactly the same way as Apache 1.x did, is the default choice when you install Apache. Two other choices for Linux users are: 1) worker: each process contains a fixed number of threads, but the number of such processes rises and falls according to the number of incoming requests; and 2) perchild: the number of processes remains constant, but the number of threads in each process rises and falls according to the number of incoming requests.
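For the model in which every process holds a fixed number of threads and the number of processes varies, the arithmetic is simple enough to sketch. The function name and the sample figures below are hypothetical, chosen only for illustration; with one thread per process the result reduces to the one-process-per-request behavior of prefork.

```python
import math


def processes_needed(concurrent_requests, threads_per_process):
    """Processes required when every process holds a fixed number of threads."""
    if threads_per_process < 1:
        raise ValueError("threads_per_process must be at least 1")
    # Always keep at least one process alive, even when idle.
    return max(1, math.ceil(concurrent_requests / threads_per_process))


print(processes_needed(10, 1))    # one thread per process: one process per request
print(processes_needed(10, 25))   # a single threaded process is enough
print(processes_needed(120, 25))  # more load: more processes, same threads each
```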

It's too early to tell, but I expect that more MPMs will emerge over time, and that there will be numerous modules that take advantage of threads to pool database connections, share application data and spawn asynchronous tasks in the background.
