Apache 2.0: The Internals of the New, Improved
The Apache Project is a collaborative software development effort aimed at creating a robust, commercial-grade and freely available source code implementation of an HTTP web server. The project is jointly managed by a group of volunteers located around the world, using the Internet and the Web to communicate, plan and develop the server and its related documentation. These volunteers are known as the Apache Group. In addition, hundreds of users have contributed ideas, code and documentation to the project.
According to the Netcraft web servers survey, Apache has been the most popular web server on the Internet since April 1996. This comes as no surprise due to its many characteristics, such as the ability to run on various platforms, its reliability, robustness, configurability and the fact that it is free and well-documented. Apache has many advantages over other web servers, such as providing full source code and an unrestrictive license. It is also full of features. For example, it is compliant with HTTP/1.1 and extensible with third-party modules, and it provides its own APIs to allow module writing. Other interesting features that have made it a popular web server include the capability to tailor specific responses to different errors, its support for virtual hosts, URL rewriting and aliasing, content negotiation and its support for configurable, reliable piped logs that allows users to generate logs in a format they want.
Apache 1.3 has been a well-performing web server, but it suffers a few drawbacks, such as its scalability on some platforms. For instance, according to Martin Pool, AIX processes are heavyweight, and a small AIX machine serving a few hundred concurrent connections can become heavily loaded. In such situations, using processes is not the most effective solution and a threaded web server is needed.
Furthermore, with the evolution of the requirements imposed on web servers, new functionalities like higher reliability, higher security and further performance are required. In response, web servers must evolve to satisfy these demands. Apache is no exception, and it continues its drive to become a more robust and a faster web server with its new 2.0 version (see sidebar).
Apache is renowned for its portability because it works on several platforms. However, having the same base code of Apache portable on so many platforms comes with a high price, which is the ease of maintenance. The Apache server has reached a point where porting it to additional platforms is becoming more complex. Therefore, in order to give Apache the flexibility it needs to survive in the future on more platforms, this problem had to be addressed and resolved. As a result, Apache will be able to use specialized APIs, where they are available, to provide improved performance, making it easy to port to new platforms.
Apache was intended initially to work on standard UNIX systems. However, its support for other platforms grew and the number of platforms supported affected the simplicity of the source code. One effect is that the code makes extensive use of conditional compilation to cope with platform peculiarities. Writing to a standard POSIX API is also undesirable on some platforms that provide substandard implementations or faster paths.
To solve these problems, Ryan Bloom is leading efforts to develop a solution, a layer called the Apache Portable Runtime (APR). The APR presents a standard programming interface for server applications and covers tasks such as file I/O, logging, mutual exclusion, shared memory and managing child processes and asynchronous I/O. APR shields the application from incompatibilities in the implementation of the standard, and thus it will use the most efficient way to achieve each function on each supported particular platform.
Another component that helps to resolve portability problems is Ralph Engelschall's MM library, which hides the details of setting up shared memory areas between processes and provides an interface similar to malloc to manipulate them.
The MM library is a two-layer abstraction library that simplifies the usage of shared memory between forked processes under UNIX platforms. On the first (lower) layer, it hides all platform-dependent implementation details (allocation and locking). When dealing with shared memory segments and on the second (higher) layer, it provides a high-level malloc(3)-style API for a convenient and well-known way to work with data-structures inside those shared memory segments.
The traditional Apache structure is based on a single parent process and a group of reusable children (see Figure 1). The parent reads the configuration and manages the pool of children. Each child at any time is either serving a single request or sleeping. Apache 1.x automatically regulates the size of the pool of children so that there are enough to cope with spikes in load without using too many resources to maintain idle processes. Busy children serve one request at a time on a single socket.
Figure 1. Traditional Apache Structure
Some web sites are heavily loaded and receive thousands of requests per minute or even per second. Traditionally TCP/IP servers fork a new child to handle incoming requests from clients. However, in the situation of a busy web site, the overhead of forking a huge number of children will simply suffocate the server. As a consequence, Apache uses a different technique. It forks a fixed number of children right from the beginning. The children service incoming requests independently, using different address spaces. Apache can dynamically control the number of children it forks based on current load.
This design has worked well and proved to be both reliable and efficient; one of its best features is that the server can survive the death of children and is also reliable. It is also more efficient than the canonical UNIX model of forking a new child for every request.
This traditional Apache design works well up to quite high loads on modern UNIX systems. On Linux in particular, context switches and forking new processes are cheap, and accordingly this simple design is nearly optimal. One drawback, however, of the isolation between processes is that they cannot easily share data, and consequently sharing session data across the server takes a little work.
Another approach is to serve each request in a separate thread: this is the model used by most NT-based web servers. Although this approach eliminates most of the protection between tasks, it allows the module programmer more flexibility and it can be faster on systems where threads are cheaper than processes, such as Windows NT and AIX.
Apache 2.0 introduces MPMs (multiple-processing modules) that hide the process model from most of the code. At runtime, Apache can be configured to use threads, processes, a hybrid of both or some other model. Modules can register new process models to suit their operating systems or the applications. One proposed example is to fork processes that run as different users to give increased security on machines that offer virtual hosts to multiple customers.