Writing Modules for mod_perl
CGI programs are a common, time-tested way to add functionality to a web site. When a user's request is meant for a CGI program, the web server fires up a separate process and invokes the program. Anything sent to the STDOUT file descriptor is sent to the user's browser, and anything sent to STDERR is filed in the web server's error log.
While CGI has been a useful standard for web programming, it leaves much to be desired. In particular, the fact that each invocation of a CGI program requires its own process turns out to be a large performance bottleneck. It also means that if you use a language like Perl where the code is compiled upon invocation, your code will be compiled each time it is invoked.
One way to avoid this sort of problem is by writing your own web server software. Such a project is a significant undertaking, though. While the first web server I used consisted of 20 lines of Perl, most servers must now handle a great many standards and error conditions, in addition to simple requests for documents.
Apache, a highly configurable open-source HTTP server, makes it possible to extend its functionality by writing modules. Indeed, modern versions of Apache depend on modules for most functionality, not just a few add-ons. When you compile and install Apache for your computer system, you can choose which modules you wish to install.
One of these modules is mod_perl, which places an entire Perl binary inside your web server. This allows you to modify Apache's behavior using Perl, rather than C.
Even if you plan to use approximately the same code with mod_perl as you would with CGI, it is useful to know that mod_perl has some built-in smarts that cache compiled Perl code. This gives an extra speed boost, on top of the efficiency gained by avoiding the creation of a child process in which to run the CGI program.
Over the last year, this column has looked at some of the most popular ways of using mod_perl, namely the Apache::Registry and HTML::Embperl modules. The former allows you to run almost all CGI programs untouched, while taking advantage of the various speed advantages built into mod_perl. HTML::Embperl is a template system that allows us to combine HTML and Perl in a single file.
Both Apache::Registry and HTML::Embperl allow programmers to take advantage of much of mod_perl's power and speed. However, using these modules keeps us from working directly with Apache's guts and from turning it into a program that handles our specific needs better than the generic Apache server.
This month, we will look at how to write modules for mod_perl. As you will see, writing such modules is more complicated than writing CGI programs. However, it is not significantly more complicated and can give you tremendous flexibility and power.
Keep in mind that while CGI programs can be used, often without modification, on a variety of web servers, mod_perl works only with the Apache server. This means that modules written for mod_perl will work on other Apache servers, which constitute more than half of the web servers in the world, but not on other types of servers, be they free or proprietary.
If portability across different servers is a major goal in your organization, think twice before using mod_perl. But if you expect to use Apache for the foreseeable future, I strongly suggest looking into mod_perl. Your programs will run faster and more efficiently, and you will be able to create applications that would be difficult or impossible with CGI alone.
CGI programmers have a limited view of HTTP, the hypertext transfer protocol used for nearly all web communication. Normally, a server receiving a request from an HTTP client (most often a web browser) translates the incoming URL into the local file system, checks to see if the file exists and returns a response code along with the file's contents or an error message, as appropriate. CGI programs are invoked only halfway through this process, after the translation has taken place, the file has been found and a new process fired off.
mod_perl, by contrast, allows you to examine and modify each part of the HTTP transaction, from the client's initial contact through the logging of the transaction on the server's file system. Each HTTP server divides an HTTP transaction into a series of stages; Apache has more than a dozen such stages.
The code responsible for each stage is known as a “handler” and is given the opportunity to act on that stage of the HTTP transaction. For example, a TransHandler translates URLs into files on the file system, a LogHandler takes care of logging events to the access and error logs, and a TypeHandler checks and returns the MIME type associated with each document. Additional handlers are called when important events, such as startup, shutdown and restart, occur.
Each of these Apache handlers has a mod_perl counterpart, known by the collective name of “Perl*Handlers”. As you can guess from this nickname, each Perl*Handler begins with the word “Perl” and ends with the word “Handler”.
A generic Perl*Handler, known simply as PerlHandler, is also available and is quite similar to CGI programs. If you want to receive a request, perform some calculations and return a result, use PerlHandler. Indeed, most applications that are visible to the end user can be done with PerlHandler. The other Perl*Handlers are more appropriate for changing Apache's behavior from a Perl module, such as when you want to add a new type of access log, alter the authorization mechanism, or add some code at startup or shutdown.
I realize the distinction between Perl*Handlers (meaning all of the possible handlers available to Perl programmers) and PerlHandlers (meaning modules that take advantage of Apache's generic “handler”) can be confusing. Truth be told, confusing the two isn't that big a deal, since the majority of programs are written for PerlHandler and not for any of the other Perl*Handlers.
As I mentioned above, mod_perl compiles Perl code once, caches the result, then runs that compiled code during subsequent invocations. This means that, in contrast to CGI programs, changes made in our program will not be reflected immediately on the server. Rather, we must tell Apache to reload our program in some way. The easiest way to do this is to send a HUP signal (killall -1 -v httpd on my Linux box), but there are other ways as well. Another method is to use the Apache::StatINC module, which keeps track of modules' modification dates, loading new versions as necessary.
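For the Apache::StatINC approach, the configuration might look like the following sketch. It assumes Apache::StatINC is installed alongside mod_perl, and it hooks the module in as a PerlInitHandler (one of the Perl*Handlers) so that modification dates are checked at the start of each request.

```apache
# httpd.conf: reload changed Perl modules without restarting Apache
PerlInitHandler Apache::StatINC
```

Because this adds file checks to every request, it is probably better suited to a development machine than to a busy production server.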
As we know, CGI programs are stand-alone programs that are invoked from an outside process, namely the web server. PerlHandler modules are actually subroutines within the Apache process; Apache invokes our subroutine when a certain set of conditions is fulfilled.
Writing a PerlHandler module is not much different from writing any Perl module. (If you are unfamiliar with writing Perl modules, see the “perlmod” man pages, or any of the books available on the subject.) We create a module with a single subroutine defined, called “handler”, shown in Listing 1. This code has several elements common to many PerlHandler modules.
First of all, the entire module contains a single subroutine, “handler”. We can define additional subroutines if we want, but usually it is easiest to use the established standard and default.
Next, notice the handler is invoked with a single argument, which we call $r. It is an instance of the Apache object, which gives us access to the innards of the Apache web server. $r is our conduit to the outside world of the HTTP server and the user's browser. We invoke certain methods to determine the state of the server and browser and other methods to send output to the user's browser. Without $r we are somewhat lost, so it is natural that our first action upon entering “handler” is to retrieve $r.
We also use the -w and use strict programming aids in our program. While these are normally good ideas for good, clean Perl programs, they are essential when developing under mod_perl. As we will see later, mod_perl's caching and persistence mean we need to be extra careful with our use of memory, in order to keep our HTTP server process as slim as possible.
Our handler uses only three methods from $r: content_type, send_http_header and print.
The first method, content_type, allows us to set or retrieve the “Content-type” header that will precede the response. Every HTTP response must be described with such a header, which tells the browser whether the response is an HTML-formatted text file, a GIF image or a zip file.
Once we have set the “Content-type” header to an appropriate value, we send all of the headers to the user's browser with the send_http_header method. Past this point, anything sent to the user's browser will be considered part of the HTTP response body, rather than the headers that describe that body.
The third method, print, is analogous to the built-in “print” function. However, it takes into consideration several factors that “print” might not, such as timeouts. $r->print takes a list of arguments just as the “print” function does. Thus, you can use
$r->print("a", "b", "c");
and expect three characters to be sent to the user's browser.
Once we have finished writing the response, we exit from our module by returning the OK symbol to the caller. We import OK from Apache::Constants, a module that provides us with a large number of useful symbols. In order not to pollute our name space too much, we explicitly request that only “OK” be imported with no other symbols.
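Putting the pieces described so far together, a handler module along these lines might look like the sketch below. It is a reconstruction from the description in the text, not a copy of Listing 1. The package name Apache::TestModule and the “testing” message are taken from later in this article, while the exact HTML is an assumption. It runs only inside Apache with mod_perl, since Apache::Constants and the request object $r come from there.

```perl
package Apache::TestModule;

use strict;

# Import only the OK symbol, to keep our name space clean
use Apache::Constants qw(OK);

sub handler {
    # The request object is our conduit to the server and the browser
    my $r = shift;

    # Describe the response, then send all of the headers
    $r->content_type('text/html');
    $r->send_http_header;

    # Everything printed from here on is part of the response body
    $r->print("<HTML><BODY>testing</BODY></HTML>\n");

    # Tell Apache that we handled the request
    return OK;
}

1;
```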
If we were writing a more complicated module, we might use one of the export tags such as :common and :response, which allow us to import a group of symbols without having to name them explicitly. Thus, we could use the statement:
use Apache::Constants qw(:response);
which would import all symbols needed for a response.
Most PerlHandler modules will want their “handler” subroutines to return one of two symbols: either OK, which indicates that the handler successfully dealt with the request and no other PerlHandler needs to do anything, or the DECLINED symbol. If your module's “handler” routine returns DECLINED, it means “I was unable to do anything with the input I was given and would be happy if some other PerlHandler would do something.” Often, returning DECLINED means the default Apache behavior will be applied; if our PerlHandler were to return DECLINED, Apache would try to read the file named in the URL and do something with it. By returning OK, we indicate that our module took care of the request, so Apache need not look for anything else to generate the response.
Now that we have seen how easy it is to write a PerlHandler module, let's look at how to install this module on our web server. We do this in the configuration file, typically named httpd.conf. If your copy of Apache uses three .conf files, understand that the division between them is artificial and based on the server's history, rather than any real need for three files. Apache developers recognized this increasingly artificial division and recently decided that future versions of the server will have a single file, httpd.conf, rather than three.
Apache configuration files depend on directives, which are variable assignments in disguise. That is, the statement
ServerName lerner.co.il
sets the “ServerName” variable to the value “lerner.co.il”.
If you want a directive to affect a subset of the files or directories on the server, you can use a “section”. For instance, if we say:
<Directory /usr/local/apache/share/cgi-bin>
    AllowOverride None
    Options ExecCGI
</Directory>
then the AllowOverride and Options directives apply only to the directory /usr/local/apache/share/cgi-bin. In this way, we can apply different directives to different files.
“Directory” sections allow us to modify the behavior of particular files and directories. We can also use “Location” sections to modify the behavior of URLs not connected to directories. Location sections work in the same way as Directory sections, except that Location takes its argument relative to URLs, while Directory takes its argument relative to the server's file system.
For example, we could rewrite the above Directory section as the following Location section:
<Location /cgi-bin>
    AllowOverride None
    Options ExecCGI
</Location>
Of course, this assumes that URLs beginning with /cgi-bin point to /usr/local/apache/share/cgi-bin on the server file system.
All this background is necessary to understand how we will install our PerlHandler module. After all, our PerlHandler will influence the way in which one or more URLs will be affected. If we (unwisely) want our PerlHandler module to affect all the files in /cgi-bin, then we use
<Location /cgi-bin>
    SetHandler perl-script
    PerlHandler Apache::TestModule
</Location>
This tells Apache we will be handling all URLs under /cgi-bin with a Perl handler. We then tell Apache which PerlHandler to use, naming Apache::TestModule. If we did not install Apache::TestModule in the appropriate place on the server file system, or if the package is not named correctly, this will cause an error.
The above example is unwise for a number of reasons, including the fact that it masks all the CGI programs on our server. Let's try a slightly more useful Location section:
<Location /hello>
    SetHandler perl-script
    PerlHandler Apache::TestModule
</Location>
The above Location section means that every time someone requests the URL “/hello” from our server, Apache will run the “handler” routine in Apache::TestModule. Because we used a Location section, we need not worry whether /hello corresponds to a directory on our server's file system.
This is how mod_perl creates a status monitor:
<Location /perl-status>
    SetHandler perl-script
    PerlHandler Apache::Status
</Location>
Each time someone requests the /perl-status URL from our server, the Apache::Status module is invoked. This module, which comes with mod_perl, provides us with status information about our mod_perl subsystem. Again, because we use a Location section, we need not worry about whether /perl-status corresponds to a directory on disk. In this way, we can create applications that exist independent of the file system.
Once we have created this Location section in httpd.conf, we must restart Apache. We can send it a HUP signal with
killall -HUP -v httpd
or we can even restart Apache altogether, with the program apachectl that comes with modern versions of the server:
apachectl restart
Either way, our PerlHandler should be active once Apache restarts.
We can test to see if things work by going to the URL /hello. On my home machine, I pointed my browser to http://localhost/hello and received the “testing” message soon after. If you don't see this message, check the Apache error log on your system. If there was a syntax error in the module, you will need to modify the module and restart the server as described above.
The first time you invoke a PerlHandler module, it may take some time for Apache to respond. This is because the first time a PerlHandler is invoked on a given Apache process, the Perl system must be invoked and the module loaded. You can avoid this problem to a certain degree with the PerlModule directive, described later in this article.
The subroutine we just created might seem trivial, but it demonstrates the fact that we can easily modify the behavior of our web server simply by writing a Perl subroutine. Moreover, since subroutines can contain just about any sort of Perl code, we have at our disposal all of the Perl modules, operators, functions and regular expressions that would be available to a stand-alone program.
Indeed, our “handler” routine is simply an entry point to what can be a large, complex program with other subroutines. Since Perl*Handler modules have access to Apache at every stage of operation, we can modify anything using Perl. A growing library of modules that do many common tasks is available, so that you can spend time on the particulars of your problem, rather than reinventing the wheel.
Let's write another PerlHandler module, but this time let's have it do something other than return its own output. Just for fun, we will have it turn headlines in a file into Pig Latin. (In Pig Latin, the first letter of each word is moved to the end of the word, and “ay” is tacked on to the end.)
We will call our PerlHandler module Apache::PigLatin, which means we will create a module named PigLatin.pm and put it into the Apache module subdirectory. The source code is shown in Listing 2.
We install our module with a Directory section in httpd.conf:
<Directory /usr/local/apache/share/htdocs/stuff>
    SetHandler perl-script
    PerlHandler Apache::PigLatin
</Directory>
Make sure the directive points to an actual directory in your Apache document tree.
The module introduces several new ideas, but nothing revolutionary. For starters, we import the constants OK, DECLINED and NOT_FOUND. As we indicated earlier, we will use OK to indicate that our PerlHandler did something, and DECLINED to indicate that Apache should apply some other behavior. We will use DECLINED to ensure our PerlHandler works only on HTML-formatted text, by checking $r->content_type. If the MIME type is “text/html”, we will operate on the file. If it is anything else, such as a JPEG image, we will refrain from translating it into Pig Latin, returning DECLINED.
Next, we attempt to open the file from $r->filename. This particular module is being used as a simple PerlHandler, so we can be sure the translation from URL to a file name on the file system has been performed. This translation takes place in the TransHandler stage, which we can modify by writing a PerlTransHandler, rather than a simple PerlHandler. While it has translated the URL into a file name on our system, Apache has not checked to see if the file exists—that is our job. If we cannot open the file, we will assume it does not exist, returning the symbol NOT_FOUND.
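The checks described so far could be sketched as the skeleton below. This is an illustration based on the description, not a copy of Listing 2: the variable names and the slurping idiom are assumptions, and the body is sent back unchanged where the Pig Latin substitution would go. Like any PerlHandler, it runs only inside Apache with mod_perl.

```perl
package Apache::PigLatin;

use strict;
use Apache::Constants qw(OK DECLINED NOT_FOUND);

sub handler {
    my $r = shift;

    # Leave anything that is not HTML to Apache's default behavior
    my $type = $r->content_type;
    return DECLINED unless defined $type and $type eq 'text/html';

    # The URL has already been translated into a file name, but
    # nobody has checked that the file exists; that is our job
    my $filename = $r->filename;
    open(my $fh, '<', $filename) or return NOT_FOUND;

    # Read the whole file into a single scalar
    my $contents = do { local $/; <$fh> };
    close $fh;

    # (The headline substitution would be applied to $contents here)

    $r->send_http_header;
    $r->print($contents);

    return OK;
}

1;
```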
Now things get interesting: we grab the contents of the file and perform a substitution on headlines—that is, anything between <H\d> and </H\d>, where \d is a built-in character class matching any digit.
We use .*? to match all characters rather than a simple .*, so as to turn off the “greedy” feature in Perl's regular expressions. If we were to say .* rather than .*?, we would match all characters between the first <H\d> and the final </H\d>, rather than between the first pair, the second pair, and so forth. Greediness is usually a good thing when working with regular expressions, but can be frustrating under these circumstances.
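The difference is easy to see in a small stand-alone program. The sample HTML string here is made up for the demonstration and needs nothing from Apache or mod_perl.

```perl
use strict;
use warnings;

# Two headlines with a paragraph between them (made-up sample data)
my $html = "<H1>first</H1><P>text</P><H2>second</H2>";

# Greedy: .* runs all the way to the LAST closing headline tag
my ($greedy) = $html =~ m{<H\d>(.*)</H\d>}is;

# Non-greedy: .*? stops at the FIRST closing headline tag
my ($lazy) = $html =~ m{<H\d>(.*?)</H\d>}is;

print "greedy: $greedy\n";   # greedy: first</H1><P>text</P><H2>second
print "lazy:   $lazy\n";     # lazy:   first
```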
We use four options in our substitution: /e evaluates the replacement as Perl code, /i ignores case, /g applies the substitution globally and /s lets the . regexp character match \n. This allows us to perform the substitution in one fell swoop, as well as catch any headlines that might begin on one line and continue on the next one.
Inside the substitution we invoke pl_sent, which is a subroutine defined within our module. This subroutine is not invoked directly from mod_perl, but is there to assist our “handler” routine in doing its work.
What's more, pl_sent invokes another subroutine, piglatin_word, which translates words into Pig Latin. If we were interested in creating a large web application based on mod_perl, you can see how it would be possible to do so, creating a number of subroutines and accessing them from within “handler”. C programmers might think of “handler” as the mod_perl equivalent of “main”, the subroutine invoked by default. Once in that routine, you can do just about anything you wish.
The pl_sent routine is interesting if you have never stacked split, map and join before. We split $sentence into its constituent words across \s+, which represents one or more whitespace characters. We then operate on each element of the resulting list with map, running piglatin_word on each word. Finally, we piece together the sentence in the end, using join to add a single space between each word. The result is returned to the calling s/// operator, which inserts the translated text in between the headline tags.
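The following stand-alone sketch shows how the substitution and the two helper routines might fit together. The subroutine names pl_sent and piglatin_word come from the text; their bodies, the guard for words that do not start with a letter, and the sample input are assumptions rather than the contents of Listing 2.

```perl
use strict;
use warnings;

# Move the first letter of a word to the end, then tack on "ay"
sub piglatin_word {
    my $word = shift;
    # Assumed guard: leave anything not starting with a letter alone
    return $word unless $word =~ /^[a-z]/i;
    $word =~ s/^(.)(.*)$/$2$1ay/;
    return $word;
}

# Stack split, map and join to translate a whole sentence
sub pl_sent {
    my $sentence = shift;
    return join ' ', map { piglatin_word($_) } split /\s+/, $sentence;
}

# Made-up sample document: one headline and one paragraph
my $contents = "<H1>Hello world</H1>\n<P>Body text</P>\n";

# Translate only the text between headline tags (/e, /g, /i and /s)
$contents =~ s{(<H\d>)(.*?)(</H\d>)}{
    my ($open, $text, $close) = ($1, $2, $3);
    $open . pl_sent($text) . $close
}egis;

print $contents;   # the headline becomes <H1>elloHay orldway</H1>
```

Copying $1, $2 and $3 into lexical variables before calling pl_sent is a defensive choice, since the helper routines run regular expressions of their own.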
It is a much tougher problem to handle paragraphs, partly because people often forget to surround paragraphs with <P> and </P>, relying on the fact that browsers will forgive them if they simply say <P>. In addition, paragraphs contain punctuation which makes a good Pig Latin translator harder to write.
There is no limit to the kind of filters you can write. Perhaps the most interesting and advanced are those that use Perl's eval operator to evaluate little pieces of Perl code inside HTML files. A number of these already exist, such as Embperl (discussed several months ago) and EPerl. More simply, you can ensure that every file on your system has a uniform header and footer, removing the need for server-side includes at the top and bottom of each file.
mod_perl is an exciting development that has already made a great many new applications possible. But there is a trade-off for everything, and mod_perl's additional functionality comes at the expense of greater memory usage. It is hard to calculate the additional memory needed for mod_perl, but keep in mind that Perl can be a bit of a memory hog.
In addition, while lexical (“my” or “temporary”) variables disappear after each invocation of a Perl module via mod_perl, global variables stick around across invocations. This can be an attractive way to keep track of state in your program, but it can also lead to larger memory allocations.
For example, if your module creates an array with 10,000 elements, that array will continue to consume memory even after the invocation has finished. This might be useful in some cases, such as when a complicated data structure is referenced in each invocation. However, it also means the large structure will constantly eat up memory, as opposed to only when necessary.
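We can imitate this behavior outside Apache by calling the same subroutine several times within one Perl process, which is roughly what happens to a “handler” routine inside a single Apache child. The variable names below are invented for the illustration.

```perl
use strict;
use warnings;

# Package global: created once, survives from one call to the next
our $hits = 0;

sub handler {
    # Lexical: a fresh copy each time the subroutine is entered
    my $count = 0;

    $hits++;
    $count++;

    return "global=$hits lexical=$count";
}

# Simulate three requests served by the same process
my @results = map { handler() } 1 .. 3;
print "$_\n" for @results;
# global=1 lexical=1
# global=2 lexical=1
# global=3 lexical=1
```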
You can reduce memory usage by forcing mod_perl to share memory among Apache child processes. When you run Apache as a web server, it “preforks” a number of processes so that incoming connections will not have to wait for a new server process to be created. Each of these preforked servers is considered a separate process by Linux, operating independently. However, Apache is smart enough to share some memory among server siblings, at least to a certain degree.
mod_perl takes advantage of this shared memory by allowing the various server processes to share Perl code as well. However, there is a catch: you must make sure the Perl code is brought into mod_perl before preforking takes place. Perl modules and code compiled after the split occurs will raise the memory requirement for each individual server process, without regard to whether the same code has been loaded by another process.
In order to load code before Apache forks off child processes, use the PerlModule directive in the configuration files.
If, for example, you use the statement
PerlModule Apache::DBI
in one of the *.conf files, then say
use Apache::DBI;
in a PerlHandler module, the latter invocation does not actually load any new code. Rather, it uses the cached, shared version of Apache::DBI that was loaded at startup by mod_perl.
You can load multiple modules with PerlModule, using the syntax
PerlModule Apache::DBI Apache::DBII Apache::DBIII
However, you can load only ten modules this way. If you want to load more, you can use the PerlRequire directive. Strictly speaking, PerlRequire allows you to specify the name of a Perl program to be evaluated only when Apache starts up. For example,
PerlRequire /usr/local/apache/conf/startup.pl
will evaluate the contents of startup.pl before forking off Apache child processes. However, if you include a number of “use” statements in startup.pl, you can effectively get around PerlModule's ten-module limit.
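A startup.pl along these lines might look like the sketch below. The choice of modules is only an example, using names mentioned elsewhere in this article, and the file is meant to be loaded by Apache rather than run by hand.

```perl
# startup.pl: evaluated once by the parent Apache process,
# before it forks off any children
use strict;

# Modules named here are compiled once and shared by the children.
# The empty lists keep them from importing symbols into this file.
use Apache::Registry ();
use Apache::DBI ();

# Like any file that is loaded with "require", return a true value
1;
```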
Remember that PerlModule or PerlRequire is necessary for modules to be shared among the different Apache sibling server processes, but it is not sufficient. You will still have to import the module in your own program in order to reap the benefits.
When I first started to work with mod_perl, I thought it was useful for speeding up CGI programs and for running filters like Embperl. As I have grown more dependent on it in my own work, I am amazed and impressed by the power mod_perl offers programmers looking to harness the power of Apache without the overhead of external programs or the development time associated with C.
As you can see, writing mod_perl modules is not difficult and is limited only by your imagination. It does require that you think a bit more carefully about your programs than when you are working with CGI, since you can affect the Apache server in ways that will slow it down or otherwise hurt your system's performance.