ATF Jubilee Edition
Welcome to the jubilee installment of At the Forge. This is the 50th column that I have written for Linux Journal (or for SSC's short-lived Websmith magazine) since early 1996. Over the last few years, we have explored a large number of Web-related technologies, techniques and applications, ranging from simple CGI programs to sophisticated database-backed applications written using mod_perl.
This month, I want to spend a bit of time prognosticating, looking into the future of web application development. On the one hand, things have never been more exciting for web application developers; the technology continues to advance at a remarkable rate, making it easier and easier to create sophisticated applications. At the same time, the increasingly crowded field of embedded programming languages, application servers and database adaptors makes it harder to decide which technology is most appropriate.
Because this column describes where I believe web technologies and application development are headed in the coming years, it should also serve as a sort of guideline for what future issues of ATF will contain. You can think of this month's installment as an indication of where my consulting firm is headed professionally, and thus what you can expect me to suggest and describe in the year (or more!) ahead. Since this is Linux Journal and Linux is my company's primary server platform, I will focus here on items that run with Linux and, preferably, those that are free software.
Web application development began soon after the Web itself was formed. Ever since the first dynamically generated content was sent to the first browser—an act which predates the CGI standard, to say nothing of Netscape, Internet Explorer and Apache—programmers have been designing increasingly sophisticated applications for use on the Web.
CGI, or the “common gateway interface”, soon arrived on the scene. CGI got its name because dynamically generated content was originally a means to give a web interface to non-web applications. With the advent of CGI, it was suddenly possible to create portable server-side programs. Most web applications continue to be written using CGI, because of its simplicity and its extreme platform-independent nature, as well as the fact that web space providers can give their clients CGI access without endangering the server's stability.
You can write a CGI program for any web server, in any language, on any operating system, and be virtually guaranteed that it will work. However, CGI has a number of drawbacks. In particular, it requires that the web server spawn a new process for each HTTP request aimed at a CGI program. In other words, a web site that receives 100 hits/minute is spawning more than one new process every second.
By itself, this should not scare you. After all, a basic Linux box should be able to handle the creation of one new process each second, right? However, the size of the new process, as well as the speed with which it starts up, are both important factors.
Perl, my programming language of choice for the last few years, has proven itself as a powerful means for creating CGI programs. The CGI.pm module provides an amazing array of functions that do nearly everything you would ever want from a CGI program (as well as a number of things that I would never consider doing). Moreover, Perl includes a powerful pattern-matching engine, along with modules that handle most popular Internet standards and protocols. The DBI (database interface) module has proven to be an additional boon, making it easy to include the output from an SQL query in a dynamically generated page.
However robust, flexible and secure Perl might be, the CGI standard was never designed for producing a large volume of dynamically generated pages on the fly. Each invocation of a CGI program written in Perl forces the computer to create a new process, load Perl into memory, load your program into memory, compile your program into Perl's internal opcodes and then, finally, interpret it using the Perl run time mechanism. This all takes time and means that CGI programs will not scale well over the long term. Indeed, it does not take a lot of concurrently running CGI programs to bring a typical server to its knees.
At the same time, CGI has been successful because it's so easy to use. With no other API can you write a “hello, world” program as simple as the following:
#!/usr/bin/perl -wT use strict; use CGI; my $query = new CGI; pring $querry->heder("text/html"); print $query->start_html; print "P>Hello, world!</P>\n"; print $query-> end_html;
In the last few years, advanced web development has become a specialty of its own, requiring that programmers learn something about administering systems, networks and databases, while keeping in mind good programming and security practices.
There are three current trends in the world of web development which are beginning to improve things dramatically for users as well as for developers. When the three are used together, they are often called an “application server”.
The first trend is architectural, moving away from single-shot CGI programs and toward programs that are cached within the web server or another environment. The only reason we use CGI programs for creating dynamic content is that the web server itself cannot create our custom HTML files on the fly. We could theoretically write a new module for Apache in C, and compile that into our configuration—but that is far too much work in most cases, and the time savings is not worthwhile.
However, there is a middle ground between putting the custom code inside of the server and leaving it completely outside. What if we put an entire programming language inside of the server, making it possible for us to add new functionality in that language? If the language is interpreted, then we can modify and debug our new functionality without having to recompile or restart the server.
This is the idea behind mod_perl, which embeds a copy of Perl inside of Apache. It gives us a Perl-language interface to Apache's internals, making it possible to access and modify anything having to do with the request object. Everything that a C-language module can do for Apache can also be done inside of mod_perl, from creating custom response handlers to changing the way in which authentication is performed.
In stark contrast with CGI programs, where Perl compiles the program once, executes it once and exits, mod_perl caches a compiled version of the program and then executes that repeatedly. (This can sometimes cause extreme memory growth and requires that programmers be especially careful.)
While mod_perl was once the only embedded language module for Apache, others have come along recently. mod_snake does for Python what mod_perl does for Perl, making it possible to write custom Apache handlers in Python. There is even a mod_tcl, which provides embedded Tcl inside of Apache, although I am not aware of any sites that are using its capabilities.
Another open-source web server, AOLServer, has long contained an embedded Tcl interpreter. Tcl procedures can thus be used to create dynamic output, connect to a relational database and make code conditional—all within the server itself, without having to go to an external CGI program.
If you would prefer to use Python over Tcl, a beta version of PyWX (Python Web Extensions) recently became available. PyWX provides a Python API to all of the Tcl functions that AOLserver normally provides. While this makes PyWX incompatible with most of the Tcl code available for AOLserver, it does make it easier to perform certain functions, given the wealth of Python modules available on the Web.
The second trend involves embedding code inside of HTML. Microsoft's Active Server Pages are perhaps the best example of such a practice, but there are plenty of other ones as well. On Linux, we can choose from a variety of different systems, ranging from Java Server Pages (JSPs), HTML::Mason (which works with mod_perl), PHP and ADP.
I have worked with Java off and on since it was first introduced, and was long ago convinced that it would be nice to spend some time working with the language. Like many other people, I was turned off by the idea of applets, which were slow, insecure and buggy. However, in recent years, server-side Java has become increasingly prevalent. Each Java “servlet” is a class that runs inside of a Java Virtual Machine (JVM). Servlets can accomplish all of the things that we might want to do when generating dynamic content—they can talk to databases with JDBC, they can retrieve and modify HTTP headers and they can produce responses whose content depends on the user's preferences.
JSPs make it easier to work with servlets by assuming that everything is literal HTML except for what is contained within <% and %>. When a JSP is invoked from a web browser, the JSP is compiled on the fly into a Java servlet, which is in turn complied into a Java .class file. This .class file is loaded into the servlet engine, executed and kept around for future invocations. JSPs and servlets can use Java “beans”, objects that can be used to model persistent behavior and to implement the “business logic” that sits in the middle of most modern three-tiered web applications.
mod_perl is a very powerful tool for creating Apache handlers, but it can sometimes force you to work at too low a level. For this reason, a fair number of Perl modules exist that allow you to mix Perl code and HTML in some way or another. HTML::Mason, which I profiled in a series of articles earlier this year, is the system that I prefer because of its simple syntax and the way it allows templates to incorporate one another. While at YAPC::Europe in London this fall, I saw a demonstration of the Template Toolkit, which seems to be similar to HTML::Mason in its philosophy, except that it adds the notion of “plug-ins”.
While Java and Perl are general-purpose programming languages that are well-equipped for server-side web programming, PHP is a language designed expressly for creating dynamic web pages. PHP includes a large number of functions for working with a variety of different kinds of files, databases and Internet standards. Recent versions of PHP even allow you to work with Java objects, and a CORBA adaptor is expected to be released in the near future. At the same time, PHP requires recompilation every time you change the included feature set; there is no notion of dynamically adding or deleting modules from the system. If you install PHP before deciding that you want to work with PDF files, you may find yourself recompiling it simply to add such features.
Users of AOLServer have a similar system at their disposal, known as ADP (“AOLServer Dynamic Pages”. An ADP page allows you to mix Tcl with HTML, where the Tcl can use any of a number of special procedures that are defined within AOLServer. You can thus create an ADP page that retrieves information from a database, interprets the contents of an HTML page returned from another server or simply performs calculations based on a user's HTML form inputs.
The third trend in the world of server-side programming is the issue of persistent database connections. Database servers were originally designed to handle one connection from each user every day, rather than once per minute or once per second. Consider this: if a CGI program connects to a relational database server once per second, you are exercising the connection mechanism more than 86,000 times what it was originally intended. In some cases, this does not mean very much, but there are many databases for which each connection is an expensive operation.
One solution is thus to open a database connection when the server first starts up and reuse that connection each time a program needs to contact the database. This is roughly what the Apache::DBI module does when working with Perl, Apache and mod_perl. Each time you disconnect from a database with $dbh-->disconnect, Apache::DBI silently ignores your request and keeps the connection around for the future. When you call DBI-->connect, Apache::DBI looks at the connection string and tries to reuse an existing connection before starting a new one. Since each Apache process services only one HTTP request at a time, each process thus needs only one database connection. The savings from this connect/disconnect sequence can be substantial. At the same time, it means that every child Apache process needs its own database connection, which can lead to dozens or hundreds of simultaneous connections on a heavily loaded server.
AOLServer cuts down on the number of database connections by using multiple threads rather than multiple processes. Because threads exist within the same process, they can share data. AOLServer takes advantage of this to create a small pool of database connections, choosing a connection at random and handing it to the thread handling an HTTP request as necessary. Database connections are not dedicated to a particular thread and can be shared as necessary, reducing the number of connections that the server must open with a database.
Working with Java servlets and JSPs requires a different model altogether. The Jakarta-Tomcat servlet/JSP implementation normally exists outside of a web server, meaning that they're always on Tomcat process, regardless of how many Apache child processes are on the system. Within that Tomcat process, there may be any number of servlet threads executing concurrently. Normally, servlets and JSPs (and Java beans, which JSPs and servlets can use to provide persistence and/or a high level of abstraction) connect to a database using JDBC. But JDBC does not automatically provide connection pooling; while JDBC 2.0 does provide this capability, it is not completely automatic, and not many JDBC 2.0 drivers exist as of this writing.
Other languages take a different tack. For example, database drivers for PHP allow persistent database connections but require that the programmer ask for them. That is, you can connect to a PostgreSQL database with pg_connect, or you can create a persistent connection to PostgreSQL with pg_pconnect. The onus is placed on the author of a database driver to provide two different access functions, and on the PHP programmer to use the appropriate function for his or her needs.
Of these, I find AOLServer's technique of persistent, pooled connections to be the most elegant, since it works for all languages --although that is almost always going to be Tcl—and scales extremely well. mod_perl's Apache::DBI is a great solution for Perl programs, especially since it means that individual Perl programs and modules do not need to be changed in order to take advantage of the persistent connections. The fact that Apache::DBI only provides persistence, and not pooling, is a direct result of Apache's multiple processes; it is probably safe to assume that Apache 2.0, which will support threads as well as processes, will come closer to AOLServer's model when it is released.
JDBC's pooling is good, particularly after it seemed that everyone was writing their own class for connection pooling. However, it will only work for Java servlets and will not help on a server that requires a pool for multiple services, such as mod_perl and JSP. PHP's system is perhaps the crudest because it provides neither a standard database API, nor a means for database drivers to pool connections automatically, nor a way for programs to take advantage of those connections. However, the persistence does work and can certainly result in a significant speedup.
While I generally dislike the term “application server” for its ambiguity, it is clear that this is the direction in which the Web is moving. No longer will you design applications by writing one or more programs that exist on their own; rather, you will write a program using a set of objects and modules provided by the application server and into which your application fits naturally. In many cases, you can create relatively sophisticated applications with a minimum of work, simply because someone else has done the majority of the work for you.
Of course, this means that we're increasingly seeing operating systems as the underlying layer for an application server, where the latter is the truly important element. Just as a client-side application author must decide whether to write for Windows, UNIX or Macintosh, web application developers must increasingly decide which application server they prefer to use. As with operating systems, it is very difficult to move from one application server to another. This means, unfortunately, that choosing an immature, slow or difficult-to-modify server may be painful in the future. Even application servers that conform to the same standards and use the same language, such as Enhydra and ATG Dynamo, provide different objects and functionality and make it difficult to move from one to the other.
To a free software devotee like myself, this means that open-source application servers are at least as important as open-source operating systems. Luckily, there are a number of open-source application servers available for download from the Internet. They differ radically in their operation and functionality, but I must admit that I have had only a little exposure to each of the following technologies. While I hope to learn more about them in the coming months, I mention them because it is clear that web developers need to learn more about all of them.
Perhaps the best-known application server platform is Zope, which comes with a number of parts and isn't well understood. Zope is an object database, a templating system and even a basic content management system. I have not yet had a chance to play with Zope in a serious way, but the little that I have read and heard about it seems very impressive, particularly if a module is already available for the particular functionality you need.
Another application server that has been getting a lot of publicity is the ArsDigita Content System, written and maintained largely by the ArsDigita consulting company and released under the GNU Public License. One main problem with ACS has been its dependence on Oracle as a database; while Oracle is an excellent database product, it is both expensive and its source is quite closed. A volunteer effort known as OpenACS has been working to solve this problem by porting the ACS software to use PostgreSQL as a database. The software is not quite complete but does include a great deal of functionality and will undoubtedly improve over time.
XML has been a hot topic in the Web community for several years now, but only in the last six to nine months have we begun to see its widespread adoption. XML describes content semantically, completely ignoring the way in which it should be displayed.
Enhydra is a Java-based application server that seems similar to Zope in many ways, except that it works with XML, Java servlets, JSP and Enterprise Java Beans. Enhydra appears to be quite complex, but also provides a large framework on which to create applications.
If you want to work with XML, then you might also want to look at the Cocoon and AxKit projects. Cocoon, which is sponsored by the Apache Software Foundation, is working on a Java-based server for XML data. AxKit provides XML-based content generation using Perl, making it possible to separate programs from content, and content from graphic design, using XML, XSL and XSLT along with Perl.
Finally, I should mention Oracle's latest entry into the world of application servers, its Internet Application Server (IAS). IAS is a module within Apache that works with a Java run time system, Enterprise Java Beans, JSP and JDBC, along with Oracle. As of this writing, the system is largely new and untested. Of course, Oracle does not provide access to its source code. At the same time, IAS runs under Linux and may well be a popular choice among Oracle users and administrators.
Until now, the majority of my consulting work has been with Perl, which I still find to be a powerful language for working with the Web. Indeed, I used to tell people that about 80% of my work was with Perl, with the other 20% a mixture of Java, Python, Tcl and C.
But with the recent explosion in web programming environments, and with the shift to application servers, I (and my staff) have had to change direction somewhat. In many cases, we will prefer to use Perl, especially when coupled with mod_perl and HTML::Mason. However, we are increasingly using Java servlets and JSPs for projects, particularly with the Tomcat servlet/JSP engine and with the PostgreSQL database. Our familiarity with mod_perl is naturally leading us to look at AxKit, while servlets are forcing me to take a serious look at Enhydra.
We have already begun to use ACS for some large jobs, in no small part because of the very large number of working applications—not just underlying tools—that come with it. Moreover, the fact that ACS is free software and works with Linux makes it easy to work with since we can rely on the community to provide functionality, documentation, testing and bug fixes.
In other words, there are lots of technologies out there, many of which have sprung up only within the last year or so. As I complete this 50th ATF column and look toward the future, I see a world of possibilities and opportunities for web developers, particularly those who believe in free software and use Linux. The coming years promise to be exciting and interesting for web developers—and over the coming months and years, I hope to share with you my experiments and experiences in working with such tools, as well as sample pieces of software that can be used with them.