Advanced “New” Labels

by Reuven M. Lerner

Last month, we looked at those pesky “new” labels that webmasters like to put on their sites. The intention is good, pointing us to documents we are unlikely to have seen before. In practice, the “new” labels are artificial, telling us when the document was last published, rather than whether it is actually new to us.

The techniques we explored last month—server-side includes, CGI programs, and templates—were interesting, but inefficient and slow. This month, we will look at ways to speed up our performance by using mod_perl, the Perl module for the Apache server.

What Is mod_perl?

We have discussed mod_perl on previous occasions in this column, but it is worth giving a quick introduction for those of you who may have missed it. The Apache server is built of modules, each of which handles a different part of the software's functionality. One of the advantages of this architecture is that it allows webmasters to customize their copy of Apache, including or excluding modules as necessary. It also means programmers can add functionality to Apache by writing new modules.

One of the most popular modules is mod_perl, which puts a copy of the Perl language inside the Apache server. This provides functionality on a number of levels, including the ability to write configuration directives in Perl (or to make them conditional, depending on whether certain Perl code executes). More significantly, it allows us to write Perl modules that can modify Apache's behavior.

When I say “behavior”, I mean both the behavior users see, displaying documents and responding to HTTP requests, and that which takes place behind the scenes, ranging from the way authentication takes place to the way logging is done.

Each invocation of a CGI program requires a new process, as well as start-up time. By contrast, mod_perl turns your code into a subroutine of the Apache executable. Your code is loaded and compiled once, then saved for future invocations.

When we first think about what happens to an HTTP request when it is submitted to Apache, it seems relatively simple. The request is read by Apache, passed to the correct module or subroutine and returned to the user's browser in an HTTP response. In fact, each request must travel through many (over a dozen) different “handlers” before a response is generated and sent. mod_perl allows us to modify and enhance any or all of these handlers by attaching a Perl module to them. The handler most often modified is called PerlHandler. Other more specific handlers are given other names, such as PerlTransHandler (for URL-to-filename translation) and PerlLogHandler (for log files).

This month, we will look at a number of PerlHandlers that will make it possible to create truly useful “new” labels for our web sites.

A Simple Version

The first PerlHandler we will define is rather simple: it puts a “new” label next to any link on a page. This is not a particularly difficult task or a good use of mod_perl. However, it does gently ease us into writing a Perl module for mod_perl, and it will form the basis for future versions we will write.

Our module begins much the same as any other module, declaring its own name space (Apache::TagNew, in this case), then importing several symbolic constants from the Apache::Constants package. The module defines a single subroutine, called “handler”. This is the conventional way to define a handler under mod_perl; that is, create a module with a “handler” subroutine, then tell Apache to use that module as a handler for a particular directory.

We instruct Apache to invoke our handler in the configuration file httpd.conf. For example, my copy of httpd.conf says the following:

PerlModule Apache::TagNew
<Directory /usr/local/apache/share/htdocs/tag>
SetHandler perl-script
PerlHandler Apache::TagNew
</Directory>

The PerlModule directive tells Apache to load the Apache::TagNew module. The <Directory> section tells Apache that the /tag subdirectory of my HTML content tree should be treated specially, using the handler method of Apache::TagNew instead of the default content handler. Once we activate our module by restarting our server (or by sending it a HUP signal), any file in the /tag directory will be handled by Apache::TagNew, rather than Apache's default handler.

The first thing handler must do is retrieve the Apache request object, traditionally called $r. This object is the key to everything in mod_perl, since it allows us to retrieve information about the HTTP request, the environment, and the server on which the program is running. We also use $r to send data back to the user's browser.

Our method is expected to return one of the symbols we imported from Apache::Constants. Returning OK means we successfully handled the query, data has been returned to the user's browser, and Apache should move to its next stage of handling the request. If we return DECLINED, Apache assumes our module did not handle the request and it should find some other handler willing to do the job. There are a variety of other symbols we can return, including NOT_FOUND, which indicates that the file was not found on our server.

Listing 1.

In Apache::TagNew (see Listing 1), we normally return OK. We return NOT_FOUND if an error occurs when opening the file, and DECLINED if the file does not have a MIME type of “text/html”. Hyperlinks are going to appear only in HTML-formatted text files, so we can save everyone a bit of time and energy by letting another handler take care of other file types.

The rest of the handler works by reading the contents of the file, then replacing them with our new and improved version. We append a “new” label after every </a>, which comes after each hyperlink. In this way, every hyperlink is tagged as new.
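The tagging step itself can be tried outside Apache. What follows is only a minimal sketch, not the code from Listing 1: the subroutine name tag_all_links is hypothetical, and the label markup is borrowed from later in this article.

```perl
use strict;
use warnings;

# Label markup; assumed to match the red "New!" tag used later in the article.
my $NEW_LABEL = ' <font color="red">New!</font>';

# Append the label after every closing </a> tag, regardless of case.
# (tag_all_links is a hypothetical name, not part of Apache::TagNew.)
sub tag_all_links {
    my ($html) = @_;
    $html =~ s{(</a\s*>)}{$1$NEW_LABEL}gi;
    return $html;
}

print tag_all_links(qq{<a href="/a.html">A</a> and <A HREF="/b.html">B</A>\n});
```

Inside a real handler, the string passed in would be the contents of the requested file, and the result would be sent to the browser through $r.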

Printing “New” on New Files

Of course, the point of this project is not to print “new” next to all links, but rather next to new ones. In order to do that, we will need to look at each link in sequence and check to see if it is on our system. If it is, we will check when the associated file was last changed. If that file was changed within the last week, we will tag it as new; otherwise, we will leave it alone.

Listing 2.

In order to do this, we will write another subroutine, which takes care of identifying a link and adds the appropriate text when necessary. That is, the subroutine will take a URL as input and will output either the same URL or the URL with a “new” label appended. Listing 2 is our new version of Apache::TagNew and it contains just such a subroutine, called label_url. The label_url subroutine expects to be invoked with three arguments: $r, the Apache request object, $url, the URL of the hyperlink in question, and $text, the text that goes between the <a> and </a> tags of the hyperlink.

We can know whether a file has changed only if it is on our system. Rather than parse through the URL, I decided to take the simple way by checking whether the URL in question begins with “http://”. If it does, then we assume the URL points to a file on a different system, and we ignore it, returning the URL and text in their original states.

If the URL begins with any other characters, it is assumed to point to a file on our system. We use $r to retrieve the value of the document root directory, namely the directory under which all of the site's documents are stored. This module will work regardless of whether your web documents are under /usr/local/apache/share/htdocs, /etc/httpd/htdocs or even /usr/local/bin. $r retrieves the information from the httpd.conf file, which also means the module does not need updating if you decide to move the document root.

We then check to see whether the file was modified within the last seven days, using Perl's -M operator to get the last modification time. Luckily for our purposes, -M returns its result in days rather than seconds; so, we can simply compare the returned result with 7 and add the label as necessary. If the file was unmodified in the last seven days, the $label variable remains undefined and turns into the empty string later.

Our program returns the modified URL, much as it did in the previous version of Apache::TagNew.
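The age check described above can be sketched without a running server. This is an illustration under stated assumptions rather than the code from Listing 2: the name label_if_recent is hypothetical, $doc_root stands in for the value $r would supply, and the URL is assumed to begin with a slash.

```perl
use strict;
use warnings;

# Sketch of the seven-day check. $doc_root stands in for $r->document_root
# so the code runs outside Apache; label_if_recent is a hypothetical name.
sub label_if_recent {
    my ($doc_root, $url, $text) = @_;
    my $link = qq{<a href="$url">$text</a>};

    # Assume anything starting with http:// lives on another server.
    return $link if $url =~ m{^http://}i;

    # Assumes $url begins with a slash, i.e. is relative to the document root.
    my $file = $doc_root . $url;

    # -M gives the file's age in days, or undef if the file does not exist.
    my $age   = -M $file;
    my $label = '';
    $label = ' <font color="red">New!</font>' if defined $age && $age < 7;

    return $link . $label;
}
```

Note that -M measures age relative to the time the Perl program started, which under mod_perl may be well before the current request; for a seven-day window the difference rarely matters.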

We can evaluate this subroutine over every hyperlink in a document with s///, Perl's substitution operator. We give s/// four modifiers: g performs the operation globally, i ignores case, x lets us spread the pattern across lines, and e replaces the matched text with the result of evaluating the replacement as Perl code:

$contents =~
s|<a\s+href=['"]?(\S+?)['"]?\s*>([\s\S]+?)</a>
|label_url($r, $1, $2)|eigx;

The above regular expression is difficult to understand, so let us examine what it does in greater detail. We make the regexp more readable with the “x” modifier, which allows us to insert whitespace inside of it. We look for the opening <a> and closing </a> tags and extract from them the URL, which is grouped inside the first set of parentheses, and the link text, which is grouped inside the second set of parentheses. We use Perl's non-greedy operators to ensure we get only the necessary text. Otherwise, a closing quotation mark might be captured as part of the URL, or the link text might run on through a later </a> tag.

We then invoke the subroutine label_url. We pass it three arguments: $r (the Apache request object), $1 (the URL we grabbed from the first set of parentheses) and $2 (the link text we grabbed from the second set of parentheses). Whatever label_url returns is substituted for the text we originally found. In this way, we can optionally insert a label into the text of the document.
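To watch the substitution work on its own, we can run it against a small document with a stand-in for label_url. The subroutine stub_label_url below is hypothetical: it ignores the request object and marks every link that does not start with http://, which is enough to show how $1 and $2 reach the callback and how its return value replaces the match.

```perl
use strict;
use warnings;

# Stand-in for label_url: ignores the request object and marks local links.
sub stub_label_url {
    my ($r, $url, $text) = @_;
    my $link = qq{<a href="$url">$text</a>};
    return $url =~ m{^http://}i ? $link : "$link [new]";
}

my $contents = <<'HTML';
<p><a href="/local.html">Local page</a> and
<A HREF='http://example.com/'>Elsewhere</A></p>
HTML

# Same pattern as in the article; undef takes the place of $r.
$contents =~
    s|<a\s+href=['"]?(\S+?)['"]?\s*>([\s\S]+?)</a>
     |stub_label_url(undef, $1, $2)|eigx;

print $contents;
```

Because the callback rebuilds the hyperlink from the URL and the text, any other attributes on the original <a> tag would be lost; the pattern also expects href to be the first attribute.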

Storing Information Across Sessions

The above system has several advantages, but it fails to keep track of when users went to a particular link. In other words, it is terrific at keeping track of a countdown timer for a particular URL, tagging it as new for the first seven days. But once again, we want to produce a “new” label when the document is new to a particular user. What if I have not visited a site in three months? Then all of the content is likely to be new, and “new” labels will be on everything. By contrast, if I visited the site two hours ago, only those documents that have changed since my visit should be labeled.

Keeping track of such information would require us to keep state across HTTP requests, so that we could keep track of which links were seen by a particular user. Unfortunately, HTTP is a stateless protocol, which means the protocol itself remembers nothing between requests. HTTP requests and responses take place in a vacuum, neither storing information for the next transaction nor retrieving information from a previous transaction.

HTTP's stateless nature has created problems for web programmers and designers who wish to create useful applications and has led to a number of clever solutions. Perhaps the most famous solution is the use of HTTP cookies, which allow a web server to store information on the user's computer. Each time the user submits an HTTP request to that server, all cookies previously stored are sent along with the request.

Cookies can store information in several ways. One is by putting the information inside the cookie, thus giving the server immediate access to further details about the user as part of the request. But this quickly becomes cumbersome if you have too much data. For this reason, it is common to use a table in a relational database to keep track of user information. If we define a primary key (i.e., a column guaranteed to be unique) for that table and store only that key in the cookie, we can keep as much information as we like in the table.

Accessing a table in this way can be cumbersome, since it involves many database storage and retrieval operations. Luckily, we can use the Apache::Session module to handle such things. Apache::Session works with mod_perl programs to store and retrieve information across HTTP transactions.

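We can retrieve the cookie in our handler using the header_in method. Notice that we are working with the raw Cookie header, which may carry several cookies at once, so we must use a regular expression to extract the value of interest:

# The header may be missing entirely, so fall back to an empty string.
my $cookies = $r->header_in('Cookie') || '';
# Capture only the session ID; $id is undefined if no such cookie was sent.
my ($id) = $cookies =~ /\bSESSION_ID=(\w*)/;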

Once we have done this, we can use Apache::Session::DBI, the module that connects sessions to a database table. We use Perl's tie routine, which creates a connection between a variable and a module, to provide a seamless connection:

tie %session, 'Apache::Session::DBI', $id,
      {
       DataSource => 'dbi:mysql:test',
       UserName   => 'username',
       Password   => 'password'
      };

You might recognize the three attributes in the above code fragment from DBI, the Perl database interface. DBI works with many different relational databases, thanks to its use of database drivers for specific databases. The above example uses the MySQL database, which I use for many of my database tasks. This example uses the “test” database to store our session information. While “test” is a good place for demonstration databases, you would be wise to put production databases somewhere else.

Apache::Session cannot create a table in MySQL for you. Before using the above code, you will need to create a table in which Apache::Session can store its session information. Here is the recommended table definition, from the Apache::Session::DBI manual pages:

CREATE TABLE sessions (
     id char(16),
     length int(11),
     a_session text
    );

Using Apache::Session

Once our handler has retrieved the user's ID from a cookie and established a connection with the database, we can store and retrieve session information at our convenience.

We can store information about this user in %session, the hash to which we tied Apache::Session. Each time our handler is invoked, we can retrieve information about this user based on his or her ID. For example, we can store a value with:

$session{"foo"} = "bar";

We can then retrieve that value in a later session with:

my $stuff = $session{"foo"};

While our program appears to be storing and retrieving values in %session, it is actually storing them in, and retrieving them from, the database using DBI. This means that, so long as we ensure each user has a unique ID, we can keep everyone's values separate.
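The way tie redirects ordinary hash operations can be seen with a toy class. The package ToyStore below is entirely hypothetical and keeps its data in memory rather than in a database; it is meant only to illustrate the mechanism that a module such as Apache::Session::DBI relies on, not how that module is written.

```perl
use strict;
use warnings;

# Toy backing store standing in for the database table (hypothetical class).
package ToyStore;
my %ROWS;   # the "table": session ID => { key => value }

sub TIEHASH { my ($class, $id) = @_; return bless { id => $id }, $class }
sub STORE   { my ($self, $k, $v) = @_; $ROWS{ $self->{id} }{$k} = $v }
sub FETCH   { my ($self, $k) = @_; return $ROWS{ $self->{id} }{$k} }
sub EXISTS  { my ($self, $k) = @_; return exists $ROWS{ $self->{id} }{$k} }
sub DELETE  { my ($self, $k) = @_; return delete $ROWS{ $self->{id} }{$k} }

package main;

# A first "request" from user abc123 stores a value...
{
    tie my %session, 'ToyStore', 'abc123';
    $session{foo} = 'bar';
}

# ...and a later "request" with the same ID finds it again.
tie my %session, 'ToyStore', 'abc123';
print "foo is $session{foo}\n";

# A different ID gets its own, separate set of values.
tie my %other, 'ToyStore', 'zzz999';
print defined $other{foo} ? "shared\n" : "separate\n";
```

Every read or write of %session is turned into a FETCH or STORE call on the object created by TIEHASH, which is why the calling code never needs to mention the storage behind it.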

Since we have what amounts to a hash that extends across sessions, how can we store information on which URLs we have visited and when? The easiest way is to use the URL as a key into %session, then store the last time the user visited that URL. For example, we can store the time of the current visit with the following code:

my $document_uri = $r->uri;
$session{$document_uri} = time;

We want to retrieve this information when determining whether a user has recently visited a particular link. In order to do that, we will modify label_url so that it expects a fourth argument, a reference to %session. This way, label_url will be able to retrieve session information about the URL in question. We create the reference by preceding %session with a backslash (\%session) before passing it to label_url. We then dereference the copy of %session as follows, at the beginning of label_url:

my $session = shift;
my %session = %{$session};

The full code of a working version of Apache::TagNew, including the label_url subroutine, is in Listing 3.
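It may help to see what the dereferencing step does and does not do. In this sketch a plain hash stands in for the tied %session, and both subroutine names are hypothetical. Copying the hash gives the subroutine a private snapshot, which is fine for reading but means changes made to the copy never reach the original.

```perl
use strict;
use warnings;

# A plain hash stands in for the tied %session here.
my %session = ('/tag/news.html' => 1_000_000);

# Hypothetical helper: reads through the reference without copying the hash.
sub last_visit {
    my ($session, $url) = @_;
    return $session->{$url};
}

# Copying, as in the article, gives the subroutine a private snapshot.
sub touch_copy {
    my ($session, $url) = @_;
    my %copy = %{$session};
    $copy{$url} = 2_000_000;    # changes the snapshot only
    return;
}

touch_copy(\%session, '/tag/news.html');
print last_visit(\%session, '/tag/news.html'), "\n";
```

Since label_url only needs to look up visit times, either style works; reading through the reference simply avoids copying every key on each call.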

Listing 3.

The rest of label_url is largely the same, except for a portion in the middle where we test to see if the URL begins with a slash (/). We must be sure to store and retrieve the same key from %session; otherwise, we will get false readings regarding when we last visited the URL. Since we store the URL based on $r->uri, which always begins with a slash and is relative to our server's root URL directory, we should retrieve the URLs in the same way.

We do this by getting the current URL and removing everything following the final slash:

$current_directory =~ s|^(\S+/)[\w.]+$|$1|;

What is left is indeed the current directory, to which we can append the URL:

$url = $current_directory . $url;
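The normalization step can be checked in isolation. In this sketch the subroutine name normalize_url is hypothetical and $current_uri stands in for the value of $r->uri; links containing ../ or a query string are not handled.

```perl
use strict;
use warnings;

# Turn a link as written in the page into the key used in %session.
# $current_uri stands in for $r->uri; normalize_url is a hypothetical name.
sub normalize_url {
    my ($current_uri, $url) = @_;

    # Already relative to the server root: use it unchanged.
    return $url if $url =~ m{^/};

    # Strip the file name from the current URI, leaving its directory.
    (my $current_directory = $current_uri) =~ s|^(\S+/)[\w.]+$|$1|;

    return $current_directory . $url;
}

print normalize_url('/tag/index.html', 'news.html'), "\n";
print normalize_url('/tag/index.html', '/about.html'), "\n";
```

With these inputs, a relative link on /tag/index.html is keyed under /tag/, while a link that already starts with a slash passes through untouched.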

Now we can retrieve the session information about that URL, confident we are using the same set of keys for retrieval as we did earlier for storage. We retrieve session information about when we last viewed the file in question, turning it into a number of days relative to right now:

my $last_time = (time - $session{$url}) / 86400;

Then we build the full name of this file by prepending $r->document_root (the full path name leading to each file on the web site, normally invisible to users) to the URL. We can then easily determine how long ago the file was modified:

my $full_filename = $r->document_root . $url;
my $ctime = -M $full_filename;

Finally, we compare $ctime (the number of days since the file was modified) with $last_time (the number of days since the user last saw the file). If the former is smaller than the latter, we add the label:

if ($ctime < $last_time)
{
    $label = "<font color=\"red\">New!</font>";
}

This module seems to do a good job of labeling new documents on a user-by-user basis. As long as users enable cookies, they should be able to get an accurate reading of which files they have not seen in a long time.
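The per-user comparison can be exercised on its own with made-up numbers. This is a sketch under stated assumptions: the name is_new_for_user is hypothetical, and the explicit handling of a missing file or a link the user has never followed is a choice made for this example rather than something taken from Listing 3.

```perl
use strict;
use warnings;

# Decide whether a file is new to this particular user.
# $days_since_modified is what -M returns; $last_visit is the epoch time
# stored in %session (undef if the user has never followed the link).
sub is_new_for_user {
    my ($days_since_modified, $last_visit, $now) = @_;
    return 0 unless defined $days_since_modified;   # file missing: no label
    return 1 unless defined $last_visit;            # never seen: always new
    my $days_since_visit = ($now - $last_visit) / 86400;
    return $days_since_modified < $days_since_visit ? 1 : 0;
}

my $now = time;
# File changed 2 days ago, user last visited 90 days ago.
print is_new_for_user(2, $now - 90 * 86400, $now), "\n";
# File changed 2 days ago, user last visited 2 hours ago.
print is_new_for_user(2, $now - 2 * 3600, $now), "\n";
```

The first call corresponds to the visitor who has been away for three months and the second to the visitor who was here two hours ago, the two cases raised at the start of this section.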

Conclusion

For a medium that is supposed to adapt itself to our own needs, the Web is surprisingly primitive—for instance, in the way “new” documents are labeled on web sites. This month, we have seen how mod_perl allows us to personalize our site a bit more, showing people what is actually new from their perspective, rather than from the webmaster's perspective. I hope you also noticed how advanced some of these tools have become; with a little more than 100 lines of Perl code, we were able to make a substantial change to our web server that had little impact on performance, but provided great benefit to our users.

Resources

Reuven M. Lerner is an Internet and Web consultant living in Haifa, Israel, who has been using the Web since early 1993. His book Core Perl will be published by Prentice-Hall later this year. Reuven can be reached at reuven@lerner.co.il. The ATF home page, including archives and discussion forums, is at http://www.lerner.co.il/atf/.