Advanced “New” Labels

Improving the way your site handles “new” labels using the popular Apache module mod_perl.
Printing “New” on New Files

Of course, the point of this project is not to print “new” next to all links, but rather next to new ones. In order to do that, we will need to look at each link in sequence and check to see if it is on our system. If it is, we will check when the associated file was last changed. If that file was changed within the last week, we will tag it as new; otherwise, we will leave it alone.

Listing 2.

In order to do this, we will write another subroutine, which takes care of identifying a link and adds the appropriate text when necessary. That is, the subroutine will take a URL as input and will output either the same URL or the URL with a “new” label appended. Listing 2 is our new version of Apache::TagNew and it contains just such a subroutine, called label_url. The label_url subroutine expects to be invoked with three arguments: $r, the Apache request object, $url, the URL of the hyperlink in question, and $text, the text that goes between the <a> and </a> tags of the hyperlink.

We can know whether a file has changed only if it is on our system. Rather than parse through the URL, I decided to take the simple way by checking whether the URL in question begins with “http://”. If it does, then we assume the URL points to a file on a different system, and we ignore it, returning the URL and text in their original states.

If the URL begins with any other characters, it is assumed to point to a file on our system. We use $r to retrieve the value of the document root directory, namely the directory under which all of the site's documents are stored. This module will work regardless of whether your web documents are under /usr/local/apache/share/htdocs, /etc/httpd/htdocs or even /usr/local/bin. $r gets this value from the server configuration in httpd.conf, which also means the module does not need updating if you decide to move the document root.

We then check to see whether the file was modified within the last seven days, using Perl's -M operator, which returns the age of the file measured from its last modification. Luckily for our purposes, -M returns its result in days rather than seconds, so we can simply compare the returned result with 7 and add the label as necessary. If the file has not been modified in the last seven days, the $label variable remains undefined and turns into the empty string later.

Our program returns the modified URL, much as it did in the previous version of Apache::TagNew.
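Since Listing 2 is not reproduced here, the following is only a minimal sketch of what a subroutine like label_url might look like, based on the description above. The label markup, the handling of a missing file and the assumption that local URLs are paths relative to the document root are illustrative choices, not necessarily those of the published listing.

```perl
use strict;
use warnings;

# Sketch only; the published Listing 2 may differ. $r is expected to be
# the Apache request object, or anything else with a document_root method.
sub label_url {
    my ($r, $url, $text) = @_;
    my $link = qq{<a href="$url">$text</a>};

    # URLs beginning with http:// are assumed to be on another system.
    return $link if $url =~ m{^http://}i;

    # Assumption: local URLs are paths relative to the document root.
    my $filename = $r->document_root . '/' . $url;

    # -M yields the file's age in days, or undef if it does not exist.
    my $age = -M $filename;
    my $label;
    $label = ' <strong>New!</strong>' if defined $age and $age < 7;

    # An undefined $label becomes the empty string here.
    return $link . (defined $label ? $label : '');
}
```

Because the sketch only calls document_root on $r, it can be exercised outside the server with any small object that provides that method.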

We can evaluate this subroutine over every hyperlink in a document with s///, Perl's substitution operator. We give s/// several modifiers: g performs the operation globally, i ignores case and e treats the replacement as Perl code, substituting the result of evaluating it for the matched text. A fourth modifier, x, is there only for readability:

$contents =~
s|<a\s+href=['"]?(\S+?)['"]?\s*>([\s\S]+?)</a>
|label_url($r, $1, $2)|eigx;

The above regular expression is difficult to understand, so let us examine what it does in greater detail. We make the regexp more readable with the “x” modifier, which allows us to insert whitespace inside of it. We look for the opening <a> and closing </a> tags and extract from them the URL, which is grouped inside the first set of parentheses, and the link text, which is grouped inside the second set of parentheses. We use Perl's non-greedy quantifiers to ensure we get only the necessary text. Otherwise, such things as the closing quotation mark might be included in the URL, or a single match might run on through several hyperlinks.

We then invoke the subroutine label_url. We pass it three arguments: $r (the Apache request object), $1 (the URL we grabbed from the first set of parentheses) and $2 (the link text we grabbed from the second set of parentheses). Whatever label_url returns is substituted for the text we originally found. In this way, we can optionally insert a label into the text of the document.
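To watch the substitution at work without a running server, here is a small self-contained sketch. The demo_label_url subroutine is a stand-in invented for this example (it takes no $r and simply marks every link that does not begin with http://), and the sample HTML is made up.

```perl
use strict;
use warnings;

# Stand-in for label_url, used only for this demonstration: it marks
# every link whose URL does not start with http://.
sub demo_label_url {
    my ($url, $text) = @_;
    my $label = $url =~ m{^http://}i ? '' : ' [new]';
    return qq{<a href="$url">$text</a>$label};
}

my $contents = <<'HTML';
<p><a href="/local.html">Local page</a> and
<A HREF='http://www.example.com/'>Remote
page</A></p>
HTML

# Same pattern as in the article; x lets us break it across lines.
$contents =~
    s|<a\s+href=['"]?(\S+?)['"]?\s*>([\s\S]+?)</a>
     |demo_label_url($1, $2)|eigx;

print $contents;
```

Note that because the subroutine rebuilds the hyperlink from $1 and $2, the output always uses a lowercase tag with double quotes, whatever the original markup looked like. Link text that spans several lines is still matched, because [\s\S] matches newlines as well.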

Storing Information Across Sessions

The above system has several advantages, but it fails to keep track of when users last visited a particular link. In other words, it works like a countdown timer for each URL, tagging it as new for the first seven days after a change, no matter who is looking. But once again, we want to produce a “new” label when the document is new to a particular user. What if I have not visited a site in three months? Then all of the content is likely to be new, and “new” labels should be on everything. By contrast, if I visited the site two hours ago, only those links whose documents have changed since my visit should be labeled.

Keeping track of such information would require us to keep state across HTTP requests, so that we could keep track of which links were seen by a particular user. Unfortunately, HTTP is a stateless protocol, which means we cannot save such information. HTTP requests and responses take place in a vacuum, neither storing information for the next transaction nor retrieving information from a previous transaction.

HTTP's stateless nature has created problems for web programmers and designers who wish to create useful applications and has led to a number of clever solutions. Perhaps the most famous solution is the use of HTTP cookies, which allow a web server to store information on the user's computer. Each time the user submits an HTTP request to that server, all cookies previously stored are sent along with the request.

Cookies can store information in several ways. One is by putting the information inside the cookie, thus giving the server immediate access to further details about the user as part of the request. But this quickly becomes cumbersome if you have too much data. For this reason, it is common to store only a unique identifier in the cookie and to keep the user information itself in a table in a relational database. If we define that identifier as the table's primary key (i.e., a column guaranteed to be unique), we can store as much information as we like in the table.

Accessing a table in this way can be cumbersome, since it involves many database storage and retrieval operations. Luckily, we can use the Apache::Session module to handle such things. Apache::Session works with mod_perl programs to store and retrieve information across HTTP transactions.

We can retrieve the cookie in our handler using the header_in method. Notice that we are working with the raw Cookie header, which may carry several cookies, so we use a pattern match to pull out the value of interest:

# $id is left undefined if no SESSION_ID cookie was sent
my $cookie = $r->header_in('Cookie') || '';
my ($id) = $cookie =~ /SESSION_ID=(\w+)/;
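The extraction step can be tried outside the server by feeding it a raw header string directly. The helper name session_id_from_cookie and the sample header are made up for this sketch; the only thing taken from the text is the SESSION_ID cookie name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the SESSION_ID value from a raw Cookie
# header, or undef if the header is missing or lacks that cookie.
sub session_id_from_cookie {
    my ($header) = @_;
    return undef unless defined $header;

    # Anchor on the start of the header or on a "; " separator so that
    # a cookie such as OLD_SESSION_ID is not picked up by mistake.
    my ($id) = $header =~ /(?:^|;\s*)SESSION_ID=(\w+)/;
    return $id;
}

print session_id_from_cookie('theme=dark; SESSION_ID=abc123; lang=en'), "\n";
```

Returning undef when there is no session cookie lets the caller tell a first-time visitor apart from a returning one.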

Once we have done this, we can use Apache::Session::DBI, the module that connects sessions to a database table. We use Perl's tie routine, which creates a connection between a variable and a module, to provide a seamless connection:

tie %session, 'Apache::Session::DBI', $id,
      {
       DataSource => 'dbi:mysql:test',
       UserName   => 'username',
       Password   => 'password'
      };

You might recognize the three attributes in the above code fragment from DBI, the Perl database interface. DBI works with many different relational databases, thanks to its use of database drivers for specific databases. The above example uses the MySQL database, which I use for many of my database tasks. This example uses the “test” database to store our session information. While “test” is a good place for demonstration databases, you would be wise to put production databases somewhere else.
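If tie is unfamiliar, the following sketch may help show what it means to connect a variable to a module. It does not use Apache::Session at all. CountingStore is a hypothetical class written for this example that keeps its data in memory (by inheriting from the standard Tie::StdHash) and counts how many times the hash is written to.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a session module: it keeps its data in
# memory and counts how often the tied hash is written to.
package CountingStore;
require Tie::Hash;
our @ISA = ('Tie::StdHash');
our $writes = 0;

# Called by Perl whenever a value is assigned to the tied hash.
sub STORE {
    my ($self, $key, $value) = @_;
    $writes++;
    $self->{$key} = $value;
}

package main;

tie my %session, 'CountingStore';
$session{last_visit} = time;   # one STORE
$session{visits}++;            # one FETCH, then one STORE
print "writes so far: $CountingStore::writes\n";
```

The code in main treats %session as an ordinary hash, while every assignment is silently routed through the module's STORE method. A session module can use the same hook to save the data somewhere more permanent, such as a database table.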

Apache::Session cannot create a table in MySQL for you. Before using the above code, you will need to create a table in which Apache::Session can store its session information. Here is the recommended table definition, from the Apache::Session::DBI manual pages:

CREATE TABLE sessions (
     id char(16),
     length int(11),
     a_session text
    );