Integrating Web Applications with Apache

When you deploy a web application, how do end users access it? Web applications are often placed behind a gateway through which end users reach them. One of the popular products that acts as an application gateway on Linux is the Apache Web Server. Although it can function as a normal web server, it can also relay requests to other web servers.

In this article, I discuss what it takes to integrate a web application into Apache. This includes integrating the HTTP protocol functionality, customizing content to render properly and reusing pieces of configuration. Once you understand those basic bits of functionality, you'll have the tools you need to maximize your web applications' usability. So, let's get started!

Crash Course in RegEx

A mechanism that I use throughout this article that might need a brief introduction is Regular Expressions (or regex). Regex is used to define a text pattern to search for within a URL or to find and replace text within content, such as HTML or JavaScript. The text processing command sed uses regex to do searches and substitutions.

For each example below there will be three parts: input, regex pattern and output. The pattern will be applied to the input text and determine the value of the output text.

Example 1:


Input:
  Name: Frank Sinatra
  Genre: Jazz
  Name: 2Pac
  Genre: Rap
  Name: Reel Big Fish
  Genre: Ska

Regex pattern: "^Name: "

Output:
  Name: Frank Sinatra
  Name: 2Pac
  Name: Reel Big Fish

This example searches the input text for text that matches the pattern "^Name: ". This pattern says, "Look for the text 'Name: ' at the beginning of each line." Since there are three lines that begin with that text, only those three lines are returned. While "^" represents the beginning of a line, "$" represents the end of a line. So if you were to apply the pattern "a$", two lines would be returned (Frank Sinatra and Ska). Let's expand on that example and use the input from Example 1 with a new pattern.

Example 2:


Regex pattern: "^Name: [0-9]"

Output:
  Name: 2Pac

As you can see, I've taken the original regex pattern and added [0-9] to the end. This will search for a single character that can be any number from 0 to 9, which is why "2Pac" was the only line returned. You also can specify a range with alphabetic characters ([a-z] or [A-Z]).

Along with pattern selection, you also can do substitution with regex. There are two formats for regex substitutions: s|pattern|replace|modifier or s/pattern/replace/modifier. In Apache, I find it easier to use the pipe-style substitution. Example 3 uses the same input with a new pattern.

Example 3:


Regex pattern: "s|^(.*)Frank(.*)$|\1Dwezil\2|g"

Output:
  Name: Dwezil Sinatra
  Genre: Jazz
  Name: 2Pac
  Genre: Rap
  Name: Reel Big Fish
  Genre: Ska

This pattern has a lot to dissect. One of the great features of regex is the ability to match any character. The dot operator will match any one character. The asterisk operator will match 0 or more of whatever character or operator preceded it. Putting these two operators together matches 0 or more of any character. Enclosing this in parentheses allows the matched text to be represented in the replace portion of the pattern with a variable. In this case, \1 represents the first block of text within parentheses and \2 represents the second. The only characters that are explicitly being matched are "Frank". As such, the lines containing "Frank" will be replaced with everything up to "Frank" (represented by \1), "Dwezil", and everything following "Frank" (represented by \2). As you can see, the entirety of the text input was sent to the output although modified by the pattern.

Protocol Integration

When you decide that an application would benefit from Apache integration, it will most likely reside on a separate server from Apache. To fully integrate applications accessed via HTTP, any or all of these modules may be used: mod_rewrite, mod_proxy, mod_ssl and mod_headers. Each of these modules lets you customize how the end user and the web servers communicate, from modifying HTTP header data to managing proxy connections to other servers.

First, let's look at mod_rewrite. There are a number of directives within the mod_rewrite module, but I cover only a handful here: RewriteEngine, RewriteCond and RewriteRule. The RewriteEngine directive simply enables URL rewriting and is invoked as follows:


RewriteEngine on

RewriteRule allows the server to respond to a request for a specific URL by, among other things, returning an HTTP redirect (code 301 or 302) that sends the end user to a specified URL, or by sending a proxied request to a back-end server. Here's an example of issuing an HTTP redirect:


RewriteRule "^/google$" http://www.google.com [R=301]

In this example, when the URL /google is accessed, the server responds with an HTTP 301 that redirects the user to http://www.google.com. The first argument is a regex, so the "^" and "$" anchors are what restrict the match to a request path of exactly "/google"; without them, the rule would match any path containing "/google". If you need to redirect on any URL starting with "/google", one way is to define a conditional redirect using RewriteCond as follows:


RewriteCond "%{REQUEST_URI}" "^/google"
RewriteRule "^.*$" http://www.google.com [R=301]

The RewriteCond directive has two parts: a string value to test and a regex pattern to test it against. In this example, you are checking the REQUEST_URI server variable for anything beginning with "/google". If that condition is met, the RewriteRule on the following line is applied. Because the RewriteCond already decides which URLs qualify, the pattern in the RewriteRule is "^.*$", which matches any URL.
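
For a simple prefix match like this one, the condition can arguably be folded into the rule itself. The following is a minimal sketch, assuming the rule lives in a server or <VirtualHost> context (where the pattern is matched against a path that starts with "/") and that mod_rewrite is loaded. The L flag, which stops further rule processing, is my addition and not part of the example above:

```apache
RewriteEngine on
# "^/google" is anchored only at the start, so any path that begins
# with /google matches; no RewriteCond is needed for this case.
RewriteRule "^/google" "http://www.google.com" [R=301,L]
```

RewriteCond remains the better fit when the test is on something other than the request path, such as a header or the request method.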

The examples so far are all user-facing events, such as a 301 redirect. The RewriteRule directive also can be used to proxy requests to a server. This happens behind the scenes, unlike an HTTP redirect, so the request is forwarded without the user's knowledge. A proxied request may be configured like the example below:


RewriteRule "^/home/(.*)$" http://back-end01.test:8080/$1 [P]

The above illustrates a virtual root directory. When the user accesses anything underneath /home (note the ".*" expression), the request is sent to back-end01.test on port 8080 with the path set to whatever follows /home. For example, if the user requests /home/test/image.jpg, the request is sent to back-end01.test:8080 with a path of /test/image.jpg. A proxied RewriteRule also may be used in conjunction with RewriteCond for further customization. Note that the [P] flag relies on mod_proxy being loaded, and that it only forwards the request: URLs in response headers that point at the back-end server, such as Location, are not rewritten unless you add a ProxyPassReverse directive from mod_proxy.
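
As a rough sketch of combining the two, the following proxies only GET requests under /home/. The choice of REQUEST_METHOD as the condition is purely illustrative and not taken from the example above; the sketch assumes mod_rewrite, mod_proxy and mod_proxy_http are all loaded:

```apache
RewriteEngine on
# Illustrative condition: only proxy GET requests.
RewriteCond "%{REQUEST_METHOD}" "^GET$"
# Forward /home/<anything> to the back end, keeping the rest of the path.
RewriteRule "^/home/(.*)$" "http://back-end01.test:8080/$1" [P]
```

Requests that fail the condition fall through to whatever rules or content handling come next.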

Another option for proxying HTTP connections through Apache is mod_proxy, which provides ProxyPass, ProxyPassReverse and ProxyPassMatch among many other directives that provide more robust proxying options. I focus primarily on these three directives here. As mentioned previously, RewriteRule provides proxying of HTTP requests. Let's compare the example already given for proxying with RewriteRule and an example for ProxyPass:


ProxyPass /home/ http://back-end01.test:8080/

This ProxyPass statement provides roughly the same functionality as the RewriteRule statement with a simpler directive. The trailing slash appears on both arguments so that the path is mapped without a doubled or missing "/". When a request comes in for any URL beginning with "/home/", the request line is rewritten so that the request will be received properly by http://back-end01.test:8080/. Consider the following first lines of an HTTP request:


From user to server:    GET /home/test/image.jpg HTTP/1.1
From server to back-end:    GET /test/image.jpg HTTP/1.1

The first line of the request contains the method (GET in this case) and the URL being requested. When the server receives the request from the client, it strips off "/home", as specified in the ProxyPass directive, and forwards the request to the back-end server. The back-end server, however, may answer with headers that refer to its own address, for example in the Location header of a redirect. To have Apache rewrite those response headers so that they point back at the proxy, pair the following ProxyPassReverse statement with the previous ProxyPass statement:


ProxyPassReverse /home/ http://back-end01.test:8080/

The syntax is exactly the same as ProxyPass, adding to the simplicity of the mod_proxy configuration. This adjusts the URL in the Location, Content-Location and URI headers of responses from the back end, so that redirects issued by the back-end server lead the client back through /home/ on the proxy rather than directly to back-end01.test:8080. If you need pattern-based proxying (similar to RewriteCond), you can use ProxyPassMatch. When implementing a reverse proxy configuration, ProxyPassMatch can replace ProxyPass. Here's an example:


ProxyPassMatch "^/home/([a-z0-9]*/docs)" http://docserver01.test:8080/$1
ProxyPassReverse /home/ http://docserver01.test:8080/

This example suggests that within the /home folder, there are many sub-folders (let's say user names) and within each of those exists a folder named "docs". The USERNAME/docs URL exists on docserver01.test:8080 in the root of the web server, as denoted by the $1 in the server URL. The ProxyPassReverse will function in the same manner as it did in the previous example.
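
Putting the pieces together, a reverse proxy for the first back end might look roughly like the sketch below. The <VirtualHost> address, the ServerName of public.test (borrowed from the examples later in this article) and the ProxyRequests line are assumptions for illustration; the sketch also assumes mod_proxy and mod_proxy_http are loaded:

```apache
<VirtualHost *:80>
    ServerName public.test

    # Reverse proxy only; do not act as a forward proxy.
    ProxyRequests off

    # Map /home/... on this host to the root of the back end.
    ProxyPass        "/home/" "http://back-end01.test:8080/"
    # Rewrite Location-style response headers back to /home/...
    ProxyPassReverse "/home/" "http://back-end01.test:8080/"
</VirtualHost>
```

Keeping the two directives next to each other with identical arguments makes it harder for them to drift apart when the back-end address changes.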

Securing websites with SSL in Apache is accomplished with mod_ssl. Although I won't discuss configuring SSL from the ground up, a few directives relate to proxied SSL connections: SSLProxyCheckPeerExpire, SSLProxyCheckPeerName and SSLProxyCheckPeerCN. It is a common practice to use self-signed certificates on back-end servers (provided a valid cert is in place on the user-facing server), and these directives address common issues that can arise when using self-signed certs. Any of these directives can have one of two arguments provided: "on" or "off". If set to "off", SSLProxyCheckPeerExpire will skip checking the expiration date on the SSL cert used on a back-end server. To avoid checking a certificate's common name or alternate names against the server name used to access a back end, set SSLProxyCheckPeerName to "off". In older versions of Apache, you might be able to use SSLProxyCheckPeerCN (set to "off") instead of SSLProxyCheckPeerName.
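
As a hedged sketch, proxying to an HTTPS back end that uses a self-signed certificate might be configured along these lines. SSLProxyEngine is the mod_ssl directive that enables SSL for proxied connections; the /secure/ path and port 8443 are made-up values for illustration:

```apache
# Allow mod_proxy to open HTTPS connections to back-end servers.
SSLProxyEngine on

# Skip the expiry and host name checks on the back-end certificate.
SSLProxyCheckPeerExpire off
SSLProxyCheckPeerName off

# Hypothetical path and port, chosen only for this example.
ProxyPass        "/secure/" "https://back-end01.test:8443/"
ProxyPassReverse "/secure/" "https://back-end01.test:8443/"
```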

Along with rewriting URLs, it may be necessary to rewrite HTTP request or response header fields. In Apache, this is done with mod_headers. There are only two directives within this module: Header and RequestHeader. These directives are used to modify response and request header fields, respectively. Many actions can be used with either of these directives, but here, let's look at the set and edit actions—for example:


Header set ReceiveTime "%t"

This example will add a header named ReceiveTime to the HTTP response, replacing any existing header of that name, and set its value to the time the request was received by the server (represented by "%t"). Apache writes this value as "t=" followed by the number of microseconds since the UNIX epoch.

If you need to replace the value of a header that comes from a back-end server, you would use the edit action. Consider the following example:


Header edit Location "^http://back-end01.test:8080/(.*)$" "http://public.test/$1"

This example rewrites the Location header in an HTTP response, which is present in a 301/302 redirect. If the header value begins with http://back-end01.test:8080, that part is replaced with "http://public.test" (the user-facing URL).
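
RequestHeader works the same way in the other direction, modifying the request before it is passed to the back end. Here is a minimal sketch, assuming the user-facing site runs HTTPS and the back-end application looks at the conventional X-Forwarded-Proto header (an assumption about the application, not something the examples above require):

```apache
# Tell the back end that the original client connection used HTTPS.
RequestHeader set X-Forwarded-Proto "https"
```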

Content Integration

Once a remote application is integrated with an Apache server, from a protocol standpoint, it may be necessary to integrate content. This will generally manifest itself as URLs coded into HTML or JavaScript that are specific to a back-end server and not to a user-facing server. The basic necessity is to be able to search and replace bits of HTML or JavaScript content, so that it can render and perform correctly when accessed through an Apache proxy. The module that accomplishes this is mod_substitute and specifically the Substitute directive. Substitute allows a simple regex substitute to be performed on the payload data of an HTTP response.

Something to consider before attempting to replace text is to account for whether the back-end web server compresses data before sending it over the network. If it does, your Substitute statements might not work, as it will be searching for ASCII text within binary compressed data. To account for this, you can instruct Apache to decompress the data, manipulate the response and then re-compress it. This is done using the SetOutputFilter directive, which is part of Apache core functionality. Here's how it works:


SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE

Reading the arguments from left to right, this tells Apache to INFLATE (decompress) the data from the back-end server, perform the substitute and DEFLATE (compress) the data before returning it to the end user.
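
In practice, the filter chain is often scoped to the proxied path rather than set globally. The following is a sketch under that assumption, using a <Location> block for /home/ (matching the earlier ProxyPass examples) and a simple Substitute rule of the kind described next; it assumes mod_deflate and mod_substitute are loaded:

```apache
<Location "/home/">
    # Decompress, rewrite, then recompress the response body.
    SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
    Substitute "s|http://back-end01.test:8080|https://public.test|i"
</Location>
```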

The Substitute statement uses a regex substitute expression. As I mentioned previously, I find it easier to use the pipe-style substitute expression in Apache. To recap, the syntax is s|search|replace|options. The option I use most often is "i", which makes the search case-insensitive. Be careful with the "n" option: it makes Apache treat the search value as a fixed string rather than a regex, so it cannot be combined with capture groups like the ones used here. Here's a common use example:


Substitute "s|(href=\"http)(://)back-end01.test:8080|$1s$2public.test|i"

For this example, let's assume that the user-facing site (public.test) runs HTTPS, and the back-end server (back-end01.test) runs HTTP on port 8080. This would be a solution if the back-end web server returned hyperlinks that were specific to itself as opposed to the user-facing site. In the search portion of the regex substitute, this splits out two groups of text in parentheses: (href=\"http) and (://). These are blocks of text that you want preserved in the replace section of the regex. In the replace, you are inserting an "s" after http and replacing the hostname/port with the user-facing site name. After processing, the resulting string will be href="https://public.test. This will update hyperlinks that use "href" attributes (<a> and <link>). For <img> and <script> tags, you could use this same Substitute statement and replace "href" with "src". Another consideration would be to account for double or single quotes delimiting attribute values (href=' vs. href=").
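
The variations mentioned above could be written out as separate statements. This is a sketch only, using the "i" option by itself so that the search value is treated as a regex and the capture groups work; whether you need every variant depends on the markup your back end actually emits:

```apache
# src attributes with double quotes (<img>, <script>)
Substitute "s|(src=\"http)(://)back-end01.test:8080|$1s$2public.test|i"

# href and src attributes delimited by single quotes
Substitute "s|(href='http)(://)back-end01.test:8080|$1s$2public.test|i"
Substitute "s|(src='http)(://)back-end01.test:8080|$1s$2public.test|i"
```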

Another application of Substitute is to extend the functionality of a page without manipulating the original source code. Consider the following example:


Substitute "s|(<body[^>]*>)|$1<div style=\"font-size:14pt;font-weight:bold;background-color:#ff0000;color:#ffffff;display:block;text-align:center;\">This site will be down for 24 hours beginning at 8 pm tonight</div>|i"

If a website needs to be taken off-line for maintenance, this is an easy way to alert the user population of the outage without modifying the application itself. This example simply inserts a red bar along the top of the page (right after the <body> tag), which displays information about the outage. Depending on how your page is rendered, you might need to choose another tag to act as your starting point instead of <body>.

Streamlining Future Integrations

All of the topics presented here can be configured and maintained relatively easily if you have only a few statements. In the real world, there typically will be many sites that use a similar configuration, and defining the functionality separately for each site is time-consuming and invites mistakes. Luckily, Apache provides a mechanism to repeat functionality throughout your configuration through the use of mod_macro. The <Macro> directive within an Apache config works much like a function or subroutine. Once a macro is defined, it can be referenced as many times as necessary, leaving you with one place within your config to maintain your detailed functionality. Here's an example macro:


<Macro RedirectSecure $host $path>
        RewriteCond "%{REQUEST_URI}" "^$path"
        RewriteRule "^/(.*)$" "https://$host/$1"
</Macro>

When called, this macro defines a RewriteCond and RewriteRule that, if the user accesses a URL starting with the value of the $path argument, will redirect the user to https://$host/$1, where $host is the hostname specified as a macro argument and $1 is the requested URL path without its leading slash. The following syntax would be used to call this macro:


Use RedirectSecure public.test /users

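Something to consider is the location within the Apache config from which a macro is called. Rewrite rules defined in the main server context, for example, are not inherited by virtual hosts by default, so a macro containing a RewriteRule generally should be called inside the <VirtualHost> block it is meant to affect. More generally, if a macro is called in a context where one of its directives is not permitted, Apache will report a configuration error and not start. Here's another example: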


<Macro ReplaceContentURL $backendurl $publicurl>
        Substitute "s|(href=\")$backendurl|$1$publicurl|i"
        Substitute "s|(src=\")$backendurl|$1$publicurl|i"
</Macro>

This macro expands on the replacing of URLs that I covered previously. This will search for tag attributes of "href" and "src" and replace the hyperlinks of the back-end server with that of the user-facing server. Here's an example of how this might be called:


Use ReplaceContentURL http://back-end01.test:8080 https://public.test

This will search for http://back-end01.test:8080 wherever it is preceded by either href=" or src=" and replace the URL with https://public.test. Macros can be used for any piece of Apache configuration. They can handle small tasks as shown here as well as whole site configurations. Although macros are pretty simple, they make the difference between a large amount of difficult-to-maintain configuration files and a simplified reusable configuration.
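
To illustrate the "whole site" case, here is a sketch of a macro that wraps an entire virtual host. The macro name ProxiedSite and its parameters $name and $backend are hypothetical, invented for this example; it assumes mod_macro, mod_proxy and mod_proxy_http are loaded:

```apache
<Macro ProxiedSite $name $backend>
    <VirtualHost *:80>
        ServerName $name
        # Send every request for this site to the given back end.
        ProxyPass        "/" "$backend/"
        ProxyPassReverse "/" "$backend/"
    </VirtualHost>
</Macro>

# One line per site; the back-end URL is given without a trailing slash.
Use ProxiedSite public.test http://back-end01.test:8080

# Optionally remove the definition once it is no longer needed.
UndefMacro ProxiedSite
```

Because the macro contains its own <VirtualHost> block, it would be called from the main server config rather than from inside another virtual host.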

At this point, you have some basic knowledge of integrating HTTP, customizing content and reusing configuration within Apache. Although many directives and modules weren't covered here, this should be a good starting point for accessing your applications through Apache.

Resources

The following are some articles I've found useful along with some example Apache configs I've written.

Apache Module Reference (2.2): http://httpd.apache.org/docs/2.2/mod

Apache Module Reference (2.4): http://httpd.apache.org/docs/2.4/mod

Git Instaweb Reverse Proxy: https://gist.github.com/bng44270/cff67619db3e3f915957

Monit Reverse Proxy: https://gist.github.com/bng44270/287277ea1975b9a3e3526d5a5bcb017c

Adobe Experience Manager Apache Config: https://github.com/bng44270/aem-dispatcher-config
