Web Servers and Dynamic Content


by Dan Teodor

on January 31, 2001

When web servers first appeared, their primary purpose was to serve up selected information from the machine on which they ran. The idea was simply to take the contents of a file and transmit them over a TCP connection in HTTP format. The inherent limitation, discovered early on, was that dynamic content could not be delivered; CGI was defined and added to web servers to address this.

The Common Gateway Interface (CGI) provides a way for the web server to execute a process, take that process's output and pass it back to the client browser as if it were the contents of a static file. Since that time, many scripting engines built on CGI (Perl, Python, PHP, etc.) have evolved that simplify the programmer's job and make running multiple external processes more efficient. However, for the most part, these scripting languages have the drawback of needing to be interpreted. Also, they overlook the fact that enormous bodies of code already exist in such languages as C and Fortran (remember it?) to solve complex, computationally intensive problems. While building dynamic content based on the results of queries is very useful, it is not realistic to apply image data transformations or calculate Fourier transforms using these scripting languages, as the time required to complete such calculations is very long compared with the response time the user expects.

Therefore, because of the existing code base and the existence of dynamic content that has to be generated by some algorithms that cannot be efficiently implemented in scripting languages, there is a definite need for developing programs that use the CGI, written in a traditional language like C or Fortran and compiled down to native code.

How Web Servers Pass Data to Your Program

There are two ways in which HTTP passes data from the browser to the web server: the GET data (specified as part of the URL, commonly the part that appears after the “?”) and the POST data (the collected name-value pairs of all of the fields in the form being submitted by the web browser). To figure out the GET data, just look at the URL: http://www.mydomain.com/pages/external.cgi?additional-data.

GET data portion: additional-data

To figure out the POST data look at the source of the document generating the request, and find the form that is being submitted:

<FORM METHOD="POST">
<INPUT TYPE="text" NAME="fld01" VALUE="val01">
<INPUT TYPE="hidden" NAME="fld02" VALUE="val02">
<INPUT TYPE="checkbox" NAME="fld03" CHECKED>
<SELECT NAME="fld04">
<OPTION VALUE="val03"> Val-03
<OPTION SELECTED VALUE="val04"> Val-04
<OPTION VALUE="val05"> Val-05
</SELECT>
</FORM>

POST data: fld01=val01&fld02=val02&fld03=on&fld04=val04

As you can see, POST data is submitted as one continuous string. Each field name is joined to its value by an equal sign (“=”), and each field/value pair is separated from the next by an ampersand (“&”).

There are other complexities involved in the format for passing in POST data. Many characters need to be “escaped” so that they are not mistaken for control characters or separators. This is solved by inserting plus signs (“+”) for spaces and escape sequences of the format “%[0-9A-F][0-9A-F]” (a percent sign followed by two hexadecimal digits) in place of non-printable characters, ampersands, pluses and equal signs. (For the purist, yes, it is true that spaces can be represented as either the “+” symbol or as the escape sequence “%20”. All versions of Apache and IIS with which I have worked accept both.)
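The escaping rules above can be undone with a short routine. What follows is a minimal sketch, not the code from the article's listings; the names url_decode and hex_value are chosen here for illustration. It decodes in place, which works because the decoded text is never longer than the encoded text.

```c
#include <ctype.h>
#include <stddef.h>

/* Convert one hexadecimal digit to its value. The caller has already
   checked the character with isxdigit(). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

/* Decode a URL-encoded string in place and return its new length.
   '+' becomes a space and "%XX" becomes the byte with hex value XX.
   A '%' that is not followed by two hex digits is copied unchanged. */
size_t url_decode(char *s)
{
    char *src = s;
    char *dst = s;

    while (*src != '\0') {
        if (*src == '+') {
            *dst++ = ' ';
            src++;
        } else if (*src == '%' && isxdigit((unsigned char) src[1])
                               && isxdigit((unsigned char) src[2])) {
            *dst++ = (char) (hex_value(src[1]) * 16 + hex_value(src[2]));
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return (size_t) (dst - s);
}
```

Decoding should happen only after the string has been split on “&” and “=”; otherwise an escaped ampersand or equal sign inside a value would be mistaken for a separator.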

The web server passes GET and POST data to the external process by different mechanisms. GET data is placed in an environment variable visible to that process. This environment variable is “QUERY_STRING”. Therefore, gaining access to GET data in C/C++ is a simple matter of this command:

char *pszGetData = getenv("QUERY_STRING");

(This should work in all UNIX and in all Microsoft development environments.)
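One caveat: getenv() returns NULL when the variable is not set, which can happen when the program is run by hand from a shell. A small wrapper, sketched here under the hypothetical name get_query_string, saves every caller from having to check for that.

```c
#include <stdlib.h>

/* Return the GET data, or an empty string when QUERY_STRING is unset
   (for instance when the program is started from a shell, not by the
   web server). The returned pointer must not be modified or freed. */
const char *get_query_string(void)
{
    const char *query = getenv("QUERY_STRING");
    return (query != NULL) ? query : "";
}
```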

On the other hand, POST data is passed to the external process on the standard input stream. For those of you not familiar with streams, it is the same data producer as the keyboard, so whatever it is you've been doing to read input from the keyboard, that is the mechanism you will use to access POST data. However, an inherent liability with streams is that you have no way of knowing how much data is waiting for you. The obvious solution is to keep reading byte by byte from that stream until there is no more data and keep resizing a dynamically allocated buffer accordingly. However, the web server provides the developer with another piece of information that saves the trouble of growing buffers, incurring the added overhead and handling the errors that occur when one of multiple dynamic memory allocations fails.

When a web server passes POST data to the standard input of an external process, it places all of the POST data there in one shot; therefore, there is never any chance that additional data will be added to the stream after you have first encountered a data portion on that stream.

The web server also tells the process just how much data it has placed in the standard input by writing (as ASCII text) the number of bytes waiting to be read from the standard input in an environment variable visible to that process. This environment variable is “CONTENT_LENGTH”. Therefore, gaining access to POST data in C/C++ requires a three-step process that will work in all UNIX and Microsoft development environments:

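/* Step 1: read the length; CONTENT_LENGTH may be unset, so guard getenv(). */
char  *pszLength      = getenv("CONTENT_LENGTH");
long   iContentLength = (pszLength != NULL) ? atol(pszLength) : 0;
char  *szFormData     = NULL;

/* Step 2: allocate one extra byte; calloc() zeroes it, so the data ends in NUL. */
if (iContentLength > 0)
    szFormData = (char *) calloc((size_t) iContentLength + 1, sizeof(char));
/* Step 3: read all iContentLength bytes from standard input. */
if (szFormData != NULL)
    fread(szFormData, sizeof(char), (size_t) iContentLength, stdin);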
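Because CONTENT_LENGTH ultimately comes from the client, it is worth validating before it is trusted. The following is a sketch only, with a hypothetical function name (read_post_body) and a caller-chosen size limit that the article does not specify. It takes the stream and the length string as parameters so that it can be exercised without a web server.

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Read exactly the number of bytes announced in content_length from in.
   Returns a NUL-terminated buffer that the caller must free(), or NULL
   if the length is missing, malformed, larger than max_len, or more
   than the stream actually holds. On success *out_len receives the
   number of bytes read. max_len should be a modest limit (well below
   the largest size_t) chosen by the application. */
char *read_post_body(FILE *in, const char *content_length,
                     size_t max_len, size_t *out_len)
{
    char *end;
    char *buf;
    unsigned long n;

    /* Accept only a plain run of decimal digits: no sign, no spaces. */
    if (content_length == NULL || !isdigit((unsigned char) content_length[0]))
        return NULL;
    errno = 0;
    n = strtoul(content_length, &end, 10);
    if (errno != 0 || *end != '\0' || n > max_len)
        return NULL;

    buf = malloc(n + 1);               /* one extra byte for the NUL */
    if (buf == NULL)
        return NULL;
    if (fread(buf, 1, n, in) != n) {   /* short read: reject the request */
        free(buf);
        return NULL;
    }
    buf[n] = '\0';
    *out_len = n;
    return buf;
}
```

In a CGI program the call would look something like read_post_body(stdin, getenv("CONTENT_LENGTH"), 65536, &len), where 65536 is an arbitrary limit picked for this example.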

A given request can contain both GET and POST data; therefore, both mechanisms may be used at the same time. In the <FORM> tag, the ACTION attribute may name a URL that itself carries GET data after a “?”, while the data contained in the form is passed to that URL as POST data.

How Your Program Passes Data Back to the Web Server

With the data from the web page obtained, your program can now perform all of its processing and can tell the web server what to reply. That reply may be a simple plain text message, an HTML document (the most common form), an image (typically in GIF or JPEG format) or any other complex data type. These data types are referred to as MIME types, and a standard subset is recognized by almost all browsers in use today. Your program passes data back to the web server (for transmission to the client browser) by writing that data to the process's standard output stream, the same mechanism that is used to write characters to the screen in your favorite language. The format for this data is simple:

Content-type:[SPC][MIME-type][NL][NL][Document-Data]

First, your program must declare the MIME type. The most common MIME types are as follows:

text/plain—Plain text that is output as block characters with exactly the alignment used when transmitted (no word wrap).
text/html—Standard HTML document text.
image/gif—Image encoded using one of the Compuserve GIF image specifications (it should be noted that the format uses Lempel-Ziv compression technologies which may not be in the public domain and may require the software producer or software user to license the software from the owner of the patent, which is Unisys).
image/jpeg—Image encoded using the JPEG image standard.

Second, your program must end the header with two newline characters (“\n\n”). The blank line this produces tells the web server where the header stops and the document begins.

Third, your program must prepare the body of the document you wish to transmit. It may be the content of a plain text or HTML document or the binary data that makes up the raw data block of a GIF or JPEG image.
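When the body may be binary, as with a GIF or JPEG image, the three parts above can be written by one small helper. This is a sketch under the assumption that the whole body is already in memory; the function name write_cgi_reply is made up for this example.

```c
#include <stdio.h>

/* Write a complete CGI reply to out: the Content-type header, the blank
   line that ends the header, then body_len bytes of body.
   fwrite() is used for the body so that binary data, which may contain
   zero bytes, is passed through intact.
   Returns 0 on success, -1 on a write error. */
int write_cgi_reply(FILE *out, const char *mime_type,
                    const void *body, size_t body_len)
{
    if (fprintf(out, "Content-type: %s\n\n", mime_type) < 0)
        return -1;
    if (body_len > 0 && fwrite(body, 1, body_len, out) != body_len)
        return -1;
    return (fflush(out) == 0) ? 0 : -1;
}
```

A CGI program would pass stdout as the first argument, for example write_cgi_reply(stdout, "image/gif", pixels, pixel_count), where pixels and pixel_count stand for data the program has prepared.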

Therefore, getting the web server to send a simple reply can be as easy as:

printf("Content-type: text/html\n\n"
       "<HTML><HEAD></HEAD>"
       "<BODY><H3>My Quick Test Page</H3></BODY></HTML>\n");

That's it; those are the basics for telling the web server what to reply to your client's request. There are, of course, some cute things one can add to this basic format that lend a measure of control over how the document is rendered. One example is the addition of the “charset=” qualifier after the MIME type (separated from it by a semicolon, right before the newlines), which ensures that the browser will render the HTML document being transmitted using the appropriate character set (examples are “ISO-8859-1”, “ISO-8859-2”, “KOI8-R”, “windows-1251”, etc.). Therefore, the savvy programmer may wish to send out the document like this:

printf("Content-type: text/html; charset=KOI8-R\n\n"
       "<HTML><HEAD></HEAD>"
       "<BODY><H3>Maya malinkaya proba</H3></BODY></HTML>\n");

Pushing Continual Updates to the Browser

Every so often the purpose of a web page is to monitor some long and involved process that typically takes longer than one time-out period to complete or to generate a full update. This is another situation that can be dealt with well in legacy languages like C/C++ and Fortran. The idea is to force the web server to keep the TCP pipe open to the browser and to keep pushing new documents down to the browser at an interval specified by your program.

The formula to accomplish this, given here, is specific to the Apache web server, which as we all know, is the most popular HTTP dæmon used in the Linux world to date. If you are unsure whether this will work with your particular HTTP dæmon, try it and let me know. Here are the steps:

1. Rename your program's binary to begin with the characters “nph-”. This means that if the binary of your program is named “update.cgi”, then change its name to “nph-update.cgi”.
2. Transmit the HTTP status line that the web server normally hands to the web browser (this is done for reasons that will be explained below):

printf("HTTP/1.0 200 OK\n");

3. Define the MIME type of the document as “multipart/x-mixed-replace”:

printf("Content-Type: multipart/x-mixed-replace;"
       "boundary=SoMeRaNdOmTeXt\n");

4. Initiate the first document transmission by passing the token declared in “boundary”, preceded by two hyphens:

printf("\n--SoMeRaNdOmTeXt\n");

5. Send the next document update. This is simply a document that should be displayed until the subsequent transmission goes out along the same open connection at some point in the future. The update is followed by another instance of the token declared in “boundary”:

printf("Content-type: text/html\n\n"
       "<HTML><HEAD></HEAD>"
       "<BODY><H3>Update #%d</H3></BODY></HTML>\n"
       "\n--SoMeRaNdOmTeXt\n", Count++);

6. Flush the standard output buffer:

fflush(stdout);

7. Repeat steps five and six until all updates have been transmitted. On the last update, do not transmit the token; simply flush standard output and exit. This will leave the last update in the client browser's window after your program exits.

A simple example of a program that uses server-side push to count on your browser's screen from one to ten with a delay of one second between count updates is shown in Listing 1.

Listing 1. Count to Ten
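The listing itself is not reproduced in this text. As a stand-in, here is a minimal sketch of such a counter that follows the steps given above. It is not the author's Listing 1: the function name push_count and its parameters are assumptions made for illustration, and sleep() ties it to UNIX-like systems.

```c
#include <stdio.h>
#include <unistd.h>   /* sleep(); POSIX, so this assumes a UNIX-like system */

#define BOUNDARY "SoMeRaNdOmTeXt"

/* Push the numbers 1..last to out as a multipart/x-mixed-replace stream,
   pausing delay_seconds between updates. Returns 0 on success, or -1 if
   flushing fails (for example, because the connection was closed). */
int push_count(FILE *out, int last, unsigned delay_seconds)
{
    int i;

    /* Status line, multipart header, then the first boundary token. */
    fprintf(out, "HTTP/1.0 200 OK\n");
    fprintf(out, "Content-Type: multipart/x-mixed-replace;boundary=%s\n",
            BOUNDARY);
    fprintf(out, "\n--%s\n", BOUNDARY);

    for (i = 1; i <= last; i++) {
        /* One complete document per update. */
        fprintf(out, "Content-type: text/html\n\n"
                     "<HTML><HEAD></HEAD>"
                     "<BODY><H3>Update #%d</H3></BODY></HTML>\n", i);

        /* Every update except the last is followed by the boundary. */
        if (i < last)
            fprintf(out, "\n--%s\n", BOUNDARY);

        /* Flush so the update leaves the buffer right away. */
        if (fflush(out) != 0)
            return -1;

        if (i < last && delay_seconds > 0)
            sleep(delay_seconds);
    }
    return 0;
}
```

In a program installed as “nph-count.cgi” (a hypothetical name), main() would simply return push_count(stdout, 10, 1).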

In order to explain how this works, it is necessary to understand a little bit about what the server does in the background. Up until this point, your program's output was verified for validity (i.e., a proper MIME type, proper separators, etc.) and was passed on to the client browser with some additional HTTP headers prepended to it. In order to take more control over the web server/client browser interaction, we must ask the web server to stop performing these validity checks and to stop adding its normal headers. This is what the “nph” in your program's new filename stands for: No Parsed Headers. When the name of your program begins with the letters “nph-”, the web server assumes that your program is responsible for performing all of the validation checks and header transmissions that would normally be the responsibility of the web server. The web server will simply keep the TCP pipe open to the client browser, grab data as it comes out of your program's standard output stream and push it down that TCP pipe to the browser. We are now in a position to understand what is happening in step two; this is a required header that is normally transmitted by the web server and was completely transparent to the program hiding behind the CGI.

Next, we must tell the client browser to expect continual updates, not just one single burst of data, and, therefore, it must not close the TCP pipe once the first document has been transmitted. This is accomplished by specifying the MIME type of the document as being “multipart/x-mixed-replace”. In addition, we need to tell the browser how to differentiate between the documents in the stream of multiple documents about to be transmitted. This is accomplished by attaching the qualifier “boundary=SoMeRaNdOmTeXt” to the MIME-type declaration. This tells the web browser that anytime it encounters the sequence of bytes “--SoMeRaNdOmTeXt” in its input stream, it should stop and assume that the following data will describe a new document that will replace the one which currently exists in the document window.

The string that separates the end of one document transmission and the beginning of the next is usually referred to as a boundary token, and this token is normally much more complicated than the one shown in our example here. Normally it is a 50- or 60-byte-long alphanumeric string generated by a randomizer function. The string should be sufficiently long and its contents sufficiently random so as to minimize the chances that it will accidentally occur as part of the body of your document.
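A randomizer of the kind described can be quite small. The sketch below is one possible version, not the author's; the name make_boundary is invented for this example. It relies on the standard rand() function, which is adequate for avoiding accidental collisions with document text but is not a secure random source.

```c
#include <stdlib.h>

/* Fill buf with len random alphanumeric characters followed by a NUL.
   buf must have room for len + 1 bytes. The caller should seed the
   generator once with srand() before the first call. */
void make_boundary(char *buf, size_t len)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789";
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = alphabet[(size_t) rand() % (sizeof alphabet - 1)];
    buf[len] = '\0';
}
```

A program would typically call srand((unsigned) time(NULL)) once at startup, generate a 60-character token into a 61-byte buffer and use that token in place of “SoMeRaNdOmTeXt”.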

Finally, once the document has been pushed to standard output and the boundary token has also been pushed out, it is necessary to flush the output buffer in order to ensure the data gets sent to the client browser. If this is not done, the data will sit in the output stream's buffer until that buffer fills up or your program exits, and only then will it be sent.

Bulletproofing and Parsing Web Server Input

The greatest deterrent to writing web programs in these legacy languages, and probably the greatest driver behind the development of Perl and PHP, has been the difficulty and security risk involved in developing applications that can parse the data passed to them from the web browser, and withstand hacker attacks hidden in that data, using only environment variables and the standard input stream.

The first thorny issue that must be solved is finding an easy and memory-efficient method of parsing this data so that one can simply select the field one is looking for and obtain the data in a one-shot, one-kill fashion. In addition, certain security holes need to be plugged, such as data overruns from a misbehaving client browser intended to overwrite application memory with the overrun data (or deny service).

I present here, for your browsing pleasure, a series of functions that provide just such a safe and secure one-shot, one-kill approach to obtaining POST data in these legacy languages. The specific example I present is in C but can easily be ported to Fortran or wrapped for C++:

char  *TextField    = GetFormStringValue("TextField");
int    NumericField = GetFormIntegerValue("IntegerField");
float  FloatField   = GetFormFloatValue("FloatField");

The source for these functions is shown in Listing 2 and the source for their support functions is shown in Listing 3. [Due to the length of Listings 2 and 3, they are available from our ftp site, ftp.linuxjournal.com/pub/lj/listings/issue82.] All of these functions have been tested to work equally well in UNIX and Windows development environments, and they compensate for both buffer overruns and underruns. When any of these functions is first called, dynamic memory allocation to capture and parse the POST data is performed in the background. Its parsed form is then held in memory and, on subsequent calls to any of these functions, simple linear scans of the fields in this memory space are performed. Memory allocation is performed only once, and all conversion of escape sequences and special characters is performed linearly within this memory space (no other temporary space is used to accomplish this).
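To make the parse-once, scan-linearly idea concrete, here is a rough sketch of how a POST buffer might be split in place. It is not the code from Listings 2 and 3; the type form_field and the functions split_form_data and find_field are hypothetical, and the conversion of escape sequences is left out to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* One parsed field; both pointers refer into the original buffer. */
struct form_field {
    char *name;
    char *value;
};

/* Split "a=1&b=2" in place: each '&' and the first '=' of each pair are
   overwritten with NUL, so names and values become C strings inside the
   original buffer and no further memory is allocated.
   Returns the number of fields stored; pairs beyond max are ignored. */
size_t split_form_data(char *data, struct form_field *fields, size_t max)
{
    size_t n = 0;
    char *p = data;

    while (p != NULL && *p != '\0' && n < max) {
        char *amp = strchr(p, '&');
        char *eq;

        if (amp != NULL)
            *amp = '\0';
        eq = strchr(p, '=');
        fields[n].name = p;
        if (eq != NULL) {
            *eq = '\0';
            fields[n].value = eq + 1;
        } else {
            fields[n].value = p + strlen(p);   /* no '=': empty value */
        }
        n++;
        p = (amp != NULL) ? amp + 1 : NULL;
    }
    return n;
}

/* Linear scan for a field by name; returns NULL when it is absent. */
const char *find_field(const struct form_field *fields, size_t n,
                       const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(fields[i].name, name) == 0)
            return fields[i].value;
    return NULL;
}
```

Capping the number of fields with max is one way to keep a misbehaving client from forcing unbounded work; the right limit depends on the forms the program expects.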

Since the example shown here is in simple C, which cannot provide automatic destructors the way that C++ can, it is necessary to call one cleanup function when your program exits: ReleaseFormData()

This is necessary to release the dynamically allocated memory buffer. If these functions are ported to a C++ class, it is simply necessary to call this function in the destructor method of the class to which the POST data access functionality is ported. Therefore, a simple framework for your legacy language CGI program is shown in Listing 4.

Listing 4. Legacy Language CGI Framework

Future Topics

Of course, we have only scratched the surface of what is possible when you unleash the power of a fast and efficient language like C/C++ for development of web applications, without the added drag of having to perform all of the mundane jobs normally performed by a script interpreter. It is easy to see why we need to expand this discussion to include the following:

Using the local file system to maintain “state” for your CGI programs.
Why state can be maintained on the local file system in Linux without the concerns for disk overhead one might have on other operating systems.
Creating, modifying and destroying cookies on your client browser from your CGI programs.
Setting up security so that only you and the CGI program can access the state information in the files on your local file system and nobody else.
Thinking ahead to lightweight threads and fast-CGI.

Dan Teodor makes his living doing management consulting for PriceWaterhouseCoopers in Houston, Texas. He has been a closet Linux geek since his graduate school days and the Slackware 0.x kernel releases. While big on large scale web app deployment and funky commerce architectures, he dreams about skiing New Mexico and launches a new bankrupt dot-com once a year.
