A Recipe for Making Cookies
The overwhelming majority of URLs begin with the letters “http”, which stands for “hypertext transfer protocol”. Just as e-mail is transferred using SMTP (Simple Mail Transfer Protocol) and files are often retrieved using FTP (File Transfer Protocol), files written in HTML are generally transmitted using HTTP.
Why did the Web's inventors create a new protocol for transmitting hypertext, rather than sticking with previous ones? One answer is that they were interested in allowing servers to respond quickly and efficiently to requests from browsers. The client (browser) side of an HTTP transaction consists of a request for a document, optionally accompanied by headers describing such things as the content types the browser will accept and the modification date of any copy it already holds. The server responds to the request by describing the document, including its content type, and returning the document. Once the document is sent, the server closes the connection. Because the two sides exchange a minimum of information and then break the connection, documents are transmitted with low overhead and thus at a relatively fast clip.
This “statelessness” (the fact that each connection is used to transmit a single document and that each transaction takes place in a vacuum) was a terrific idea in the early days of the Web. It meant that browsers and servers had to keep track of very little information when transmitting documents, thus reducing the size and increasing the speed of these programs.
As a result, if we look at the access log from a typical web server, we see a list of document requests as well as the IP address (i.e., the number that uniquely identifies a computer on the Internet) of the computer from which the request originated. We do not, however, know whether three requests made from the same computer at roughly the same time were made by the same person or by three different people.
In many cases this would not be a problem; after all, if my web site is set up to serve out pages of HTML, then I probably don't care whether 1,000 different people have visited my site or if the same person has read 1,000 documents. For many sites statelessness does not present any obstacles.
However, many site owners, particularly commercial ones, are increasingly frustrated with the Web's inherent statelessness. It is much easier to sell advertising when you have a precise count of the number of people visiting your site, rather than a list of how many times each document was accessed. The number of “hits”, or individual HTTP requests received by a server, is a reasonable measure of a site's success only in the non-profit and personal sector; commercial sites are far more interested in how many pages were viewed by a given number of individuals.
Even small personal sites occasionally like to keep track of users. If you want to personalize a user's view of your site, you need a way to keep track of each user's preferences, rather than a single setting which applies to all users. And, while you could certainly get a user's name (and password, if necessary) via HTML forms, forcing the user to enter this on every page, or even upon arriving at your site's home page, would be a great burden on the user.
This month we will look at one of the most popular ways to keep track of user state, best known as HTTP cookies. Cookies allow servers to store small pieces of data on the user's computer, and thus to keep track of a user's movements on our site. Note that while cookies can be used to keep track of a user's movements, and potentially build a profile which might be of use to advertisers, they cannot collect any information which the user does not provide. Fears of privacy abuse might be true in some cases (and designers should recognize that cookies will offend and upset some users), but the fear that cookies can somehow collect information from your computer without your knowledge is off the mark. Cookies simply make it much easier to create interesting sites.
Cookies are small (up to 4KB) pieces of data stored on the user's computer by the browser. In addition to a name/value pair, a cookie can be tagged with an expiration date limiting the length of time it may be stored, and it carries an indicator of the Internet host or domain that originally created it.
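To make these pieces concrete, here is a minimal sketch in Python (not the Perl used in this article's listings) that assembles one cookie with Python's standard http.cookies module. The host name and expiration date are assumptions chosen purely for illustration.

```python
from http.cookies import SimpleCookie

# Build one cookie and attach the tags described above.
jar = SimpleCookie()
jar["counter"] = "1"                                    # the name/value pair
jar["counter"]["expires"] = "Wed, 31 Dec 1997 23:59:59 GMT"  # assumed date
jar["counter"]["domain"] = "www.example.com"            # hypothetical host
jar["counter"]["path"] = "/cgi-bin/"

# output() renders the cookie as a single "Set-Cookie: ..." header line.
header_line = jar.output()
print(header_line)

# Rough size check of the name/value pair against the 4KB ceiling.
pair_size = len("counter") + len(jar["counter"].value)
print(pair_size <= 4096)
```

The exact order and capitalization of the attributes in the printed header are decided by the library, so the sketch does not rely on them.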
The basic rule to remember when dealing with cookies is that the value of a cookie is set by the server using HTTP responses, and browsers return those values using HTTP requests. It's a bit disconcerting to think of things this way; we are not used to responses from servers containing a request of their own.
Let's say that we have a CGI program that returns a small bit of HTML when invoked. Assuming that the program is in the /cgi-bin directory and is called sample.pl, our browser would retrieve it by connecting to the server on port 80 and issuing a request like this one:
GET /cgi-bin/sample.pl HTTP/1.0
This request says that we are using HTTP 1.0 and would like the server to send us the document /cgi-bin/sample.pl. The server, because of its configuration options, knows that anything in /cgi-bin is a program, and so it executes sample.pl, returning the output. Here is an example of what sample.pl might return:
HTTP/1.0 200 OK
Content-type: text/html

<HTML>
<Head><Title>Test</Title></Head>
<Body><P>Test</P></Body>
</HTML>
The above is about as minimal as a modern HTTP transaction can get. A single header (Content-type) following the status code and preceding the message body is returned. Most of the time, more information is included in the response headers, such as the server name and version number and the date on which the document was created. If the server wants to set a cookie on the browser's computer, it must include an additional header, named Set-cookie. Just as the Content-type header defines the type of data that is being returned in the response, the Set-cookie header defines the name and value for a cookie that applies to the site from which the response originated.
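As a rough illustration of where the Set-cookie header sits, the following Python sketch (again, not the article's Perl) builds the text a CGI program might print: the headers, a blank line, then the HTML body. The function name build_cgi_output is hypothetical.

```python
def build_cgi_output(html_body, cookie_name, cookie_value, path="/"):
    """Return headers, a blank line, and the body, as a CGI program prints them."""
    headers = [
        # The extra header that asks the browser to store a cookie.
        f"Set-Cookie: {cookie_name}={cookie_value}; path={path}",
        # The header every CGI response needs to describe its body.
        "Content-type: text/html",
    ]
    # A blank line (two line endings in a row) separates headers from body.
    return "\r\n".join(headers) + "\r\n\r\n" + html_body


output = build_cgi_output(
    "<HTML><Body><P>Test</P></Body></HTML>", "counter", "1", "/cgi-bin/"
)
print(output)
```

The web server adds the status line itself, which is why the sketch starts with the headers.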
For example, Listing 1 contains a short program (cookie-test.pl) that creates a cookie on the user's computer. If we run cookie-test.pl from a web browser, we see the HTML output produced by the program. If it were not for the program's polite indication that it had set a cookie, we would never know unless we asked our browser to warn us each time. (I tried this feature on discovering it in Netscape Navigator 3.0, but I quickly turned it off when I discovered how often such dialog boxes were interfering with my web browsing and how innocuous most of them appeared to be.)
The Set-cookie header becomes obvious if we use telnet to look at the output sent by the program. From my computer running Red Hat Linux 4.2, I type:
telnet localhost 80
which opens a connection to the Apache HTTP server running on my computer. I then type:
GET /cgi-bin/cookie-test.pl HTTP/1.0
followed by two line-feed characters, which indicate the end of my request. As in the example above, my server knows that anything in /cgi-bin is a program, so it executes cookie-test.pl and sends the output from that program to the user's browser. When I enter the above request and press the return key twice (once to end the request line and once more to indicate that the entire request is finished), I get the following:
HTTP/1.1 200 OK
Date: Tue, 23 Sep 1997 09:15:42 GMT
Server: Apache/1.2.4
Set-cookie: counter=1; path=/cgi-bin/
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML><HEAD><TITLE>Cookie set</TITLE>
</HEAD><BODY><P>The cookie named "counter" has been set to 1.</P>
</BODY></HTML>
Connection closed by foreign host.
The above response is a bit more detailed than the skeleton response that we looked at earlier, but its contents should still be fairly clear. We get a 200 (“OK”) message from the web server, the date and time at which the response was generated, the server that produced the response, the connection type and the content type.
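For readers who would rather pick such a response apart in code than by eye, here is a small Python sketch that splits a captured response into its status line, headers and body. The response text is copied from the telnet session above with the HTML body abbreviated; the function name split_response is hypothetical.

```python
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Tue, 23 Sep 1997 09:15:42 GMT\r\n"
    "Server: Apache/1.2.4\r\n"
    "Set-cookie: counter=1; path=/cgi-bin/\r\n"
    "Connection: close\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<HTML>...</HTML>"  # body abbreviated for this sketch
)


def split_response(raw):
    """Split a raw HTTP response into (status line, header list, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = []
    for line in lines[1:]:
        # Header names are case-insensitive, so normalize them to lowercase.
        name, _, value = line.partition(":")
        headers.append((name.strip().lower(), value.strip()))
    return lines[0], headers, body


status_line, headers, body = split_response(RAW_RESPONSE)
set_cookie_values = [value for name, value in headers if name == "set-cookie"]
print(status_line)
print(set_cookie_values)
```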
The Set-cookie header tells our browser that we should now hold onto a cookie named counter, whose value should be 1. In the future, every time my browser asks for a document in the cgi-bin path from this host, it will send the value of the counter cookie, which will still be set to 1. Listing 2 is a short CGI program that prints the names and values of all cookies sent to it.
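The receiving side can be sketched in Python as well. A CGI program finds the browser's cookies in the HTTP_COOKIE environment variable; the sketch below parses such a string with the standard http.cookies module. It is an illustration of the idea only, not a translation of Listing 2, and parse_cookie_header is a hypothetical name.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header_value):
    """Turn a Cookie header value into a plain name -> value dictionary."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# In a real CGI program the string would come from os.environ["HTTP_COOKIE"].
cookies = parse_cookie_header("counter=1")
for name, value in sorted(cookies.items()):
    print(f"{name} = {value}")
```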
Note that our program only sees a single cookie, while I know that my browser has held onto far more cookies than this. I do not have to enter my password when entering certain sites at which I have registered, and there are a large number of cookies in ~/.netscape/cookies, the file in which Netscape's browsers place cookies. Why does only one cookie appear?
The answer is that when a browser visits a site, it only sends those cookies that were created by that site. Thus, when I am accessing my local web server, only those cookies created by my local web server are available to CGI programs there. If I were to access The New York Times, only those cookies set by the nytimes.com domain would be available to their system. One of the cornerstones of the cookie specification is that the cookie name and value pair should only be sent to the hosts or domains that created them.
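The browser's side of this rule can be approximated in a few lines of Python. This is a deliberately simplified sketch: real browsers apply stricter matching than the suffix and prefix tests used here, and the StoredCookie class, the function names and the sample cookie values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoredCookie:
    name: str
    value: str
    domain: str   # host or domain that created the cookie
    path: str     # path prefix the cookie applies to


def domain_matches(request_host, cookie_domain):
    """True if the host is the cookie's domain or a host inside it."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def cookies_to_send(jar, request_host, request_path):
    """Select only the cookies whose domain and path fit this request."""
    return [
        cookie for cookie in jar
        if domain_matches(request_host, cookie.domain)
        and request_path.startswith(cookie.path)
    ]


jar = [
    StoredCookie("counter", "1", "localhost", "/cgi-bin/"),
    StoredCookie("session", "abc", "nytimes.com", "/"),  # made-up cookie
]
for cookie in cookies_to_send(jar, "localhost", "/cgi-bin/show-cookies.pl"):
    print(cookie.name, cookie.value)
```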
Those of you worried that information about your web browsing interests is shared among sites (thus violating your privacy) might still be right, but this cannot happen automatically with cookies unless all of the sites you visit are within the same domain. Indeed, HotWired used to have several sites with a shared password system that relied on cookies. Because the sites all had different domain names, however, I had to have a separate cookie on my system for each sub-site. The result was that I had to enter my user name and password the first time I visited each of these sites.
Now we know how to create cookies. Our CGI program uses CGI.pm's cookie method to create one with a name and value, and then puts it into the header returned to the browser. We also know how to write programs that can grab cookies' values. To get a list of all cookies, use the same cookie method, then iterate through the list of names that it returns. Once we have those names, we can retrieve the values with the cookie method, giving it an argument of a particular cookie name.
Storing a value isn't of much use unless we can also change it. Our next task is to combine parts of cookie-test.pl and show-cookies.pl into a single program which increments the counter cookie value each time we visit the site, displaying its value each time. The first time we visit this CGI program, it sets the cookie's value to 1; the second time, it sets the value to 2, and so on.
You can see a bare-bones attempt at this sort of program in Listing 3. As you can see, the code is fairly straightforward. We create an instance of CGI and use the cookie method to extract the value of the counter cookie. We increment that value by 1, create a new counter cookie with the updated value and send that value back as part of the header to the program's response. The body of the response contains a short listing of the names and values of each cookie in the system.
Each time we invoke update-counter.pl, the user's browser recognizes counter as a cookie with the appropriate host or domain name and path, and thus sends counter as a cookie with its request. Update-counter.pl grabs the value of the cookie if it exists and sets it to 0 if it does not. It then increments the value of the cookie and creates a new (outgoing) cookie with the counter name and the updated value. This new cookie is included in the headers which update-counter.pl sends to the user's browser, and the value of the cookie is displayed in the body of the response which contains HTML-formatted text.
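The same read, increment and resend cycle can be sketched in Python. This is an illustration of the logic described above rather than a translation of Listing 3; update_counter is a hypothetical name, and treating a non-numeric value as 0 is an assumption of this sketch.

```python
from http.cookies import SimpleCookie


def update_counter(cookie_header):
    """Return (new counter value, Set-Cookie header line) for one request."""
    incoming = SimpleCookie()
    incoming.load(cookie_header)
    try:
        current = int(incoming["counter"].value)
    except (KeyError, ValueError):
        # No counter cookie yet, or a value that is not a number: start at 0.
        current = 0
    new_value = current + 1

    # Create the outgoing cookie that replaces the old value in the browser.
    outgoing = SimpleCookie()
    outgoing["counter"] = str(new_value)
    outgoing["counter"]["path"] = "/cgi-bin/"
    return new_value, outgoing.output()


first_value, first_header = update_counter("")            # first visit
second_value, second_header = update_counter("counter=1") # second visit
print(first_header)
print(second_header)
```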
This program may not seem very useful, but with a few small variations, it could be useful in a plethora of situations. For example, you could ensure that users only enter a questionnaire once or keep track of how many times they have requested technical support via the Web rather than by telephone. Another possibility might be a web-based quiz game which presented questions one at a time. You could keep track of a user's score with cookies. Alternatively, you could keep track of which questions the user had already seen, so as not to ask the same question twice. You could even keep track of the user's high score, giving a special message if and when the user achieves a new high.
If we were interested in keeping track of multiple values, we could simply create a number of separate cookies. The cookie specification indicates that each host or domain can store up to 20 cookies. Apart from a note in the CGI.pm documentation indicating that some versions of Netscape Navigator place a much lower limit on this number, storing multiple cookies poses little difficulty in practice.
An alternative to storing many values in cookies is to store a single identifying value in a cookie and keep the rest of the user's information in a relational database on the server. How do we do this? First, we create a table in our relational database that gives a unique identifier (known as a primary key) to each row in the table. For example, if we want to keep track of each user's first name and favorite color, we can create a table using the following statements in SQL:
create table user_table (
    user_id mediumint auto_increment primary key,
    user_name varchar(60) not null,
    user_color varchar(10) not null
);
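Sending several cookies simply means sending several Set-cookie headers in one response. The Python sketch below shows one way to do that while guarding against the 20-cookie limit cited above; build_cookie_headers and the sample values are hypothetical.

```python
from http.cookies import SimpleCookie

MAX_COOKIES_PER_HOST = 20  # the per-host limit cited in the text


def build_cookie_headers(values):
    """Return one Set-Cookie header line per name/value pair."""
    if len(values) > MAX_COOKIES_PER_HOST:
        raise ValueError("too many cookies for a single host")
    jar = SimpleCookie()
    for name, value in values.items():
        jar[name] = value
    # output() joins the header lines with CRLF; split them apart again.
    return jar.output().split("\r\n")


cookie_headers = build_cookie_headers({"user_name": "Alice", "user_color": "blue"})
for line in cookie_headers:
    print(line)
```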
With the above table created in our database, we can create an HTML form into which a user enters his or her name and favorite color (see Listing 4).
Now that we have an HTML form that allows users to submit their name and favorite color to a CGI program, we need to write that program, submit-cookie.pl (see Listing 5). The program first checks to see if the user already has a cookie; if so, it simply updates the existing elements in the user table. A more robust version of this program would check to see if an entry in the table really existed or if the cookie value was not valid for our site.
If no user_id cookie exists, submit-cookie.pl needs to create a new entry in the database table and return a new cookie assignment to the user's browser. We thus insert a new row into the table whose values depend on the values submitted from the HTML form. When we have completed sending our SQL query to the server, we ask the server for the unique ID used when inserting our row, which serves as a user ID and is stored in the user_id cookie by the user's browser. We get this value by using the Mysql insert_id method, which tells us exactly this piece of information. Once we have this information, we create a new cookie and return it as part of the HTTP headers in the response to the user.
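The update-or-insert decision can be sketched in Python. Two substitutions are made so that the sketch runs anywhere: SQLite (through Python's sqlite3 module) stands in for MySQL, and the cursor's lastrowid attribute plays the role that insert_id plays in the article. The function names and sample values are hypothetical, and this is not a translation of Listing 5.

```python
import sqlite3
from http.cookies import SimpleCookie


def open_database():
    """Create an in-memory stand-in for the article's user_table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table user_table ("
        " user_id integer primary key autoincrement,"
        " user_name varchar(60) not null,"
        " user_color varchar(10) not null)"
    )
    return conn


def submit(conn, cookie_header, user_name, user_color):
    """Update the user's row if a user_id cookie came in, else insert one.

    Returns (user_id, Set-Cookie header line or None).
    """
    incoming = SimpleCookie()
    incoming.load(cookie_header)
    if "user_id" in incoming:
        # As in the article, this trusts the cookie; a more robust version
        # would verify that the row exists and the value is valid.
        user_id = int(incoming["user_id"].value)
        conn.execute(
            "update user_table set user_name = ?, user_color = ?"
            " where user_id = ?",
            (user_name, user_color, user_id),
        )
        conn.commit()
        return user_id, None

    cursor = conn.execute(
        "insert into user_table (user_name, user_color) values (?, ?)",
        (user_name, user_color),
    )
    conn.commit()
    user_id = cursor.lastrowid  # the key the database assigned to the new row
    outgoing = SimpleCookie()
    outgoing["user_id"] = str(user_id)
    return user_id, outgoing.output()


conn = open_database()
new_id, new_header = submit(conn, "", "Alice", "blue")            # first visit
same_id, no_header = submit(conn, "user_id=1", "Alice", "green")  # return visit
print(new_header)
```

Only the row's key travels in the cookie; the name and color never leave the server's database.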
In either case—whether we create a row in the table or update an existing row—the user is presented with a link to homepage.pl (see Listing 6), a personalized home page program that displays the information we have collected. Remember, none of this information is stored in the cookie on the user's computer. Rather, the information is stored in a table in our relational database with the index stored in the user's cookie file.
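The lookup behind such a personalized page might look like the following Python sketch. As an assumption made for portability, SQLite stands in for MySQL; the function name, the sample row and the wording of the messages are hypothetical, and this is not a translation of Listing 6.

```python
import html
import sqlite3
from http.cookies import SimpleCookie


def render_homepage(conn, cookie_header):
    """Return an HTML fragment based on the row named by the user_id cookie."""
    incoming = SimpleCookie()
    incoming.load(cookie_header)
    try:
        user_id = int(incoming["user_id"].value)
    except (KeyError, ValueError):
        # No cookie, or a value that cannot be a row key.
        return "<P>Please tell us your name and favorite color first.</P>"

    row = conn.execute(
        "select user_name, user_color from user_table where user_id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        return "<P>We could not find your preferences.</P>"

    user_name, user_color = row
    # Escape stored values before placing them in HTML.
    return (
        f"<P>Welcome back, {html.escape(user_name)}! "
        f"Your favorite color is {html.escape(user_color)}.</P>"
    )


# Sample data for the sketch: one table, one user.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table user_table (user_id integer primary key,"
    " user_name varchar(60) not null, user_color varchar(10) not null)"
)
conn.execute("insert into user_table values (1, 'Alice', 'blue')")
page = render_homepage(conn, "user_id=1")
print(page)
```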
Obviously, the user's name and favorite color are just examples. A site could allow users to indicate a set of preferences and use a database to choose graphics, text and even hyperlinks based on those preferences.
That's about it for our gentle introduction to cookies and what they can do for you and your web site. There are a few elements of cookie creation and administration that I did not go into, such as expiration dates and security, but those are easily understood after reading one or more of the specifications mentioned in Resources. Suffice it to say that anyone interested in keeping the values of cookies past the current invocation of a browser must handle expiration dates, since cookies created without them only last until the user quits the browser.
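To illustrate that last point, the Python sketch below builds a Set-cookie header with and without an expiration date. The function name make_cookie_header and its parameters are hypothetical; the date is formatted with the standard email.utils.formatdate function, and the fixed starting time exists only to keep the example reproducible.

```python
import time
from email.utils import formatdate


def make_cookie_header(name, value, lifetime_seconds=None, now=None):
    """Build a Set-Cookie header line, optionally with an expiration date."""
    header = f"Set-Cookie: {name}={value}"
    if lifetime_seconds is not None:
        start = time.time() if now is None else now
        # usegmt=True yields a date such as "Thu, 01 Jan 1970 01:00:00 GMT".
        expires = formatdate(start + lifetime_seconds, usegmt=True)
        header += f"; expires={expires}"
    return header


# Without an expiration date the cookie lasts only until the browser quits.
session_header = make_cookie_header("counter", "1")

# With one, the browser may keep the cookie until that date (one hour here).
persistent_header = make_cookie_header("counter", "1", lifetime_seconds=3600, now=0)

print(session_header)
print(persistent_header)
```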