Writing Man Pages in HTML

HTML is a cool way to look at the Linux man pages. Here's how to do it.

Technical Overview

This section will cover a few of the implementation details of vh-man2html. It's very brief and is really intended to point out that CGI scripting is something that anyone with a little programming knowledge can do with success.

Without getting into a tutorial on CGI scripting, a CGI script is a program executed by the remote HTTP daemon (i.e., web server). A web browser can cause a remote web server to run a CGI program when you follow an HTTP link that matches its name. For example, pointing a web browser at:


executes cgi-bin/man2html on Caldera's web server. The CGI programs that a web server is prepared to run are usually restricted to those found in cgi-bin directories on the server.

The CGI script can return output to the remote caller by writing a document to its standard output. The start of the output document must contain a small text header describing its contents. In the case of man2html the content returned is an HTML page. Listing 1 shows the HTML output from the man page to HTML converter; the header line is:

content-type: text/html

and the rest of the document is normal HTML, which consists of text marked up with HTML tags. With some web browsers, you can use options like Netscape's “View Document Source” to inspect this HTML source.
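
As a minimal sketch of the output format just described, here is a small Python CGI program that writes the header line, the blank line that ends the header, and then an HTML body. The function name build_page and the page contents are made up for illustration; they are not taken from vh-man2html.

```python
import sys


def build_page(title, body_html):
    """Return a complete CGI response: header, blank line, then HTML."""
    header = "Content-type: text/html\n\n"  # the blank line ends the header
    html_doc = (
        "<HTML><HEAD><TITLE>" + title + "</TITLE></HEAD>\n"
        "<BODY>" + body_html + "</BODY></HTML>\n"
    )
    return header + html_doc


if __name__ == "__main__":
    # A web server running this as a CGI program relays stdout to the browser.
    sys.stdout.write(build_page("Hello", "<H1>Hello from CGI</H1>"))
```

Run from a shell, the program simply prints the header and the page; the web server is what turns that output into an HTTP response.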

A script may create an HTML page that contains further references to other CGI scripts. In Listing 1 the following reference returns the reader to the main vh-man2html contents page:

<A HREF="http:/cgi-bin/man2html">Return to Main

A CGI script receives input that may have been embedded in the original reference or added as a result of user input. For example, in Listing 1, the “SEE ALSO” section directs the cgi-bin/man2html program to return the HTML for a specific manual page:

<A HREF="http:/cgi-bin/man2html?man1/from.1l">

In this case the HTTP reference is supplied with a single parameter, “man1/from.1l”, the name of a man page. The start of the parameter list is delimited by a “?”. If there were more than one argument, they would be separated by “+” signs (and there are conventions for how to pass special characters such as “+” and “?” as parameters). The CGI program won't see any of the delimiting characters; it just receives the parameters as arguments in its normal argument list (or optionally via standard input). This means the CGI script doesn't have to concern itself with how its input was delivered over the network; it simply receives it in the form of command-line arguments, standard input and a variety of environment variables.
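
The decoding is done by the web server, not by the CGI program, but a rough Python sketch of it may make the convention clearer. The function name query_to_args is hypothetical, and the rule that a query containing “=” is treated as form data rather than as arguments is my reading of the usual CGI convention, so treat it as an assumption.

```python
from urllib.parse import unquote


def query_to_args(query):
    """Split an ISINDEX-style query string into decoded arguments.

    Assumption: a query containing an unencoded "=" is form data and is
    not turned into command-line arguments at all.
    """
    if not query or "=" in query:
        return []
    # "+" separates arguments; %XX escapes carry special characters.
    return [unquote(part) for part in query.split("+")]


print(query_to_args("man1/from.1l"))
print(query_to_args("mail+1"))
```

A literal “+” inside an argument would arrive as “%2B”, which unquote turns back into “+” only after the split has been done.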

In addition to clicking on references, the user can also enter data into input fields. The simplest way for a CGI program to introduce an input field onto a form is to include the tag <ISINDEX> in the HTML it generates. This results in a single input field, such as in Figure 1. If the user enters anything in the input field and presses return, the server will re-run the CGI program, passing it the input via the parameter passing conventions we've just discussed. You can also create HTML forms, but I'm not going to discuss them here.

By generating the kinds of HTML references presented above, CGI programs can perform complex interactions with the remote user. The beauty of all this is that, to get started, the only skill you need is the ability to write fairly simple code in a language of your choosing. You need to know how to process command-line arguments and write to standard output. The rest of the knowledge you need is available for free from Web documents or from any one of a number of books on HTML and CGI. CGI is a client-server mechanism that actually works. Heavy-duty CGI programming languages such as Python and Perl have tools and libraries to assist you with the task.

I should also mention the issue of security. If your HTTP daemon is accessible by potentially hostile users, your CGI scripts could provide an avenue for them to attack you. Hostile users might try to supply malicious parameters to your CGI scripts. For example, by using special shell characters such as back quotes and semicolons, they might be able to get the script to execute arbitrary commands. The only way to prevent this is to carefully examine all input parameters for anything suspicious. For example, vh-man2html can be passed the full file name of a man page; however, it doesn't just accept and return any file name it is passed; it accepts only those file names present within the man hierarchy. The program also makes sure the file name does not contain relative references such as “..” (the parent directory), and removes any suspect characters such as back quotes that might be used to embed commands in the parameter list.

In languages like C, where memory bounds checking is lacking, the length of the input arguments should be constrained to fit within the space allocated for them. Otherwise the caller may be able to write beyond the allocated space into other data and change the behavior of the program to his/her advantage (e.g., change a command the program executes from gzip to rm).

To help check that long input parameters wouldn't threaten vh-man2html's integrity, I borrowed some time on an SGI box and built vh-man2html with Parasoft's Insight bounds checker. Insight pre-processes a C or C++ program adding array bounds checking, memory leak detection and many other checks. One of the reasons I'm mentioning Insight is that Parasoft's Web site, http://www.parasoft.com/, lists Linux as a supported platform.
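
The checks described above can be sketched in a few lines of Python. This is not the code in man2html.c: the names MAN_ROOTS, SUSPECT_CHARS, MAX_PATH_LENGTH and is_safe_man_path are invented for the example, the list of hierarchies is a placeholder, and the sketch rejects a suspicious name outright where the text says man2html removes the suspect characters. It also checks only the path prefix, not whether the file exists.

```python
import posixpath

# Hypothetical list of man hierarchies; a real program would read these
# from its configuration.
MAN_ROOTS = ["/usr/man", "/usr/local/man"]
SUSPECT_CHARS = set("`;|&$<>\\\"'\n")
MAX_PATH_LENGTH = 1024  # arbitrary illustrative limit


def is_safe_man_path(path, roots=MAN_ROOTS):
    """Return True only for plain paths that stay inside a man hierarchy."""
    if len(path) > MAX_PATH_LENGTH:
        return False
    if any(ch in SUSPECT_CHARS for ch in path):
        return False
    if ".." in path.split("/"):
        return False  # no climbing out via the parent directory
    normalized = posixpath.normpath(path)
    return any(normalized.startswith(root + "/") for root in roots)


print(is_safe_man_path("/usr/man/man1/ls.1"))
print(is_safe_man_path("/usr/man/../../etc/passwd"))
```

Rejecting is the more conservative choice for a sketch; whether to reject or to strip is a design decision that depends on how forgiving you want the script to be.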

vh-man2html includes four CGI programs. They all generate interdependent HTTP references to each other.

Man page to HTML translation is handled by the man2html C program. The Unix man pages are marked up with man or BSD mandoc tags, which are nroff/troff macros. The bulk of the program is a series of large case statements and table lookups that attempt to cope with all the possible macros.

Listing 2 shows a typical nroff/troff marked-up manual page that uses the man macro package. The macros use a full stop (i.e., a period) as a lead-in to a one- or two-character macro name. troff/nroff uses two-character macro names; apparently they fit nicely into the 16-bit word size of old Unix platforms such as the PDP-11 (at least that's what I was told), a trick that man2html.c still uses. Some of the macros can be directly translated to appropriate HTML tags; for example, lines beginning with “.SH” (section headings) are translated directly to HTML <H1> headings.

Many troff tags limit their arguments and effects to just one line and have no corresponding end tag, whereas many of the equivalent HTML constructs require an end tag. For example, the text following a troff “.SH” section heading tag needs to be enclosed in a pair of HTML heading level 1 tags, e.g., “<H1>text</H1>”. Other troff tags with a larger scope, such as many kinds of lists, have both begin and end tags, which makes translation to HTML very easy.
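
To illustrate the one-line case, here is a hedged Python sketch of the “.SH” translation. man2html itself is written in C and handles far more than this; the function name translate_heading is invented for the example, and the handling of quoted arguments is a simplification.

```python
import html


def translate_heading(line):
    """Translate a one-line ".SH" request into a paired HTML heading.

    Returns None for lines that are not section headings.
    """
    parts = line.split(None, 1)
    if not parts or parts[0] != ".SH":
        return None
    # The heading text is the rest of the line; drop optional quotes.
    text = parts[1].strip().strip('"') if len(parts) > 1 else ""
    return "<H1>" + html.escape(text) + "</H1>"


print(translate_heading(".SH NAME"))
print(translate_heading('.SH "SEE ALSO"'))
```

Because the troff tag ends at the end of the line, the translator knows exactly where to put the closing </H1>.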

One tricky issue is dealing with multiple troff tags on one line; for example, tags that imply bracketing of following text or font changes. In order to correctly place bracketing, the translator can work recursively within a line. For example, the BSD mandoc sequence for a command option called -b with an argument called bcc-addr is expressed in troff as:

 .Op Fl b Ar bcc-addr

which indicates the reader should see:

[ -b bcc-addr ]

where b is in bold and bcc-addr is in italics. The corresponding HTML is:

[ -<B>b</B> <I>bcc-addr</I> ]

By using recursion on hitting the Op tag, we can get the square brackets on the beginning and end of the entire line.
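
A toy Python version of this recursion, covering only the three tags in the example, might look like the following. It is a sketch of the idea rather than a transcription of man2html.c: the function names are invented, and it assumes each Fl or Ar takes exactly one argument, which real mandoc does not guarantee.

```python
import html


def translate_words(words):
    """Recursively translate a list of mandoc words into HTML fragments."""
    if not words:
        return []
    head, rest = words[0], words[1:]
    if head == "Op":
        # Translate the remainder first, then wrap the result in brackets.
        return ["["] + translate_words(rest) + ["]"]
    if head == "Fl" and rest:
        return ["-<B>" + html.escape(rest[0]) + "</B>"] + translate_words(rest[1:])
    if head == "Ar" and rest:
        return ["<I>" + html.escape(rest[0]) + "</I>"] + translate_words(rest[1:])
    return [html.escape(head)] + translate_words(rest)


def translate_line(line):
    """Translate one mandoc request line, dropping the leading period."""
    return " ".join(translate_words(line.strip().lstrip(".").split()))


print(translate_line(".Op Fl b Ar bcc-addr"))
```

The closing bracket is appended only after the recursive call returns, which is what places it at the very end of the line.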

There are some troff tags whose effect is terminated by tags of equal and higher rank; in these cases, the translator must remember its context and generate any necessary terminating HTML. Nested lists are also possible. In these situations man2html has to maintain a stack of outstanding nestings that have to be completed when a new equal or higher element is encountered.
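
The stack idea can be sketched in Python as follows. The rank numbers and the HTML tags in the usage lines are purely hypothetical, chosen to show the mechanism; they are not the ranks or tags man2html actually uses.

```python
def open_element(stack, out, rank, open_tag, close_tag):
    """Open an element, first closing anything of equal or lower rank.

    A smaller rank number means a higher-ranking element.
    """
    while stack and stack[-1][0] >= rank:
        out.append(stack.pop()[1])  # emit the outstanding closing tag
    out.append(open_tag)
    stack.append((rank, close_tag))


def close_all(stack, out):
    """Emit the closing tags still outstanding at the end of the page."""
    while stack:
        out.append(stack.pop()[1])


stack, out = [], []
open_element(stack, out, 1, "<DL>", "</DL>")
open_element(stack, out, 2, "<UL>", "</UL>")   # nested inside the <DL>
open_element(stack, out, 1, "<DL>", "</DL>")   # closes both of the above
close_all(stack, out)
print(" ".join(out))
```

Each entry on the stack remembers the HTML needed to finish it, so the translator never has to look back through the page to work out what is still open.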

I admire Richard's dedication in methodically building up translations of all of the tags. Adding in the BSD mandoc tags proved to be a painful experience, and in the end, the only way to get it right was to convert every BSD mandoc page I could find and pipe the output to weblint (an excellent HTML checker). For example, in tcsh/csh:

foreach i ( `egrep -l '^\.Bl' /usr/man/man1/* \
        /usr/man/man8/*` )
    /home/httpd/cgi-bin/man2html $i > tmp/`basename $i`
end
weblint tmp/*

If you want to sample the spectrum of mandoc translation, look at any of the pages the above egrep locates—telnet, lpc and mail are good examples.

man2html.c also has to do minor translation fix ups, such as translating quotation marks and other special punctuation into HTML special characters.

In the end, to test the sturdiness of the translator, I converted every man page I have:

find /usr/man/man* -name '*.[0-9]' \
  -printf "echo %p; /home/httpd/cgi-bin/man2html\
        %p | weblint -x netscape -\n" | sh \
        |& tee weblint.log

These tests proved quite useful in exposing bugs.

The program also has to navigate the man directory hierarchies and generate lists of references to pages that might be relevant (e.g., a page with the same name might be present in multiple man hierarchies). The list of man hierarchies to be consulted is read from /etc/man.config, which is the standard configuration file for the man-1.4 package that ships with Red Hat and Caldera. This configuration file is also consulted for details on how to process man pages that have been compressed with gzip or other compression programs.
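
As a rough illustration, the Python sketch below pulls both kinds of information out of configuration text. It assumes the file uses “MANPATH directory” lines for hierarchies and “.suffix command” lines for decompressors; that layout and the sample text are my assumptions for the example, so check them against your own /etc/man.config. The function name parse_man_config is invented.

```python
def parse_man_config(text):
    """Extract man hierarchies and decompressor commands from config text."""
    manpaths = []
    decompressors = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split(None, 1)
        if fields[0] == "MANPATH" and len(fields) == 2:
            manpaths.append(fields[1].strip())
        elif fields[0].startswith(".") and len(fields) == 2:
            decompressors[fields[0]] = fields[1].strip()
    return manpaths, decompressors


# Illustrative fragment only, not copied from a real system.
SAMPLE = """\
# sample configuration
MANPATH /usr/man
MANPATH /usr/local/man
.gz /bin/gzip -c -d
"""

print(parse_man_config(SAMPLE))
```

With the two results in hand, a translator can search each hierarchy in turn and pick a decompressor by looking at the suffix of the file it finds.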

man2html could have easily been written in Python or Perl, but you can't beat C for speed. man2html is fast enough on my 486 that I didn't think caching its output was worthwhile; each page is just regenerated on demand. However, if I were going to provide man pages from a server for a large number of high-frequency users, I would probably pre-generate all the man pages as a static document set.

Two awk scripts, manwhatis and mansec, generate name-title and name-only indexes for man sections and cache them in /var/man2html. manwhatis locates the whatis files and translates them into the desired section index. It rebuilds the cache if any whatis file has been updated since the cached version was generated. The script divides the whatis file alphabetically and constructs an alphabetic index to the HTML document, so that the user can quickly jump to the section of the alphabet they're interested in.

mansec traverses the man hierarchy to build up a list of names; it rebuilds its cache if any of the directories in the hierarchies have been updated. mansec has to use the sort command to get the names it finds into alphabetical order. It also builds an alphabetic quick index just like manwhatis.
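
The rebuild test both scripts perform comes down to comparing modification times. The real scripts are awk; the Python sketch below shows the same idea, with the function name cache_is_stale and its arguments invented for the example.

```python
import os


def cache_is_stale(cache_path, source_paths):
    """Return True if the cache is missing or older than any source."""
    try:
        cache_mtime = os.path.getmtime(cache_path)
    except OSError:
        return True  # no cache yet, so it must be built
    for source in source_paths:
        try:
            if os.path.getmtime(source) > cache_mtime:
                return True
        except OSError:
            continue  # a source that vanished cannot be newer
    return False
```

For manwhatis the sources would be the whatis files; for mansec they would be the directories in the man hierarchies, whose modification times change when pages are added or removed.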

Both manwhatis and mansec accept an argument that indicates which section to index. They have to check the argument for anything potentially malicious and return a document containing an error message if they find anything they weren't expecting:

section = ARGV[1];           # must be 0-9 or all.
if (section !~ /^[0-9]$/ && section != "all") {
 # Anything else is rejected with an HTML error page.
 print "Content-type: text/html\n\n";
 print "<head>";
 print "<title>Manual - Illegal section</title>";
 print "</head>";
 print "<body>";
 print "Illegal section number '" section "'." ;
 print "Must be 0..9 or all";
 print "</body>";
 exit;                       # stop before doing any indexing
}

The mansearch script is an awk front end to the Glimpse search utility. It accepts user input, which it passes on to Glimpse, so I had to be careful to include code to check the input for safety before invoking Glimpse. This basically means excluding any shell special characters or making sure they can't do anything by quoting them appropriately. For example, in awk we can silently ignore any characters that we aren't willing to accept:

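# Substitute "" for any char not in A-Za-z0-9
# space. gsub edits string in place and returns
# the number of substitutions, so its result
# must not be assigned back to string.
gsub(/[^A-Za-z0-9 ]/, "", string);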

I chose awk over Python and Perl mainly because it is small, widely available and adequate for the task. Note that I'm using the post-1985 "new awk". For larger, more complex CGI scripts I'd probably use Python (if I had to start again without Richard's work, I think man2html would be a Python script).

In order to make vh-man2html usable remotely, I changed man2html and my scripts to generate HTTP references that were relative to the current server. For example, I used:

<A HREF="http:/cgi-bin/man2html">Return to Main

rather than

<A HREF="http://localhost/cgi-bin/man2html">Return to Main

which works fine except for “redirects”. A redirect is a small document output by a CGI script. This is an example redirect:

Location: http://sputnik3/cgi-bin\

A redirect has no context, so the host has to be specified. man2html generates redirects when a user enters an approximate name such as “message 1”. The redirect corrects this to a full reference such as the one above. The server name is obtained from one of the many environment variables that an HTTP server normally sets before invoking a CGI script.
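
For illustration, here is how a redirect might be built in Python. I am assuming the standard CGI variables SERVER_NAME and SERVER_PORT are the ones consulted; the text does not say which variable man2html reads. The function name build_redirect, the host and the path in the usage line are made up.

```python
import os


def build_redirect(environ, path):
    """Return a CGI redirect naming the host the server says it is."""
    host = environ.get("SERVER_NAME", "localhost")
    port = environ.get("SERVER_PORT", "80")
    if port != "80":
        host = host + ":" + port  # only mention a non-default port
    # As with any CGI header, a blank line ends the redirect.
    return "Location: http://" + host + path + "\n\n"


if __name__ == "__main__":
    # Under a web server, os.environ carries the variables it has set.
    print(build_redirect(os.environ, "/cgi-bin/man2html"), end="")
```

Passing the environment in as a parameter keeps the function easy to try out from a shell with a plain dictionary.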