Fixing HTML with the WDG HTML Validator

by Don Marti

Apple's decision to promote KDE's KHTML rendering engine and, by extension, the KHTML-based Konqueror to Major Browser status makes web standards important again. Webmasters who test only in Microsoft Internet Explorer are going to have Macintosh users, not only Linux users, complaining about broken HTML.

Fortunately, the new Apple browser generates a confusing User-Agent header, which helps discourage "browser sniffing". Better to make your site correct, anyway.

If your site obeys the standards and a browser messes it up, you can count on the now-competitive browser developers to fix it. If your site is incorrect, expect complaints.

So, how do you make sure your site is valid HTML and not simply cut-and-pasted, "looks fine to me" HTML?

Liam Quinn's WDG HTML Validator is written in the nearly ubiquitous Perl and runs as a CGI script, so you can install it on one system and use it from anywhere. You don't need to install it on your production web server; any system on the Net will do.

You can try out the Validator on the WDG site, but if you have a lot of pages to fix, it's faster and more polite to install it at your own site. WDG also has a nifty set of HTML tag reference pages, linked to and from the Validator results, that help you understand and fix your mistakes. I installed it in minutes from the Debian packages; RPMs also are available.

How Well Does It Work?

I dropped in the URL of my fresh, clean personal home page, created with stylesheets in my best attempt at HTML 4.01 Strict, and foolishly expected it to validate cleanly. No way. The Validator started complaining at the <body> tag.

Error: there is no attribute BGCOLOR for this element.

What? I've been putting bgcolor in body tags almost as long as I've been writing HTML! Time to hit the book, Dynamic HTML: The Definitive Reference, 2nd Edition, and see what's up. Aha! This attribute is deprecated in HTML 4.01, and I'm using the "strict" DTD, so it's time to move bgcolor to the stylesheet where it belongs. It's not a big thing, but it makes the actual page a little smaller and lets me change all the colors in one place.

body {
    background-color: #aaaaaa;
}

Now, next time Talk Like a Pirate Day rolls around, I can change everything to white text on a black background with a single edit to the stylesheet, then concentrate on me prose, mateys.
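If you'd rather hunt down leftover presentational attributes before the Validator does, a few lines of Python can list them. This is only a rough sketch, not part of the WDG Validator: the class and function names are made up for illustration, and the attribute set is the group of <body> attributes that HTML 4.01 marks as deprecated.

```python
from html.parser import HTMLParser

# Presentational <body> attributes that HTML 4.01 marks as deprecated.
DEPRECATED_BODY_ATTRS = {"background", "bgcolor", "text", "link", "vlink", "alink"}


class BodyAttrFinder(HTMLParser):
    """Collect deprecated presentational attributes found on <body>."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (attribute, value) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag == "body":
            for name, value in attrs:
                if name in DEPRECATED_BODY_ATTRS:
                    self.found.append((name, value))


def find_deprecated_body_attrs(html_text):
    """Return the deprecated <body> attributes in html_text (hypothetical helper)."""
    finder = BodyAttrFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found


if __name__ == "__main__":
    sample = '<html><body BGCOLOR="#aaaaaa" text="#000000"><p>Hi</p></body></html>'
    for name, value in find_deprecated_body_attrs(sample):
        print(f"move {name}={value!r} to the stylesheet")
```

Each attribute it reports is a candidate for a rule in the stylesheet instead.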

But what's this? My page has a link for easily catching up on the Linux-related news. But the Validator says:

Error: general entity scoring not defined and no default entity

Fortunately, that's in the Common Problems section. Time to replace that ampersand in the link with an &amp; entity. Here's another one:
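The complaint arises because a bare ampersand followed by a name inside an href looks like the start of an entity reference, so a query parameter called scoring gets read as an entity named "scoring". If your links are produced by a script, escaping at generation time avoids the problem. Here is a minimal sketch using Python's standard html.escape; the URL and the href_attr helper are hypothetical, and only the parameter name "scoring" comes from the error message above.

```python
from html import escape


def href_attr(url):
    """Return url escaped for use inside a double-quoted HTML attribute."""
    return escape(url, quote=True)


if __name__ == "__main__":
    # Hypothetical URL; only the "scoring" parameter name appears in the article.
    url = "http://news.example.com/search?topic=linux&scoring=recent"
    print(f'<a href="{href_attr(url)}">Linux news</a>')
```

The browser turns &amp; back into a plain ampersand before requesting the page, so the link still works the same way.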

Error: element NOBR undefined

I hit the book, and it turns out the <nobr> tag was never standardized at all; it's "folk HTML" that browsers happen to recognize. In this case, I'll delete the tags, chill and let the browser flow the text the way it wants.

The next item on the list of HTML mistakes occurred on line 129:

Error: end tag for TT omitted, but its declaration does not permit this

followed by this on line 131:

Error: end tag for element TT which is not open

Aha! This plainly is sloppy HTML. I had a <tt> started inside a <p>, but the </tt> was after the </p>. It looks fine in the browser I use, but this is exactly the kind of mistake that makes different browsers react differently. Remember, most of the significant differences among browsers are in how they react to mistakes and not how they deal with correct HTML. Before you start sniffing User-Agent and such ugliness, make sure your pages are standard.
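The overlapping <tt> and <p> above is the sort of thing a simple stack can catch: push each start tag, and complain when an end tag doesn't match the element opened most recently. The sketch below is my own illustration built on Python's html.parser, not how the Validator works. The Validator checks against the DTD, while this only checks nesting and will also grumble about end tags that HTML allows you to omit, such as </li>. The names NestingChecker and check_nesting are made up.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML 4.01.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class NestingChecker(HTMLParser):
    """Report end tags that do not match the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, line) for each element still open
        self.problems = []  # human-readable messages

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        line = self.getpos()[0]
        if tag in [name for name, _ in self.stack]:
            # Anything opened after this tag is still open: close it and complain.
            while self.stack[-1][0] != tag:
                inner, opened = self.stack.pop()
                self.problems.append(
                    f"line {line}: </{tag}> closes <{inner}> opened on line {opened}")
            self.stack.pop()
        else:
            self.problems.append(f"line {line}: </{tag}> has no matching open tag")


def check_nesting(html_text):
    """Return a list of nesting complaints for html_text (hypothetical helper)."""
    checker = NestingChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    for message in check_nesting("<p>Run <tt>make install\n</p>\n</tt>\n"):
        print(message)
```

On the sample, it produces one complaint at the </p> and a second at the stray </tt>, which lines up with the pair of errors the Validator reported.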

Next, in a line with a <blockquote> tag, there's

Error: character data is not allowed here

Checking the reference page linked to from the Validator results, here's the problem: "The content of the BLOCKQUOTE element should be contained within other block-level elements, typically P." Time to make sure that instead of using <blockquote>, I'm using <blockquote><p>.
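To find other places where text sits directly inside a <blockquote>, the same html.parser approach works: track which elements are open and flag any non-blank text whose immediate parent is blockquote. Again, this is a sketch under my own assumptions rather than anything the Validator provides. The names are hypothetical, and it does not catch inline elements such as <em> placed directly in the blockquote, which HTML 4.01 Strict also rejects.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class BlockquoteTextFinder(HTMLParser):
    """Flag text that sits directly inside <blockquote>, outside any child element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the elements currently open
        self.lines = []  # line numbers where a stray text run starts

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and self.stack[-1] == "blockquote" and data.strip():
            self.lines.append(self.getpos()[0])


def find_bare_blockquote_text(html_text):
    """Return line numbers of text placed directly in a blockquote (hypothetical helper)."""
    finder = BlockquoteTextFinder()
    finder.feed(html_text)
    finder.close()
    return finder.lines


if __name__ == "__main__":
    print(find_bare_blockquote_text("<blockquote>Quoted words</blockquote>"))
    print(find_bare_blockquote_text("<blockquote><p>Quoted words</p></blockquote>"))
```

Every line it reports is a spot where the quoted text needs a <p> around it.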

After a few more errors, the process gets tedious. Why didn't I validate this thing to start with and fix errors one at a time? Why did I write a quick-and-dirty conversion script that wasn't careful about matching <p> and </p>? A quick detour to a friend's page shows his first error on line 1. Ha! It's not only me.

All along, though, the Validator output makes it easy to track down the problems. Using Mozilla tabs, I can pop between the page in the browser and the Validator results.

Finally, the "Congratulations, no errors!" message appeared. Fixing a personal home page is a tiny amount of work compared to repairing damage in a deep, automatically-generated site. When writing software to crank out HTML automatically, it's worth the extra time to feed the output through the Validator to make sure it's right from the ground up, instead of having to troubleshoot when a new browser, or new version, comes along.


Don Marti is editor in chief of Linux Journal.
