At the Forge - Checking Your HTML

Integrate HTML validation into your test suite for better HTML from the get-go.

We say that Tim Berners-Lee invented the World Wide Web, and that's certainly true. But, we can boil the Web down to three specific technologies: URLs (for uniquely identifying resources on the Internet), HTTP (a stateless protocol for transmitting documents) and HTML (a markup language). Each of these inventions was simple to understand, as well as simple to implement. And, it is this combination of simplicity and elegance that has made the Web the success that it is.

All three of these technologies have evolved over the years, reflecting new uses and needs. For example, HTTP now supports a system of “headers” in both the request and response, which can do everything from indicate the content type of the response body to provide hints regarding how long the data should be cached.

HTML has grown up quite a bit as well, evolving to become a truly semantic markup language (with styling information moved to external CSS documents) with a more rigorous and standardized definition. Standardization has made HTML slightly harder to write, in that you need to be more careful about items, such as tag names (keeping them lowercase), attributes (because not all are valid in all contexts) and closing tags. One advantage to such standardization is that we now can predict to a much greater degree what pages will look like across different browsers. Sloppy HTML means that the browser has to decide what you meant, which can have consequences that vary widely in their influence on the way the page looks.

More significant, the rise of AJAX as a paradigm for Web development has made it increasingly important that HTML be well formed. Many AJAX-related routines need to modify a particular element on the page in some way. The easiest way to do this typically is to grab the element via its id attribute, which is guaranteed to be unique. (If you want more than one element to use an ID, you really should use a class instead.) In the last few months, I have worked on a number of pages that had duplicate ID attributes. Sometimes this was the result of a simple mistake, and sometimes it resulted from ignorance on the part of a Web designer. But in all cases, this meant that my JavaScript performed differently from what I expected.

Although HTML validation might seem boring, it's actually an essential part of getting AJAX-powered, latest-paradigm, super-fancy Web sites to work. This month, I review a few tools I use to make sure the HTML I create is as standards-compliant as usual. I begin with some simple, manual tests that can run on individual pages. Then, I show some automated tools I use when developing applications in Ruby on Rails, allowing me to check the HTML of all of my pages en masse, including those that require password protection to access.

HTML Standardization

Before continuing, it's important to realize that HTML is a catchall term for many different, related markup languages. And, when I say markup, I mean that HTML is a language used to describe a text, identifying its different parts. For instance, a newspaper article will have a headline, one or more authors, one or more paragraphs of text, zero or more photographs, and one or more captions per photograph. A markup language doesn't add content to a document, but rather describes the individual parts of the document, so that they can be laid out and displayed in an appropriate way. In this sense, HTML is a direct descendant of SGML, a markup language that was developed many years previously, but which was far more difficult to work with.

Although there have been several versions of HTML over the years, let's focus on the ones that are most widely used today. Perhaps the most common version of HTML is an unstructured, unversioned, nonstandard document. I'm certainly guilty of creating many such documents, which look like this:


<html>
<head>
    <title>This is the title</title>
</head>
<body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
</body>
</html>

Nothing is wrong, per se, with the above document. But, because it fails to indicate which version of HTML it is using, browsers must make a variety of assumptions. These assumptions can make it hard to predict how different browsers will operate, using something known as quirks mode.

Fortunately, we can choose a standard implementation and indicate that to a browser by adding a DOCTYPE declaration at the top of the document. When assigning the value of DOCTYPE, you need to decide whether you will use HTML or XHTML (that is, an XML-compliant version of HTML), and whether you want the strict, transitional or frameset variety of that markup language.

The strict version of each markup language is the ideal version that allows no styling elements. On a modern site, such styling should be defined in CSS, not in HTML. However, it may be difficult for some sites to comply with the strict definition, either because their authoring tools use tags that aren't allowed in the strict definition, or because the site's authors want to use forbidden elements, such as those for embedded Flash. To make the transition to strict HTML easier, the standards allow for transitional HTML, which provides a larger number of tags.

Let's define our tiny document as follows:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
    <title>This is the title</title>
</head>
<body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
</body>
</html>

The <!DOCTYPE> declaration at the top of the page tells browsers (and any other programs that might try to parse the page) that we want to follow the standards, but that we'll do so using the transitional declaration.

Once we have indicated our willingness to apply the transitional standard, we may discover that our documents are no longer valid. For example, if I include an image in my HTML document:


<img src="/images/foo.jpeg">

With the above line inserted into my document, it is no longer valid, because it is missing an alt attribute. Once I add that attribute, the document is valid:


<img src="/images/foo.jpeg" alt="foo">

However, we can get even better results if we enforce XML considerations and declare our document to be XHTML transitional. To do that, we modify not only the !DOCTYPE declaration, but also the <html> tag:


<!DOCTYPE html
          PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>This is the title</title>
  </head>
  <body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
    <img src="/images/foo.jpeg" alt="foo">
  </body>
</html>

Suddenly, our document is invalid again. Because we have declared it to be XHTML transitional, we need to follow XML rules. We need to close our <img> tag, most easily accomplished by using the self-closing syntax:


<!DOCTYPE html
          PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>This is the title</title>
  </head>
  <body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
    <img src="/images/foo.jpeg" alt="foo" />
  </body>
</html>

With that in place, our document is now valid. As you can imagine, finding all the problems that might occur in a document can be difficult, even for someone who is trained and experienced. Trying to check all the pages on a site, particularly one that contains hundreds or thousands of pages, clearly would be impossible.

The solution, then, is to have a program check the pages' validity automatically, preferably as part of your automated tests. This way, you can discover when you have problems quickly and easily.

______________________

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState