At the Forge - Checking Your HTML
We say that Tim Berners-Lee invented the World Wide Web, and that's certainly true. But, we can boil the Web down to three specific technologies: URLs (for uniquely identifying resources on the Internet), HTTP (a stateless protocol for transmitting documents) and HTML (a markup language). Each of these inventions was simple to understand, as well as simple to implement. And, it is this combination of simplicity and elegance that has made the Web the success that it is.
All three of these technologies have evolved over the years, reflecting new uses and needs. For example, HTTP now supports a system of “headers” in both the request and response, which can do everything from indicate the content type of the response body to provide hints regarding how long the data should be cached.
HTML has grown up quite a bit as well, evolving to become a truly semantic markup language (with styling information moved to external CSS documents) with a more rigorous and standardized definition. Standardization has made HTML slightly harder to write, in that you need to be more careful about items, such as tag names (keeping them lowercase), attributes (because not all are valid in all contexts) and closing tags. One advantage to such standardization is that we now can predict to a much greater degree what pages will look like across different browsers. Sloppy HTML means that the browser has to decide what you meant, which can have consequences that vary widely in their influence on the way the page looks.
Although HTML validation might seem boring, it's actually an essential part of getting AJAX-powered, latest-paradigm, super-fancy Web sites to work. This month, I review a few tools I use to make sure the HTML I create is as standards-compliant as usual. I begin with some simple, manual tests that can run on individual pages. Then, I show some automated tools I use when developing applications in Ruby on Rails, allowing me to check the HTML of all of my pages en masse, including those that require password protection to access.
Before continuing, it's important to realize that HTML is a catchall term for many different, related markup languages. And, when I say markup, I mean that HTML is a language used to describe a text, identifying its different parts. For instance, a newspaper article will have a headline, one or more authors, one or more paragraphs of text, zero or more photographs, and one or more captions per photograph. A markup language doesn't add content to a document, but rather describes the individual parts of the document, so that they can be laid out and displayed in an appropriate way. In this sense, HTML is a direct descendant of SGML, a markup language that was developed many years previously, but which was far more difficult to work with.
Although there have been several versions of HTML over the years, let's focus on the ones that are most widely used today. Perhaps the most common version of HTML is an unstructured, unversioned, nonstandard document. I'm certainly guilty of creating many such documents, which look like this:
<html> <head> <title>This is the title</title> </head> <body> <h1>This is the headline</h1> <p>This is a paragraph</p> <p>This is another paragraph</p> </body> </html>
Nothing is wrong, per se, with the above document. But, because it fails to indicate which version of HTML it is using, browsers must make a variety of assumptions. These assumptions can make it hard to predict how different browsers will operate, using something known as quirks mode.
Fortunately, we can choose a standard implementation and indicate that to a browser by adding a DOCTYPE declaration at the top of the document. When assigning the value of DOCTYPE, you need to decide whether you will use HTML or XHTML (that is, an XML-compliant version of HTML), and whether you want the strict, transitional or frameset variety of that markup language.
The strict version of each markup language is the ideal version that allows no styling elements. On a modern site, such styling should be defined in CSS, not in HTML. However, it may be difficult for some sites to comply with the strict definition, either because their authoring tools use tags that aren't allowed in the strict definition, or because the site's authors want to use forbidden elements, such as those for embedded Flash. To make the transition to strict HTML easier, the standards allow for transitional HTML, which provides a larger number of tags.
Let's define our tiny document as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>This is the title</title> </head> <body> <h1>This is the headline</h1> <p>This is a paragraph</p> <p>This is another paragraph</p> </body> </html>
The <!DOCTYPE> declaration at the top of the page tells browsers (and any other programs that might try to parse the page) that we want to follow the standards, but that we'll do so using the transitional declaration.
Once we have indicated our willingness to apply the transitional standard, we may discover that our documents are no longer valid. For example, if I include an image in my HTML document:
With the above line inserted into my document, it is no longer valid, because it is missing an alt attribute. Once I add that attribute, the document is valid:
<img src="/images/foo.jpeg" alt="foo">
However, we can get even better results if we enforce XML considerations and declare our document to be XHTML transitional. To do that, we modify not only the !DOCTYPE declaration, but also the <html> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>This is the title</title> </head> <body> <h1>This is the headline</h1> <p>This is a paragraph</p> <p>This is another paragraph</p> <img src="/images/foo.jpeg" alt="foo"> </body> </html>
Suddenly, our document is invalid again. Because we have declared it to be XHTML transitional, we need to follow XML rules. We need to close our <img> tag, most easily accomplished by using the self-closing syntax:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>This is the title</title> </head> <body> <h1>This is the headline</h1> <p>This is a paragraph</p> <p>This is another paragraph</p> <img src="/images/foo.jpeg" alt="foo" /> </body> </html>
With that in place, our document is now valid. As you can imagine, finding all the problems that might occur in a document can be difficult, even for someone who is trained and experienced. Trying to check all the pages on a site, particularly one that contains hundreds or thousands of pages, clearly would be impossible.
The solution, then, is to have a program check the pages' validity automatically, preferably as part of your automated tests. This way, you can discover when you have problems quickly and easily.
Web Development News
- Vigilante Malware
- Disney's Linux Light Bulbs (Not a "Luxo Jr." Reboot)
- Libreboot on an X60, Part I: the Setup
- Dealing with Boundary Issues
- Bluetooth Hacks
- System Status as SMS Text Messages
- Vagrant Simplified
- October 2015 Video Preview
- October 2015 Issue of Linux Journal: Raspberry Pi
- Non-Linux FOSS: Code Your Way To Victory!