At the Forge - Checking Your HTML
We say that Tim Berners-Lee invented the World Wide Web, and that's certainly true. But, we can boil the Web down to three specific technologies: URLs (for uniquely identifying resources on the Internet), HTTP (a stateless protocol for transmitting documents) and HTML (a markup language). Each of these inventions was simple to understand, as well as simple to implement. And, it is this combination of simplicity and elegance that has made the Web the success that it is.
All three of these technologies have evolved over the years, reflecting new uses and needs. For example, HTTP now supports a system of “headers” in both the request and response, which can do everything from indicate the content type of the response body to provide hints regarding how long the data should be cached.
HTML has grown up quite a bit as well, evolving to become a truly semantic markup language (with styling information moved to external CSS documents) with a more rigorous and standardized definition. Standardization has made HTML slightly harder to write, in that you need to be more careful about items, such as tag names (keeping them lowercase), attributes (because not all are valid in all contexts) and closing tags. One advantage to such standardization is that we now can predict to a much greater degree what pages will look like across different browsers. Sloppy HTML means that the browser has to decide what you meant, which can have consequences that vary widely in their influence on the way the page looks.
More significant, the rise of AJAX as a paradigm for Web development has made it increasingly important that HTML be well formed. Many AJAX-related routines need to modify a particular element on the page in some way. The easiest way to do this typically is to grab the element via its id attribute, which is guaranteed to be unique. (If you want more than one element to use an ID, you really should use a class instead.) In the last few months, I have worked on a number of pages that had duplicate ID attributes. Sometimes this was the result of a simple mistake, and sometimes it resulted from ignorance on the part of a Web designer. But in all cases, this meant that my JavaScript performed differently from what I expected.
Although HTML validation might seem boring, it's actually an essential part of getting AJAX-powered, latest-paradigm, super-fancy Web sites to work. This month, I review a few tools I use to make sure the HTML I create is as standards-compliant as usual. I begin with some simple, manual tests that can run on individual pages. Then, I show some automated tools I use when developing applications in Ruby on Rails, allowing me to check the HTML of all of my pages en masse, including those that require password protection to access.
Before continuing, it's important to realize that HTML is a catchall term for many different, related markup languages. And, when I say markup, I mean that HTML is a language used to describe a text, identifying its different parts. For instance, a newspaper article will have a headline, one or more authors, one or more paragraphs of text, zero or more photographs, and one or more captions per photograph. A markup language doesn't add content to a document, but rather describes the individual parts of the document, so that they can be laid out and displayed in an appropriate way. In this sense, HTML is a direct descendant of SGML, a markup language that was developed many years previously, but which was far more difficult to work with.
Although there have been several versions of HTML over the years, let's focus on the ones that are most widely used today. Perhaps the most common version of HTML is an unstructured, unversioned, nonstandard document. I'm certainly guilty of creating many such documents, which look like this:
<html>
<head>
<title>This is the title</title>
</head>
<body>
<h1>This is the headline</h1>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
</body>
</html>
Nothing is wrong, per se, with the above document. But, because it fails to indicate which version of HTML it is using, browsers must make a variety of assumptions. These assumptions can make it hard to predict how different browsers will operate, using something known as quirks mode.
Fortunately, we can choose a standard implementation and indicate that to a browser by adding a DOCTYPE declaration at the top of the document. When assigning the value of DOCTYPE, you need to decide whether you will use HTML or XHTML (that is, an XML-compliant version of HTML), and whether you want the strict, transitional or frameset variety of that markup language.
The strict version of each markup language is the ideal version that allows no styling elements. On a modern site, such styling should be defined in CSS, not in HTML. However, it may be difficult for some sites to comply with the strict definition, either because their authoring tools use tags that aren't allowed in the strict definition, or because the site's authors want to use forbidden elements, such as those for embedded Flash. To make the transition to strict HTML easier, the standards allow for transitional HTML, which provides a larger number of tags.
Let's define our tiny document as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>This is the title</title>
</head>
<body>
<h1>This is the headline</h1>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
</body>
</html>
The <!DOCTYPE> declaration at the top of the page tells browsers (and any other programs that might try to parse the page) that we want to follow the standards, but that we'll do so using the transitional declaration.
Once we have indicated our willingness to apply the transitional standard, we may discover that our documents are no longer valid. For example, if I include an image in my HTML document:
<img src="/images/foo.jpeg">
With the above line inserted into my document, it is no longer valid, because it is missing an alt attribute. Once I add that attribute, the document is valid:
<img src="/images/foo.jpeg" alt="foo">
However, we can get even better results if we enforce XML considerations and declare our document to be XHTML transitional. To do that, we modify not only the !DOCTYPE declaration, but also the <html> tag:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>This is the title</title>
</head>
<body>
<h1>This is the headline</h1>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
<img src="/images/foo.jpeg" alt="foo">
</body>
</html>
Suddenly, our document is invalid again. Because we have declared it to be XHTML transitional, we need to follow XML rules. We need to close our <img> tag, most easily accomplished by using the self-closing syntax:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>This is the title</title>
</head>
<body>
<h1>This is the headline</h1>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
<img src="/images/foo.jpeg" alt="foo" />
</body>
</html>
With that in place, our document is now valid. As you can imagine, finding all the problems that might occur in a document can be difficult, even for someone who is trained and experienced. Trying to check all the pages on a site, particularly one that contains hundreds or thousands of pages, clearly would be impossible.
The solution, then, is to have a program check the pages' validity automatically, preferably as part of your automated tests. This way, you can discover when you have problems quickly and easily.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
Web Development News
Developer Poll
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
| Introduction to MapReduce with Hadoop on Linux | Jun 05, 2013 |
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Linux Systems Administrator
- Validate an E-Mail Address with PHP, the Right Way
- Introduction to MapReduce with Hadoop on Linux
- RSS Feeds
- Weechat, Irssi's Little Brother
- New Products
- Developer Poll








35 min 29 sec ago
45 min 49 sec ago
50 min 49 sec ago
3 hours 54 sec ago
3 hours 1 min ago
3 hours 47 min ago
4 hours 35 min ago
4 hours 59 min ago
6 hours 36 min ago
6 hours 37 min ago