Semantic Web Publishing with RDFa
I learned UNIX from a real old-style guru named Jimmy who memorized microchip numbers and used sed as a word processor. Wanting to do well on my first job, I proudly showed him how I was putting detailed comments in my code. My mentor was not impressed. “Why are you doing that?” he shot at me, going on to explain that neither comments nor docs could be trusted. If you wanted to understand what the code was doing, you better read the code. Software project managers might not agree, but Jimmy did have a point. Docs and comments can become out of date or inaccurate, but the code can't. Broken, yes. Inaccurate, no.
A similar issue arises when writing a Web page that is intended to be read by humans and parsed by machines. New sophisticated search engines on the horizon will be hungry for semantic content—that is, for data that can be machine-parsed for meaning. Often the format will be some form of RDF, or Resource Description Framework. If you are publishing Web pages in order to share your data with the world, it follows that you want to make it available to both humans and search engines. Generating two sets of files, one human-readable HTML and another machine-parsable RDF, means that you give up the ability to hand-edit your HTML files to make corrections and sets up your site for likely inconsistency down the road—not to mention that full-on RDF/XML is verbose and ugly.
Enter RDFa, a lightweight relatively new mechanism for embedding structured data into HTML in a simple but fully standards-compliant way. I run a Web site that is generated from templates. To understand how RDFa might fit in to my site, I started with a simple manually created example: an event schedule for the local rodeo. Later in this article, I also briefly cover some of the emerging tools that automate RDF and RDFa and describe how one company has created a large-scale RDF implementation to solve enterprise problems. Now, here's the example.
My original sample code looked like this, in vanilla HTML:
<div> <h1>Saturday Rodeo Schedule 2/22/08</h1> <div> 2:00PM : Bull Riding </div> </div>
It's pretty straightforward and clear to the human reader of the Web page or even someone editing the source, but it's meaningless to a search engine. To make this event clear to an RDFa-parsing engine, my first step was to pick a vocabulary that has well-defined terms for events. Luckily, there is just such a vocabulary, based on the iCalendar standard for calendar data. The vocabulary or vocabularies used in a document are specified right in the <html> tag at the start of the document:
<html xmlns="http://www.w3.org/1999/xhtml" >
The xmlns stands for XML NameSpace, and cal is the shorthand name we'll use to refer to this namespace further down. The http://www.w3.org/2002/12/cal/ical# is the URL to the RDF vocabulary file, and http://www.w3.org/1999/xhtml is the URL for the standard XML namespace that you might already be including in your documents. I explain further on discovering those and deciding which to use in a bit. Applying a bit of RDFa using basic iCal properties, we have this:
<div id=RodeoSchedule2008> <h1>Saturday Rodeo Schedule 2/22/08</h1> <div rel="cal:Vevent"> <span property="cal:dtstart" content="20080222T1400-0700">2:00PM</span> : <span property="cal:summary">Bull Riding</span> </div> </div>
From the browser's point of view, the HTML layout is unchanged. If desired, class= properties could be added for CSS formatting and would not impact the RDFa logical structure. This is different from the microformat hCalendar (another popular way of representing calendar data in HTML), in which fixed class names are assigned.
One last step alerts parsers to the presence of RDFa in our document and also specifies the encoding or character set used. We add the following lines at the very beginning of the file, before the <html..> tag:
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
Now, any application that understands RDFa can scan your Web page and learn that there is an event called Bull Riding occurring on February 22, 2008 at 2:00 PM PST. In fact, you can verify that you've communicated correctly with the RDFa world by using any of a number of validating/parsing services. Using the Python-based service at www.w3.org/2007/08/pyRdfa, called RDFa Distiller, we can see that the above snippet produces the following semantic data, in what is called the N3 format:
@prefix cal:<http://www.w3.org/2002/12/cal/ical#>. @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>. [ a cal:Vevent; cal:dtstart "20080222T1400-0700"; cal:summary "Bull Riding"].
N3 is a shorthand that people who work heavily in the RDF world like to use for writing and representing the triples that compose the Semantic Web.
Special Reports: DevOps
- Hash Tables—Theory and Practice
- The Ubuntu Conspiracy
- Making a PHP Site on Linux Work with a Microsoft SQL Server Database
- A First Look at IBM's New Linux Servers
- Vigilante Malware
- Disney's Linux Light Bulbs (Not a "Luxo Jr." Reboot)
- Vagrant Simplified
- Dealing with Boundary Issues
- System Status as SMS Text Messages
- Bluetooth Hacks