Necessary Censorship: Web Filtering with Open Source

by Glenn Stone

You're the administrator of a cash-strapped school system and received a note saying you'll finally be able to get the school connected to the Net--as soon as you have a plan to comply with CIPA(1). Or you're out in Corporate America, and when the boss typo-ed a URL, she saw some very interesting pictures on her screen. Or you're simply Joe Penguinhead at home, having had the talk with the spousal unit, and you've decided it's time for Junior to have a computer of his very own. In short, you're now stuck with committing censorship.

In the course of doing research for this article, I ran across pieces from EFF and Peacefire (plus one e-mailed lecture) saying that all censorship is bad and that we should simply educate our children and coworkers on responsible cyber-surfing, blah, blah, blah.... It's true, and in an ideal world, we could do that. Unfortunately, the world is not populated by only responsible adults and well-educated children. And everyone makes a typo once in a while. Thus, we are forced to do something about it.

The Children's Internet Protection Act mandates that a school or library must have an Internet safety policy, must hold a public review of that policy and must use a "technology protection measure" on all computers connected to the Internet. Whether and how that software can be disabled on certain computers is a local decision. The Act does not mandate that the software be perfect; indeed, many of the web pages I read, both administrative and commercial, emphasized that filtering would not be perfect. More on that in a minute.

Corporate firewalls often are quite a bit more aggressive. One colleague I spoke to said his employer blocked not only the naughty bits but also sports, third-party e-mail providers and web comics; in short, almost anything that wasn't work-related.

Then, there are those of us stuck on slow links who simply would like to not have third parties like Doubleclick fouling our bandwidth. This group probably contains a lot more of us than people might think.

The problem with most censorware--aside from the cost and the fact that it more than likely is written for platforms that people reading this would rather not be running--is one of control. Because the software is proprietary, not only do you not have control over what it is you're blocking, you don't even know what's on the blacklist. As I write this article, there is an ongoing lawsuit in Pennsylvania regarding free-speech advocates' ability to access the state's blacklist.

Some situations may call for more blocking, while others require less, but normally, no provision is in place for getting the list changed. Two vendors do allow you to submit a URL for review: N2H2 and Dan's Guardian. While N2H2 does not publicize its entire list, it does have a URL checker. Dan's is even more open, but I'm getting ahead of myself.

So, I said to myself, "Self, if you can't beat them, perhaps it's time to join them." Maybe we need open-source censorware, strange as that may sound, with a publicly available list. It would offer the ability to tinker with both the code and the list to suit the needs of folks who have to do this type of work.

I was stunned by the answer I found: two such animals already are available. One is Dan's Guardian, which I mentioned above; the other is squidGuard, a plug-in for the Squid web proxy. Squid and squidGuard are offered under the GPL, and they are free as in beer as well. I'm getting some funny looks, I know; you'll see why in a minute. Both are apt-gettable for Debian fans. Mandrake folks can get them from the club site or do as Red Hat folks have to do and compile them from source.

One of the items in squidGuard's contrib directory is squidGuardRobot, a spider that goes out and analyzes newly accessed web sites for content and then refreshes the blacklist. Because squidGuard is under the GPL, and you can put whatever blacklist in you want, you have complete control over how filtering works at the usual cost of maintenance. The Open Source Directory offers a plethora of free blacklists that work with squidGuard, arranged according to any category in their directory structure and by content rating (roughly equivalent to G, PG, PG-13 and so on).
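To make the category idea concrete, here is a minimal Python sketch of how a filter might load category lists and check a hostname against them. It assumes each category is a directory holding a plain-text file named domains, one domain per line; that layout, along with the function names load_categories and blocked_categories, is an illustrative assumption and not squidGuard's actual code.

```python
from pathlib import Path


def load_categories(root):
    """Read <root>/<category>/domains files into {category: set_of_domains}."""
    categories = {}
    for domains_file in Path(root).glob("*/domains"):
        with open(domains_file, encoding="utf-8") as fh:
            entries = {line.strip().lower() for line in fh}
        # Drop blank lines and comment lines.
        categories[domains_file.parent.name] = {
            e for e in entries if e and not e.startswith("#")
        }
    return categories


def blocked_categories(host, categories):
    """Return the sorted categories that list host or one of its parent domains."""
    labels = host.lower().rstrip(".").split(".")
    # For "ads.example.com" try "ads.example.com", "example.com" and "com".
    candidates = {".".join(labels[i:]) for i in range(len(labels))}
    return sorted(name for name, domains in categories.items()
                  if candidates & domains)
```

Because the lists are plain text, adding or removing a site is a one-line edit with any text editor, which is the kind of control the proprietary products don't give you.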

Now, we come to Dan's Guardian. Dan's Guardian comes with an interesting licensing setup. It's GPL, which means the Debian folks have zero issues with putting it in their distributions, but it uses the clause in the GPL that allows a vendor to charge for GPL software. The web site says the scheme has been vetted personally by RMS as legitimate use of the GPL; it also passes this author's understanding thereof, for what it's worth. The blacklist is subscription-based, but free for trial use. A form on the blacklist download page allows users to submit URLs for both the whitelist and the blacklist, and, further down the page, you can find feedback on recently submitted URLs.

Whereas squidGuard and other censorware, except Symantec's i-Gear, work on a simple URL list or URL regular expressions, Dan's Guardian actually looks at the content of the web page on the fly, scanning for words and phrases that meet the criteria for blacklisting or whitelisting. You also can use your squidGuard blacklists with Dan's Guardian, which means all that DMOZ stuff works here as well. Dan's Guardian works as a proxy plug-in with Squid the same way squidGuard does. It also works as a plug-in for Oops, another, lighter-weight web proxy.
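The difference between list matching and content scanning is easier to see in code. The Python sketch below scores a page by the phrases it contains, with positive weights pushing toward a block and negative weights pulling toward a pass. The phrases, the weights and the limit of 50 are all made up for illustration, and the names page_score and should_block are hypothetical; this is not Dan's Guardian's actual algorithm or list format.

```python
import re

# Hypothetical phrase weights: positive values push toward blocking,
# negative values pull toward allowing (e.g. educational context).
PHRASE_WEIGHTS = {
    "hot pics": 40,
    "adults only": 30,
    "free casino": 25,
    "health education": -30,
}


def page_score(html, weights=PHRASE_WEIGHTS):
    """Sum the weight of every listed phrase that appears in the page text."""
    text = re.sub(r"<[^>]+>", " ", html)       # crude tag stripping
    text = re.sub(r"\s+", " ", text).lower()   # collapse whitespace
    return sum(w for phrase, w in weights.items() if phrase in text)


def should_block(html, limit=50):
    """Block once the accumulated score reaches the (assumed) limit."""
    return page_score(html) >= limit
```

With these made-up numbers, a page containing both "adults only" and "hot pics" scores 70 and is blocked, while a page that mentions "adults only" next to "health education" scores 0 and gets through. A plain URL list cannot make that distinction for a site it has never seen.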

Dan's Guardian is a cheap way to be CIPA compliant without having to worry about it a whole lot. The software is free for non-commercial use, and an educational-rate subscription for a once-monthly download of the actively maintained blacklist is $5/month or $60/year. I understand free software, but I'm far more interested in having it be free as in speech--the blacklist comes down in readable format--than free as in beer. I'm also not opposed to paying reasonable prices for good software. If you want to do both kinds of free, squidGuard is there. Be my guest; I'll likely join you. But there's something to be said for a cheap way for a busy librarian responsible for the computers to take care of them and not have to worry about what Johnny's going to see or what his mom is going to say about it.

Some of you probably will point me at Privoxy, the Sourceforge project that grew out of Junkbuster. While it's a great way to get rid of the ads and the cookies and the pop-ups, you'd have to convert the squidGuard-type blacklists over to Privoxy's format every time a new list came out, a less-than-efficient use of time and resources. The bandwidth suck on new lists alone is considerable--6MB for a typical one. Although corporate OC-class folk might think this is trivial, believe me, on 56K, it's decidedly painful.
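For a sense of what that conversion and that download involve, here is a rough Python sketch. It assumes the input is a one-domain-per-line list and that the output is a Privoxy actions-file section using a block action with a reason string and leading-dot domain patterns; check the actions-file syntax of your Privoxy version before relying on it. The function names to_privoxy_actions and transfer_minutes are hypothetical.

```python
def to_privoxy_actions(domains, reason="squidGuard-style blacklist"):
    """Render an iterable of domains as one Privoxy block section (assumed syntax)."""
    lines = ["{ +block{%s} }" % reason]
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain.startswith("#"):
            continue  # skip blank lines and comments
        # A leading dot makes the pattern cover the domain and its subdomains.
        lines.append("." + domain.lstrip("."))
    return "\n".join(lines) + "\n"


def transfer_minutes(size_bytes, link_bits_per_second):
    """Best-case transfer time, ignoring protocol overhead and line noise."""
    return size_bytes * 8 / link_bits_per_second / 60
```

Taking 6MB as 6,000,000 bytes and the modem at its nominal 56,000 bits per second, transfer_minutes(6_000_000, 56_000) comes out a little over 14 minutes, and real modem connections rarely reach the nominal rate. The conversion itself is cheap; the cost is in having to fetch and redo it for every list update.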

So, there you go. It is possible to commit censorship in a totally GPL fashion, so that not only do you know what it is you're censoring, but you also can control it to your heart's content. Although open-source advocates generally consider arbitrary, proprietary censorship to be a bad thing--its alternative being one of the reasons behind the Open Source Movement--controlling what comes into your computer and network wisely and with an open mind is a good thing. After all, the big reason this author runs Linux is so that he, himself, controls what does and does not happen on his computers.

Glenn Stone is a Red Hat Certified Engineer, sysadmin, technical writer, cover model and general Linux flunkie. He has been hand-building computers for fun and profit since 1999, and he is a happy denizen of the Pacific Northwest.

