Searching the World Live Web

What impact will Google's new blogsearch engine have on the Live Web?

Editor's Note: The following is the text of the September 15 edition of Doc Searls' SuitWatch newsletter. Sign up to be a subscriber of this bi-weekly newsletter.

September 15--Live Web search got a lot bigger yesterday, when Google launched its new blogsearch engine. There's no direct link on the Google index page yet. For now, you can find it in the roster of services behind the "more" link. There are 29 of those, and Blog Search is the newest.

But the news is still big. It legitimizes the Live Web--and blogging in particular--in a big way.

Far as I know, the blog search category was born when David Sifry put a hack he called Technorati on a Penguin Computing Linux box that lived in his basement while he and I were working on "Building With Blogs", a feature for the February 2003 issue of Linux Journal. Dave needed to research blogs, so he created a tool for it. As of today, Technorati's traffic is #751 on Alexa, pushing 80 million page views per day. (Disclosure: I'm on Technorati's Advisory Board.)

Other Live Web search pioneers include Bloglines, Blogpulse, Feedster, IceRocket and PubSub. The results they yield are radically different from what you get with Wide Web searches, as well as from each other. Mostly, the results are newer. They're also more likely to come from individuals and live news services than from companies with static sites.

Let's say you want to search for Katrina and Interdictor. The latter is the Weblog of Michael Barnett, who helped keep DirectNIC's data center up and running through Hurricane Katrina and the crisis that has followed. Far more than a simple blog, Interdictor also has served as a message board, a tech support line and a zero-bullshit news service.

You'll get results on Google's and Yahoo's main pages, but they won't be especially current. I am writing this on September 14, and the top result on Google is from September 3. Nor can you plumb them through the dimension of time, starting at now.

Do the same search on Blogpulse, and you get results listed backward in time, with the latest at the top. You also can watch trend results for the same search. You can refine results by incrementally adding search terms. And you can track conversations from one URL's "seed". Search for the same thing on Feedster, and you get results listed either by relevance or date.

Do the same search on IceRocket, and you get results grouped by date, starting with today. You can refine your search to today, past week, past month or by date range. You also can follow trends here, and look back on your search history.

Do the same search on Technorati, and you get results from two hours to two days old, with the most recent at the top. The company tries to index everything within minutes. You also can find 5,680 posts tagged "katrina" and 5 tagged "interdictor".

Do the same search with Google's Blog Search, and you get 2,355 results. Although the overall look is similar to its Wide Web results, here you get the option of sorting by relevance or date. You also can go back 100 pages through the first 1,000 results and subscribe to feeds of the search as well. And, as you'd expect, it's much faster than all the others.

You can't search through PubSub, but you can subscribe to keywords and combinations of keywords. These searches are syndicated, so you can receive them in your own aggregator. In fact, most of the Live Web engines provide feeds for searches of keywords, URLs or combinations of both.
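A keyword subscription turns search around: the query is stored first, and each new post is checked against it as it arrives. Here is a minimal Python sketch of that idea. The class name, method names and sample subscribers are hypothetical, invented for illustration; this is not a description of how PubSub works internally.

```python
import re


def tokenize(text):
    """Lowercase a post and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Subscriptions:
    """Stored keyword queries that are matched against posts as they arrive."""

    def __init__(self):
        # subscriber name -> list of keyword sets (hypothetical structure)
        self.by_subscriber = {}

    def subscribe(self, subscriber, *keywords):
        """Register a combination of keywords; all of them must appear in a post."""
        terms = frozenset(k.lower() for k in keywords)
        self.by_subscriber.setdefault(subscriber, []).append(terms)

    def match(self, post_text):
        """Return the subscribers whose keyword combination appears in the post."""
        words = tokenize(post_text)
        return sorted(
            name
            for name, queries in self.by_subscriber.items()
            if any(query <= words for query in queries)
        )


subs = Subscriptions()
subs.subscribe("alice", "katrina", "interdictor")
subs.subscribe("bob", "linux")
matched = subs.match("Interdictor keeps blogging through Katrina")
```

In a real service, each match would be appended to that subscriber's own feed, which is what lets the results show up in an aggregator.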

Of the Wide Web engines, only A9 also competes in Live Web searches, using IceRocket.

All of them run on Linux, by the way. No news there, but worth reporting, of course.

So, what's the difference between the Wide Web and the Live Web? Glad you asked.

The simple difference is the Live Web is syndicated. That means every time something is posted or updated, a notification goes out, informing the world about it. The most familiar syndication method is RSS, which commonly stands for Really Simple Syndication. There are a number of different syndication formats--Google's Blogger uses Atom--but as a class we tend to call them all RSS. Those familiar little orange XML buttons are the common symbol for Live Web search feeds.
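To make that concrete, here is a minimal Python sketch that reads a small RSS 2.0 feed and lists its items newest first. The feed content and URLs are made up for illustration, and the function name is hypothetical; a real engine would fetch the feed over HTTP and would have to cope with Atom and other formats as well.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed content, inlined so the example is self-contained.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.com/</link>
    <item>
      <title>Older post</title>
      <link>http://example.com/older</link>
      <pubDate>Sat, 03 Sep 2005 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Newer post</title>
      <link>http://example.com/newer</link>
      <pubDate>Wed, 14 Sep 2005 18:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return the feed's items as dicts, sorted newest first."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # RSS 2.0 dates use the RFC 822 style that email.utils understands.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return sorted(items, key=lambda entry: entry["published"], reverse=True)


items = parse_items(SAMPLE_RSS)
for entry in items:
    print(entry["published"].date(), entry["title"], entry["link"])
```

The publication date on each item is what lets a Live Web engine put the latest post at the top.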

Wide Web search engines send out spiders to crawl through every site on the Web. On Google, that's about 8.2 billion pages. Live Web search engines crawl only syndicated pages and only when they're notified by a fresh feed from those pages. So, while Technorati searches through 17.1 million sources, it only indexes pages that send out fresh syndicated feeds.

Here's another way of looking at it: Wide Web indexing is proactive and archives everything, while Live Web indexing is reactive and archives only what's fresh.
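The reactive side can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class, its methods and the fake sources are all hypothetical, and the "fetch" step is a dictionary lookup standing in for a real HTTP request. It is not how any particular engine is built.

```python
class LiveIndex:
    """Toy index that fetches a source only after that source notifies it."""

    def __init__(self, fetch):
        self.fetch = fetch      # callable: source URL -> list of (post URL, text)
        self.posts = {}         # post URL -> (arrival order, text)
        self.fetch_count = 0    # how many times a source was fetched
        self._seq = 0

    def ping(self, source_url):
        """Handle an update notification; return the number of new posts indexed."""
        self.fetch_count += 1
        added = 0
        for post_url, text in self.fetch(source_url):
            if post_url not in self.posts:
                self._seq += 1
                self.posts[post_url] = (self._seq, text)
                added += 1
        return added

    def search(self, term):
        """Return matching post URLs, most recently indexed first."""
        term = term.lower()
        hits = [
            (seq, url)
            for url, (seq, text) in self.posts.items()
            if term in text.lower()
        ]
        return [url for seq, url in sorted(hits, reverse=True)]


# Hypothetical sources standing in for the Web.
FAKE_WEB = {
    "http://example.com/a": [("http://example.com/a/1", "Katrina update from the data center")],
    "http://example.com/b": [("http://example.com/b/1", "Unrelated post")],
}

index = LiveIndex(lambda url: FAKE_WEB.get(url, []))
index.ping("http://example.com/a")   # only source "a" announces an update
results = index.search("katrina")
```

Source "b" never pings, so it is never fetched and its post never appears in the index. A proactive crawler would have visited both.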

Of course, a Live Web engine can archive much more than that, over a long period of time. But what matters most usually is what's freshest--or both relevant and fresh.

Another difference is in the rate of change in technologies, standards and practices. This results in highly varied search experiences that are bound to change over time. In the last few months, "tagging" posts (or photos on Flickr, or bookmarks on Del.icio.us) with categorical keywords has proven to be a handy way to discover and peruse ad hoc groupings. Technorati has been providing tag search along with tagging methods for several months now, and others are bound to follow. Meanwhile, Wide Web searching has remained a very consistent experience ever since Google taught users to trust PageRank.
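Tag search, mentioned above, can be pictured as a lookup table from each tag to the posts that carry it. Here is a minimal Python sketch; the function name and sample URLs are made up for illustration, and nothing here describes how Technorati actually stores tags.

```python
from collections import defaultdict


def build_tag_index(posts):
    """Map each lowercase tag to the list of post URLs that carry it."""
    index = defaultdict(list)
    for url, tags in posts:
        for tag in tags:
            index[tag.lower()].append(url)
    return index


# Hypothetical tagged posts.
posts = [
    ("http://example.com/1", ["Katrina", "neworleans"]),
    ("http://example.com/2", ["katrina", "interdictor"]),
    ("http://example.com/3", ["linux"]),
]

tag_index = build_tag_index(posts)
katrina_posts = tag_index["katrina"]
```

Because authors pick the tags themselves, the groupings are ad hoc: nobody has to define a category before posts can gather under it.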

In the few hours that have passed since Google's Blog Search appeared, many posts in the Blogosphere have been predicting the death of the incumbent Live Web engines. A few minutes ago I spoke to Jason Goldman, who runs both Blogger and Blog Search at Google. Rather than predicting the death of competitors in the Live Web space, he said he expected it to become energized and to grow. He also appreciated David Sifry's blog post, welcoming Google to the space.

Somewhere in there, a friend sent me a message reminding me that Apple also publicly welcomed IBM to personal computing in 1981, when the IBM PC was introduced. The implication was that IBM flattened Apple. In a way, that happened. But it's worth noting also that Apple is still very much around, healthy and a leader in its industry--and some others too. Jason also reminded me that there were many predictions of death for competitors when Google bought Blogger. Instead, the blog creation tool business only got bigger.

Every industry needs its mainstays and its pioneers. The Live Web has both now. And it will be better for everybody if they all do what they do best.

Doc Searls is Senior Editor of Linux Journal, for which he writes the Linux for Suits column. He also presides over Doc Searls' IT Garage, which is published by SSC, the publisher of Linux Journal.

Comments

Newsletter

felix's picture

I just signed up for the newsletter, I am so excited. Thank You and I look forward to finally getting to read something exciting.

thx for the info

ajondo's picture

i did not know about Icerocket!
very helpful ...

" Web search engines send out

Qun Cao's picture

"Wide Web search engines send out spiders to crawl through every site on the Web. On Google, that's about 8.2 billion pages. Live Web search engines crawl only syndicated pages and only when they're notified by a fresh feed from those pages. So, while Technorati searches through 17.1 million sources, it only indexes pages that send out fresh syndicated feeds.

Here's another way of looking at it: Wide Web indexing is proactive and archives everything, while Live Web indexing is reactive and archives only what's fresh."

This is quite interesting, but can somebody enlighten me on how feeds inform the search engines about their fresh entries? It seems to me that Live Web search engines still need to go out to check the XML source of all the feeds and determine if they have been updated since the last visit. Some feeds might register with Technorati for "ping back", but that's probably a very small portion of the Live Web.

A few corrections: Daypop

Anonymous's picture

A few corrections:

Daypop was the first blog search engine, around since 2001, well before Technorati.

A9's opensearch interoperates with a whole bunch of Live Web search engines. A quick check shows Feedster, Blogwise, Bulkfeeds, Blogdigger, blogdb.jp, blogWatcher and Findory.

Live Web search engines don't only crawl RSS feeds. Some do, some don't. Technorati does not only crawl RSS feeds.

Many Live Web search engines archive everything. They rank by freshness.
