Searching the World Live Web
Editor's Note: The following is the text of the September 15 edition of Doc Searls' SuitWatch newsletter. Sign up to be a subscriber of this bi-weekly newsletter.
September 15--Live Web search got a lot bigger yesterday, when Google launched its new Blog Search engine. There's no direct link on the Google index page yet. For now, you can find it in the roster of services behind the "more" link. There are 29 of those, and Blog Search is the newest.
But the news is still big. It legitimizes the Live Web--and blogging in particular--in a big way.
Far as I know, the blog search category was born when David Sifry put a hack he called Technorati on a Penguin Computing Linux box that lived in his basement while he and I were working on "Building With Blogs", a feature for the February 2003 issue of Linux Journal. Dave needed to research blogs, so he created a tool for it. As of today, Technorati's traffic is #751 on Alexa, pushing 80 million page views per day. (Disclosure: I'm on Technorati's Advisory Board.)
Other Live Web search pioneers include Bloglines, Blogpulse, Feedster, IceRocket and PubSub. The results they yield are radically different from what you get with Wide Web searches, as well as from each other. Mostly, the results are newer. They're also more likely to come from individuals and live news services than from companies with static sites.
Let's say you want to search for Katrina and Interdictor. The latter is the Weblog of Michael Barnett, who helped keep DirectNIC's data center up and running through Hurricane Katrina and the crisis that has followed. Far more than a simple blog, Interdictor also has served as a message board, a tech support line and a zero-bullshit news service.
You'll get results on Google's and Yahoo's main pages, but they won't be especially current. I am writing this on September 14, and the top result on Google is from September 3. Nor can you plumb them through the dimension of time, starting at now.
Do the same search on Blogpulse, and you get results listed backward in time, with the latest at the top. You also can watch trend results for the same search. You can refine results by incrementally adding search terms. And you can track conversations from one URL's "seed". Search for the same thing on Feedster, and you get results listed either by relevance or date.
Do the same search on IceRocket, and you get results grouped by date, starting with today. You can refine your search to today, past week, past month or by date range. You also can follow trends here, and look back on your search history.
Do the same search on Technorati, and you get results from two hours to two days old, with the most recent at the top. The company tries to index everything within minutes. You also can find 5,680 posts tagged "katrina" and 5 tagged "interdictor".
Do the same search with Google's Blog Search, and you get 2,355 results. Although the overall look is similar to its Wide Web results, here you get the option of sorting by relevance or date. You also can go back 100 pages through the first 1,000 results and subscribe to feeds of the search as well. And, as you'd expect, it's much faster than all the others.
You can't search through PubSub, but you can subscribe to keywords and combinations of keywords. These searches are syndicated, so you can receive them in your own aggregator. In fact, most of the Live Web engines provide feeds for searches of keywords, URLs or combinations of both.
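To make the idea of subscribing to keyword combinations concrete, here is a minimal sketch in Python. It is not how PubSub or any other engine works internally; the Subscription and Matcher names, the subscription labels and the sample post are all invented for illustration. It assumes a subscription matches when every one of its keywords appears in a post.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """A standing query: every keyword must appear in a post for it to match."""
    name: str
    keywords: frozenset


@dataclass
class Matcher:
    """Hypothetical matcher that checks each new post against all subscriptions."""
    subscriptions: list = field(default_factory=list)

    def subscribe(self, name, *keywords):
        # Keywords are stored lowercased so matching is case-insensitive.
        self.subscriptions.append(
            Subscription(name, frozenset(k.lower() for k in keywords))
        )

    def match(self, post_text):
        """Return the names of subscriptions whose keywords all occur in the post."""
        words = set(post_text.lower().split())
        return [s.name for s in self.subscriptions if s.keywords <= words]


matcher = Matcher()
matcher.subscribe("storm-watch", "katrina", "interdictor")
matcher.subscribe("katrina-only", "katrina")

# A made-up post arrives; both subscriptions should pick it up.
hits = matcher.match("Interdictor is still posting through Katrina")
print(hits)
```

The point of the design is that the query waits for the posts, rather than the other way around. A real service would also need to handle punctuation, phrases and delivery of the matches as a feed.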
Of the Wide Web engines, only A9 also competes in Live Web searches, using IceRocket.
All of them run on Linux, by the way. No news there, but worth reporting, of course.
So, what's the difference between the Wide Web and the Live Web? Glad you asked.
The simple difference is that the Live Web is syndicated. That means every time something is posted or updated, a notification goes out, informing the world about it. The most familiar syndication method is RSS, which commonly stands for Really Simple Syndication. There are a number of different syndication formats--Google's Blogger uses Atom--but as a class we tend to call them all RSS. Those familiar little orange XML buttons are the common symbol for Live Web search feeds.
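For readers who have never looked behind one of those orange buttons, the following Python sketch shows roughly what an aggregator does with an RSS 2.0 feed: read the items and list them newest first. The feed itself is made up for this example (the titles and example.com URLs are assumptions), and the sketch handles only RSS 2.0, not Atom.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed, invented for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>Older post</title>
      <link>http://example.com/older</link>
      <pubDate>Sat, 03 Sep 2005 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Newer post</title>
      <link>http://example.com/newer</link>
      <pubDate>Wed, 14 Sep 2005 08:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def newest_first(feed_xml):
    """Return (published, title, link) tuples from an RSS 2.0 feed, newest first."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        entries.append((published, item.findtext("title"), item.findtext("link")))
    return sorted(entries, key=lambda e: e[0], reverse=True)


entries = newest_first(SAMPLE_FEED)
titles = [title for _, title, _ in entries]
print(titles)
```

Because each item carries its own publication date, ordering by time comes almost for free.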
Wide Web search engines send out spiders to crawl through every site on the Web. On Google, that's about 8.2 billion pages. Live Web search engines crawl only syndicated pages and only when they're notified by a fresh feed from those pages. So, while Technorati searches through 17.1 million sources, it only indexes pages that send out fresh syndicated feeds.
Here's another way of looking at it: Wide Web indexing is proactive and archives everything, while Live Web indexing is reactive and archives only what's fresh.
Of course, a Live Web engine can archive much more than that, over a long period of time. But what matters most usually is what's freshest--or both relevant and fresh.
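The reactive approach can be sketched in a few lines of Python. This is a toy under stated assumptions, not any engine's actual design: the LiveIndex class, its on_ping method, the two-day freshness window and the example.com URLs are all hypothetical. The index never goes looking for pages; it stores a post only when a source announces one, and a search returns only what is still fresh.

```python
from datetime import datetime, timedelta


class LiveIndex:
    """Toy reactive index: it stores a post only when a ping announces it."""

    def __init__(self, freshness_window):
        self.freshness_window = freshness_window
        self.posts = []  # list of (published, url, text)

    def on_ping(self, url, text, published):
        # Called when a source announces new or updated content.
        self.posts.append((published, url, text))

    def search(self, term, now):
        """Return URLs of fresh posts containing the term, newest first."""
        cutoff = now - self.freshness_window
        hits = [
            (published, url)
            for published, url, text in self.posts
            if published >= cutoff and term.lower() in text.lower()
        ]
        return [url for _, url in sorted(hits, reverse=True)]


now = datetime(2005, 9, 14, 12, 0)
index = LiveIndex(freshness_window=timedelta(days=2))  # assumed window
index.on_ping("http://example.com/a", "Katrina update from the data center",
              now - timedelta(hours=2))
index.on_ping("http://example.com/b", "Early Katrina report",
              now - timedelta(days=11))
index.on_ping("http://example.com/c", "Katrina relief links",
              now - timedelta(hours=30))

# Only the two posts inside the freshness window come back, newest first.
results = index.search("katrina", now)
print(results)
```

A Wide Web crawler would be the mirror image: a loop that visits every known URL on its own schedule, whether or not anything there has changed. In this sketch the older post is still stored, so widening the window would bring it back.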
Another difference is in the rate of change in technologies, standards and practices. This results in highly varied search experiences that are bound to change over time. In the last few months, "tagging" posts (or photos on Flickr, or bookmarks on Del.icio.us) with categorical keywords has proven to be a handy way to discover and peruse ad hoc groupings. Technorati has been providing tag search along with tagging methods for several months now, and others are bound to follow. Meanwhile, Wide Web searching has remained a very consistent experience ever since Google taught users to trust PageRank.
In the few hours since Google's Blog Search appeared, many posts in the Blogosphere have predicted the death of the incumbent Live Web engines. A few minutes ago I spoke to Jason Goldman, who runs both Blogger and Blog Search at Google. Rather than predicting the death of competitors in the Live Web space, he said he expected it to become energized and to grow. He also appreciated David Sifry's blog post, welcoming Google to the space.
Somewhere in there, a friend sent me a message reminding me that Apple also publicly welcomed IBM to personal computing in 1981, when the IBM PC was introduced. The implication was that IBM flattened Apple. In a way, that happened. But it's worth noting also that Apple is still very much around, healthy and a leader in its industry--and some others too. Jason also reminded me that there were many predictions of death for competitors when Google bought Blogger. Instead, the blog creation tool business only got bigger.
Every industry needs its mainstays and its pioneers. The Live Web has both now. And it will be better for everybody if they all do what they do best.
Doc Searls is Senior Editor of Linux Journal, for which he writes the Linux for Suits column. He also presides over Doc Searls' IT Garage, which is published by SSC, the publisher of Linux Journal.