Searching the World Live Web
Editor's Note: The following is the text of the September 15 edition of Doc Searls' SuitWatch newsletter. Sign up to be a subscriber of this bi-weekly newsletter.
September 15--Live Web search got a lot bigger yesterday, when Google launched its new Blog Search engine. There's no direct link on the Google index page yet. For now, you can find it in the roster of services behind the "more" link. There are 29 of those, and Blog Search is the newest.
But the news is still big. It legitimizes the Live Web--and blogging in particular--in a big way.
Far as I know, the blog search category was born when David Sifry put a hack he called Technorati on a Penguin Computing Linux box that lived in his basement while he and I were working on "Building With Blogs", a feature for the February 2003 issue of Linux Journal. Dave needed to research blogs, so he created a tool for it. As of today, Technorati's traffic is #751 on Alexa, pushing 80 million page views per day. (Disclosure: I'm on Technorati's Advisory Board.)
Other Live Web search pioneers include Bloglines, Blogpulse, Feedster, IceRocket and PubSub. The results they yield are radically different from what you get with Wide Web searches, as well as from each other. Mostly, the results are newer. They're also more likely to come from individuals and live news services than from companies with static sites.
Let's say you want to search for Katrina and Interdictor. The latter is the Weblog of Michael Barnett, who helped keep DirectNIC's data center up and running through Hurricane Katrina and the crisis that has followed. Far more than a simple blog, Interdictor also has served as a message board, a tech support line and a zero-bullshit news service.
You'll get results on Google's and Yahoo's main pages, but they won't be especially current. I am writing this on September 14, and the top result on Google is from September 3. Nor can you plumb them through the dimension of time, starting at now.
Do the same search on Blogpulse, and you get results listed backward in time, with the latest at the top. You also can watch trend results for the same search. You can refine results by incrementally adding search terms. And you can track conversations from one URL's "seed". Search for the same thing on Feedster, and you get results listed either by relevance or date.
Do the same search on IceRocket, and you get results grouped by date, starting with today. You can refine your search to today, past week, past month or by date range. You also can follow trends here, and look back on your search history.
Do the same search on Technorati, and you get results from two hours to two days old, with the most recent at the top. The company tries to index everything within minutes. You also can find 5,680 posts tagged "katrina" and 5 tagged "interdictor".
Do the same search with Google's Blog Search, and you get 2,355 results. Although the overall look is similar to its Wide Web results, here you get the option of sorting by relevance or date. You also can go back 100 pages through the first 1,000 results and subscribe to feeds of the search as well. And, as you'd expect, it's much faster than all the others.
You can't search through PubSub, but you can subscribe to keywords and combinations of keywords. These searches are syndicated, so you can receive them in your own aggregator. In fact, most of the Live Web engines provide feeds for searches of keywords, URLs or combinations of both.
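To make the idea of a keyword subscription concrete, here is a minimal sketch in Python. It is not how PubSub or any other engine actually works. The subscription names, the sample posts and the `deliver` function are all invented for illustration. It assumes a subscription is just a list of keywords, and that a post matches when it contains every one of them.

```python
import re

def words(text):
    # Lowercase the text and split it into a set of plain words.
    return set(re.findall(r"\w+", text.lower()))

def deliver(posts, subscriptions):
    # Hypothetical matcher: each new post goes to every subscription
    # whose keywords all appear in the post.
    inbox = {name: [] for name in subscriptions}
    for post in posts:
        post_words = words(post)
        for name, keywords in subscriptions.items():
            if all(k.lower() in post_words for k in keywords):
                inbox[name].append(post)
    return inbox

# Invented subscriptions and posts, for illustration only.
subscriptions = {
    "storm-watch": ["katrina", "interdictor"],
    "linux": ["linux"],
}
posts = [
    "Interdictor reports from New Orleans after Katrina",
    "Katrina relief links",
    "A new Linux kernel is out",
]
inbox = deliver(posts, subscriptions)
```

In this sketch the "storm-watch" subscription receives only the first post, because the second mentions Katrina but not Interdictor. A real service would publish each inbox as a feed of its own.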
Of the Wide Web engines, only A9 also competes in Live Web searches, using IceRocket.
All of them run on Linux, by the way. No news there, but worth reporting, of course.
So, what's the difference between the Wide Web and the Live Web? Glad you asked.
The simple difference is the Live Web is syndicated. That means every time something is posted or updated, a notification goes out, informing the world about it. The most familiar syndication method is RSS, which commonly stands for Really Simple Syndication. There are a number of different syndication formats--Google's Blogger uses Atom--but as a class we tend to call them all RSS. Those familiar little orange XML buttons are the common symbol for Live Web search feeds.
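For readers who have never looked behind one of those orange buttons, the sketch below shows roughly what an aggregator or search engine does with a feed. The sample feed is invented, with made-up titles, links and dates, and `parse_items` is a hypothetical helper. It assumes a plain RSS 2.0 layout; Atom feeds use different element names and would need different handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; every title, link and date is invented.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.com/</link>
    <description>An invented feed</description>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <pubDate>Wed, 14 Sep 2005 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Tue, 13 Sep 2005 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_items(feed_xml):
    # Collect the title, link and publication date of every <item>.
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items

items = parse_items(SAMPLE_FEED)
```

Each entry in `items` is a small dictionary describing one post, which is all a reader or an indexer needs to know that something new has appeared.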
Wide Web search engines send out spiders to crawl through every site on the Web. On Google, that's about 8.2 billion pages. Live Web search engines crawl only syndicated pages and only when they're notified by a fresh feed from those pages. So, while Technorati searches through 17.1 million sources, it only indexes pages that send out fresh syndicated feeds.
Here's another way of looking at it: Wide Web indexing is proactive and archives everything, while Live Web indexing is reactive and archives only what's fresh.
Of course, a Live Web engine can archive much more than that, over a long period of time. But what matters most usually is what's freshest--or both relevant and fresh.
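The reactive model can be sketched in a few lines of Python. This is a toy, not a description of Technorati or any other engine. The `LiveIndex` class, its `notify` and `search` methods and the sample posts are all assumptions made up for illustration. The point is only that the index changes when a source announces something new, and that results come back freshest first.

```python
from datetime import datetime

class LiveIndex:
    """Toy in-memory index that changes only when a source announces a post."""

    def __init__(self):
        self.posts = []  # list of (published, url, text) tuples

    def notify(self, url, published, text):
        # Called when a source reports a fresh post; nothing is crawled proactively.
        self.posts.append((published, url, text))

    def search(self, *terms):
        # Keep posts that contain every term, then list them newest first.
        wanted = [t.lower() for t in terms]
        hits = [p for p in self.posts if all(t in p[2].lower() for t in wanted)]
        return sorted(hits, key=lambda p: p[0], reverse=True)

# Invented notifications, for illustration only.
index = LiveIndex()
index.notify("http://example.com/a", datetime(2005, 9, 3), "Katrina update from the Interdictor")
index.notify("http://example.com/b", datetime(2005, 9, 14), "Interdictor still online after Katrina")
index.notify("http://example.com/c", datetime(2005, 9, 10), "Unrelated post")
results = index.search("katrina", "interdictor")
```

Here `results` lists the September 14 post ahead of the September 3 one and leaves out the unrelated post. A Wide Web engine would instead rank by relevance across everything it had crawled.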
Another difference is in the rate of change in technologies, standards and practices. This results in highly varied search experiences that are bound to change over time. In the last few months, "tagging" posts (or photos on Flickr, or bookmarks on Del.icio.us) with categorical keywords has proven to be a handy way to discover and peruse ad hoc groupings. Technorati has been providing tag search along with tagging methods for several months now, and others are bound to follow. Meanwhile, Wide Web searching has remained a very consistent experience ever since Google taught users to trust PageRank.
In the few hours that have passed since Google's Blog Search appeared, many posts in the Blogosphere have been predicting the death of the incumbent Live Web engines. A few minutes ago I spoke to Jason Goldman, who runs both Blogger and Blog Search at Google. Rather than predicting the death of competitors in the Live Web space, he said he expected it to become energized and to grow. He also appreciated David Sifry's blog post, welcoming Google to the space.
Somewhere in there, a friend sent me a message reminding me that Apple also publicly welcomed IBM to personal computing in 1981, when the IBM PC was introduced. The implication was that IBM flattened Apple. In a way, that happened. But it's worth noting also that Apple is still very much around, healthy and a leader in its industry--and some others too. Jason also reminded me that there were many predictions of death for competitors when Google bought Blogger. Instead, the blog creation tool business only got bigger.
Every industry needs its mainstays and its pioneers. The Live Web has both now. And it will be better for everybody if they all do what they do best.
Doc Searls is Senior Editor of Linux Journal, for which he writes the Linux for Suits column. He also presides over Doc Searls' IT Garage, which is published by SSC, the publisher of Linux Journal.
Comments
Newsletter
I just signed up for the newsletter, I am so excited. Thank You and I look forward to finally getting to read something exciting.
thx for the info
i did not know about Icerocket!
very helpful ...
"Wide Web search engines send out spiders to crawl through every site on the Web. On Google, that's about 8.2 billion pages. Live Web search engines crawl only syndicated pages and only when they're notified by a fresh feed from those pages. So, while Technorati searches through 17.1 million sources, it only indexes pages that send out fresh syndicated feeds.
Here's another way of looking at it: Wide Web indexing is proactive and archives everything, while Live Web indexing is reactive and archives only what's fresh."
This is quite interesting, but can somebody enlighten me on how feeds inform the search engines about their fresh entries? It seems to me that Live Web search engines still need to go out to check the XML source of all the feeds and determine if they have been updated since the last visit. Some feeds might register with Technorati for "ping back", but that's probably a very small portion of the Live Web.
A few corrections:
Daypop was the first blog search engine, around since 2001, well before Technorati.
A9's OpenSearch interoperates with a whole bunch of Live Web search engines. A quick check shows Feedster, Blogwise, Bulkfeeds, Blogdigger, blogdb.jp, blogWatcher and Findory.
Live Web search engines don't only crawl RSS feeds. Some do, some don't. Technorati does not only crawl RSS feeds.
Many Live Web search engines archive everything. They rank by freshness.