Google vs. AllTheWeb
November 20th, 2001 by Doc Searls
Google has been a Linux community favorite for a long time (it runs on >10,000 Linux boxes). But does it finally face some competition from AllTheWeb.com, which reportedly runs on BSD?
There used to be a debate about which search engine was best. And maybe there still is, but we haven't been hearing much about it because Google is pretty much it. Even Yahoo uses Google. The situation is typified by these remarks posted by Jason Kottke the other day at Kottke.org: "Google has been down for most of the day (for me, at least), so I had to use, ugh, Altavista to search for something earlier. It's the first time I'd used something other than Google in more than a year, and it took me about 3 times as long as normal to find what I was looking for. Google is useful enough that I would pay a $5-8 subscription fee per month for access to it. Google is the default command-line interface to the Web...and well worth paying for."
Now there's a pull-quote for you: "default command-line interface to the Web". And maybe that's what we should expect from a well-funded runaway hack by Linux weenies (who nonetheless have a policy of patenting their software).
When you're the default de facto portal for searching everything on the Web, you don't need to do a lot of PR. So Google doesn't. But they're certainly glad to share info when they're asked, which is what happened when I asked Google's VP Corporate Communications, Cindy McCaffrey, to share a few up-to-date facts about the company. Here's some of what she gave me:
Data centers: 4
Linux computers: >10,000
Searches per day: >150 million
Index of Web pages: >1.6 billion
Image base: >330 million
Usenet messages: >650 million (going back >5yrs)
Newsgroups: >35,000
Language subsets in the index: 28
International domain sites: 23
PDFs: >22 million
Included in searches by file type: wk1, wk2, wk3, wk4, wk5, wki, wks, wku, mw, xls, ppt, doc, wks, wps, wdb, wr, irtf, ans, txt
They also have maps, phone directories, dictionary definitions, Web page translation... the list just keeps growing.
So that's their story, but there are others. About a year ago I started hearing from one of the hackers involved in FAST, which created the search engine found at www.AllTheWeb.com.
Fast Search and Transfer ASA is a Norwegian company with offices in the US and elsewhere. Their original and persistent goal has been to build the world's largest and deepest search engine. Early on they partnered with Dell and Lycos, which ultimately employed FAST engines for searching the Web, images, multimedia and everything else.
And now FAST has rebranded its site as "AllTheWeb", with the tagline "all the web. all the time". And they're doing some aggressive PR. Normally I resist that kind of thing, but I've been warming to these Norwegian guys ever since I started hearing from them, mostly because they felt that they should be no less legit to the community than Google. Their engines run on FreeBSD and were developed on FreeBSD and Linux machines. In fact, FAST's first engine, FTPsearch, was developed under the GPL. You can still download the GPL version of that software at ftp://ftpsearch.ntnu.no/pub/ftpsearch/. Search results are also presented by Apache and PHP.
I was also told that some of the same folks were involved in PHP's development for a long time, and that many of FAST's R&D people in Norway come from one UNIX-oriented computer club at the university in Trondheim. It's called "Programvareverkstedet," or PVV.
Whether it's merit, PR or both, AllTheWeb.com is clearly getting some mojo going. A few days ago Kevin Elliot at About.com wrote, "for searches related to news and current events, it blows the conventional wisdom about Google right out of the water". There's more positive spin at SearchDay, Pandia, Research Buzz and the company's own press release list.
I just ran a quick test of the two services. Here's how they did, at least in terms of returning raw numbers:
"Linux Journal":
"Don Marti":
"Geeks on the Half Shell":
That last one was a real test, because it referred to a real piece that's been up on both the old and the new LJ site since November 7.
So here's a PR lesson for the AllTheWeb folks. If you're going to send out press releases to editors bragging about how fast you crawl news sites, at least crawl the ones you're pitching.
That said, I've been an AllTheWeb user since it started, and I still use their image searches as much as I use Google's. If you're in heavy search mode, it's better to choose between them with AND logic, not OR.
Doc Searls is Senior Editor of Linux Journal.
Re: Google vs. AllTheWeb
On December 3rd, 2001 Anonymous says:
One thing to be said in favor of AllTheWeb: it is much better for specific-phrase searches that include common words such as "the" or "and". With Google, such words are not indexed, so you can't really use them to limit your search results.
Here's a good test. Suppose you're a fan of the band "The The". Enter the band name in Google, and you will get no results. Try it on AllTheWeb, and you'll get lots of results.
Re: Google vs. AllTheWeb
On October 6th, 2004 Anonymous says:
f u
Re: Google vs. AllTheWeb
On December 5th, 2001 Anonymous says:
If you want to search "The The" on Google you will have to do it this way "+The +The" to include the "stop" words.
(quotes used to emphasize query string only)
Re: Google vs AlltheWeb
On December 4th, 2001 Anonymous says:
Bear in mind that Google has just (last two days or so) started recognising "stop words" in phrases. So an a to z of computers will probably only recognise the word COMPUTERS, but "an a to z of computers" (note the quotation marks) should recognise the whole lot. You could get the first version to work by entering +an +a +to +z +of computers
but even sticking in the + sign doesn't work on THE - although Google did announce that they may even include that word in the future.
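For anyone trying to keep the stop-word rules in this thread straight, here is a minimal sketch of how a toy query parser might apply them. It is not Google's or AllTheWeb's code: the stop-word list, the single-character rule and the function name parse_query are all assumptions made for illustration, and it does not model the special case of THE mentioned above.

```python
import re

# Hypothetical stop-word list; no engine's real list is given in this thread.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def parse_query(query):
    """Return the terms a toy engine would actually search for.

    Rules modeled (assumptions based on the comments, not on any
    engine's documentation):
      - a "quoted phrase" is kept whole, stop words included
      - a term prefixed with + is kept even if it is a stop word
      - any other stop word or single character is silently dropped
    """
    terms = []
    # Each match is either a quoted phrase or a run of non-space characters.
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if phrase:
            terms.append(phrase.lower())
        elif word.startswith("+") and len(word) > 1:
            terms.append(word[1:].lower())
        elif len(word) > 1 and word.lower() not in STOP_WORDS:
            terms.append(word.lower())
    return terms

print(parse_query('an a to z of computers'))         # ['computers']
print(parse_query('"an a to z of computers"'))       # ['an a to z of computers']
print(parse_query('+an +a +to +z +of computers'))    # ['an', 'a', 'to', 'z', 'of', 'computers']
```

Under these rules a bare search for The The comes back with no terms at all, which would explain the empty result reported earlier in the thread.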
Google has also announced that their index should be reindexed more frequently - perhaps not as often as the 9-12 days claimed by some engines, including (I believe) AllTheWeb, but not to be sniffed at. And of course, there is Google's Image search and non-HTML file coverage - both of which put everyone else in the shade.
All of which makes me wonder - if Google is so good, how come I make extensive use of AllTheWeb? I love Google, but I still find AllTheWeb outperforms Google 35-40% of the time. It's not down to AllTheWeb's new query rewriting - I use that very sparingly since, as often as not, it completely wrecks the query I'm trying to post. Despite the enhanced News coverage at AllTheWeb, Moreover outperforms both of them for currency, and news.altavista.com offers by far the best archival news search. (Can't imagine that I would use AltaVista much for anything else, though).
But when oh when will one of these great engines come up with the kind of flexibility that Northern Light has been offering for years? Full Boolean, end-truncation, internal single- and multi-character wildcards, nested parentheses, automatically re-running your search as an alert... fantastic! If Google or AllTheWeb start offering that kind of functionality, that really will be the killer engine!
Re: Google vs. AllTheWeb
On November 28th, 2001 Anonymous says:
"Geeks on the Half Shell":
Google: 1
AllTheWeb: 0
Moreover: 1 and it still crawls faster
http://www.moreover.com/cgi-local/page?o=portal&h=Search+results+for...+%22Geeks+on+the+Half+Shell%22&query=%22Geeks+on+the+Half+Shell%22
I took a look at a sample of FAST's stories--those returned by searching for 'afghanistan' in English at c. 3.30 pm GMT on 21 Nov 2001--and compared their pick-up times and relevance with Moreover's. Of the top 10 results, one was a duplicate, two were links to pages of links (not articles) on minor local US papers, and one was an interactive guide to daisy cutter bombs--not irrelevant, but also not a top-ten Afghan news story. Of the remaining six stories, Moreover picked up three of them, 15, 3 and 13 hours earlier than FAST, which also gave the BBC source name on one of these in Russian, not English. Of the three that Moreover did not pick up, FAST picked two of them up 5 and 23 hours after the site claimed that they had been posted (the third is not time-stamped on the site).
The top ten stories returned by a search for 'afghanistan' on Moreover were all news stories, all links went directly to the story, and the biggest gap between the site's claimed posting time and Moreover's pick-up time was 1 hr 55. There was only one story that appeared in both. Like FAST, Moreover returned stories from 7 different sources (counting sections of CNN as one), but whereas 6 of Moreover's were original publications, only four of AllTheWeb's were.
So-- AllTheWeb: 6/10 v Moreover: 10/10 for relevance to 'top ten Afghan stories'. And that's not factoring in the quality of sources.
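For readers who want to check the arithmetic behind that 6/10 v 10/10 score, the tally can be sketched in a few lines. Every figure below is copied from this comment and nothing is newly measured. The category labels and the helper names relevance and mean are made up for illustration.

```python
# Breakdown of each engine's top ten for 'afghanistan', as reported in the comment.
ALLTHEWEB_TOP10 = {
    "duplicate": 1,
    "link page": 2,
    "interactive guide": 1,
    "news story": 6,
}
MOREOVER_TOP10 = {"news story": 10}

# Hours by which Moreover picked up three of FAST's stories ahead of FAST.
MOREOVER_LEAD_HOURS = [15, 3, 13]

def relevance(results):
    """Return (news stories, total results) for a top-ten breakdown."""
    return results.get("news story", 0), sum(results.values())

def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)

print(relevance(ALLTHEWEB_TOP10))   # (6, 10)
print(relevance(MOREOVER_TOP10))    # (10, 10)
print(round(mean(MOREOVER_LEAD_HOURS), 1), "hours average lead")
```

With only three overlapping stories the average lead is a rough indication at best. It says nothing about either engine's crawl speed in general.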
Re: Google vs. AllTheWeb
On November 23rd, 2001 Glennf (not verified) says:
I had the opportunity to write about both Google and Fast/Alltheweb.com for the New York Times in the last few weeks. Google's new document type indexing and HTML conversion of business docs (Word, PowerPoint, Excel, Lotus, etc., etc.) vastly expands the potential for a search engine to peer in the corners of the Web. Their count of 1.6 billion pages is probably too high, though: their duplicate removal isn't as aggressive as Fast's, and they count pages that they have just link text for: pages that they know exist only because of links on other sites.
Fast, on the other hand, turned its attention to beginners and news in the latest update a couple weeks ago. Their news engine is now superior to any other that I've found on the Web. They are spidering 3,000 sources several times an hour. They said that freshness was the focus of this latest update, but they hope to expand out to document types, too, and there's no technical reason that they can't.
Google started getting fresh this summer: try popular blog pages and see how recent the home page index is. Impressive, too.
Re: Google vs. AllTheWeb
On November 23rd, 2001 Anonymous says:
I've been using alltheweb for years, but I'm the only person I know who does so. Sometimes I find their results "better" than ones from google, at other times "worse".
One thing is sure, however: when it comes to "bringing people to my site", alltheweb is of absolutely no importance, while google always ranks among the "top referrers".
Deno (from mandrakeforum)
Re: Google vs. AllTheWeb
On November 22nd, 2001 Anonymous says:
I feel Google is quite a bit faster than AllTheWeb at searching. At the moment, Google seems to be a better and faster search engine than the others.
Filesearch
On November 21st, 2001 Anonymous says:
When I started searching the net, there was only one service I was interested in: Archie via telnet. What else than ftp-able files should I have looked for? Google is really lacking a 5th tab... Maybe in black? When the WWW thing started I switched to a Web-based service provided by the university of Trondheim...