Linux for Suits - Atlas: Hoisting a New World of Search
There is history here. We first covered Jabber in the September 2000 issue of Linux Journal, more than seven years ago. At the time, Jeremie Miller, who invented Jabber, told me at least five years would pass before Jabber's protocol—later dubbed XMPP and approved by the IETF in 2003—would establish itself as a de facto standard. He was right.
We've stayed in touch over the years, as Jeremie's interests have spread outward from messaging and presence to other subjects—especially search. I guess it was about two years ago that he began to question whether search needed to be a single-source thing (for example, Google or Yahoo). That was when he also started talking about his new search project, called Atlas. Whenever I'd ask him what he was working on, he'd reply, “Atlas”.
I was all for it, because I've felt from the start that search engines are essentially kluges meant to overcome a directory deficiency in the Web itself. I've also thought that, although there was good to be found in the chaotic nature of everything to the right of the first single slash in every URL on the Web, there was something inherently wrong about relying on massive commercial advertising-powered search engines—with their bots and crawlers and proprietary weighting algorithms chugging through constantly updated indexes stored in hundreds of thousands of servers—just to find stuff. And, although I agree with David Weinberger that Everything Is Miscellaneous (the title of his excellent new book), I don't like relying on Google, Yahoo or MSN alone (usually just Google) to tell me what I mean when I search for something. Nothing matters more than meaning, and I don't like seeing it supplied by what Jeremie calls a “text box dictatorship”. In a May 28, 2007 blog post, he asks:
Why in such an advanced civilization have we become Knowledge Peasants who are so easily placated by the black magic of our Goovernor? Am I the only one wondering why these commercial boxes own such an important social function: what everything means?
The answer, he says, is:
Open open open! Open source, open distributed grids, open algorithms, open rankings, open networks of people cooperating to provide resources. The future of search is in open cooperation (and competition) based on a Meaning Economy—create meaning, exchange meaning, serve meaning.
My vision begins with an open protocol, allowing independent networks of search functions (crawling, indexing, ranking, serving, etc.) to peer and interop. All relationships between these networks are always fully transparent and openly published. Networks exchange knowledge between them, each adding new meaning to the information, each of them responsible for the reputations of their participants and peers. This is the very foundation of a Meaning Economy.
Tomorrow now has a meaning that we can all help build.
Jeremie hasn't been the only one on the open search case. Jimmy Wales, prime mover behind Wikipedia, attracted attention in December of last year when he said, “I want to create a completely transparent, open-source, freely licensed search engine”—as part of Wikia.org, a community development companion to Wikipedia. In a December 29, 2006 interview (www.wired.com/techbiz/it/news/2006/12/72376?currentPage=2), Wired asked him for specifics about that. Jimmy's reply was, “We don't know. That's something that's really very open-ended at this moment. It's really up to the community, and I suspect that there won't be a one-size-fits-all answer. It will depend on the topic and the type of search being conducted.” Subsequent interviews were similarly speculative and open-ended.
Then, on May 1, 2007, came news that Jeremie was joining the Wikia Project. In a prepared statement (news.com.com/2100-1032_3-6180379.html), Jimmy Wales said, “Jeremie is a brilliant thinker and a natural fit to help revolutionize the world of search....I believe Internet search is currently broken, and the way to fix it is to build a community whose mission is to develop a search platform that is open and totally transparent.”
Atlas was unveiled in a post to a list on July 5, 2007 (lists.wikia.com/pipermail/atlas-l/2007-July/000000.html). In that post, Jeremie said his “large vision” is “enabling search to become a part of the Internet's infrastructure. Building on Atlas as an open protocol, search can become a fully distributed and interoperable world-wide community. All of the participants can interact openly and in any role where they believe they can add value to the network.”
As for architecture, he offered this:
There are three primary roles within Atlas:
Factory—responsible to the content.
Collector—responsible to the keyword.
Broker—responsible to the Searcher.
Each of these actors must interact with the others to complete any search request. Any two roles could be performed by a single entity (whereas if all three are performed by one entity, the result would be a traditional, monolithic search engine).
A Factory is akin to a crawler in today's search engines. An Atlas Factory must fetch and process the content as intelligently as possible, performing analysis (such as Natural Language Processing) and normalizing it into distinct units. A Factory shares its highly refined and processed output with one or more Collectors based on who they believe is best utilizing it.
A Collector absorbs and indexes output from one or more Factories, with one primary goal: ranking. An Atlas Collector must provide the most intelligent ranking and relationship analysis possible. A Collector has to compete for the output of a Factory, as well as compete to provide the best ranking quality for Brokers.
A Broker must provide a Searcher with the best possible results. It does so by combining diverse ranking results from Collectors and also by retrieving content from the original Factories. This last step, a Broker interacting with a Factory, is critical to maintaining a balanced ecosystem. All Factories must be aware of and approve how their results are being used and by whom.
Reputation and reward is bi-directional between all parties (Factory-Collector, Collector-Broker and Broker-Factory). Each entity may choose to interact on principle (free, Commons), attribution (results provided by), or commercially (as a paid service). The Atlas protocol is purely a facilitator and does not restrict how the relationships between any entities are formed. In considering these motives for the various entities, it's likely that the free-based networks will tend to become more specialized, commercial ones will compete on quality, and attribution-based networks will mature in both directions.
This simple yet powerful division of roles, responsibilities and relationships will result in a distributed economic foundation for an Internet Search Infrastructure. The wire protocol and further definition of the interactions between these entities is openly evolving; anyone interested is welcomed to join the discussions and see the initial proposals at lists.wikia.com/mailman/listinfo/atlas-l over the coming weeks.
As a kind of gauntlet, Jeremie threw down a summary challenge, “Nobody will beat Google, but EVERYBODY will.”
Vigorous discussion ensued, as a rapidly growing community of developers began getting into what Jeremie calls “the dirty work of building it openly now”. As that work began, I asked him if it would be cool to flow some of our conversation over here to Linux Journal. He said sure, so here it is.
DS: There is always this tug between monolith and polylith. The irony of the Net as a Giant Zero (world of ends) is that it is entirely polylithic—or wants to be. Centralizing a future polylithic protocol into a monolithic service is one way it starts. But the end state is polylithic.
JM: I agree, it's inevitable. It's the being of the Net itself that ultimately demands it, but Google is fighting to be a monolith for as long as possible...and that's fine, they'll embrace Atlas when they see it providing value.
DS: In your announcement [above] I see the seeds of a credit-where-due-based economic system—one in which we might obtain finders fees, or something like that, as value is given for service performed.
JM: The attribution-based model, yes. Absolutely the middle man needs to be involved in the transaction. Atlas doesn't flow the money, but it does flow the information and provide a framework. The same with advertising. Really, contextual ads are very helpful. I rely on them as a tool when using Google search. And in fact, that model is the best form of fighting Web spam.
How the system works, and who is involved in the flow of information, is completely transparent, so the three actors—a Factory, a Collector and a Broker—are all involved in providing a search result. A Broker works on behalf of the Searcher. They have relationships with the appropriate Collectors, plural, and perform the queries—assembling all the relevant “Knuggets” they get back from the Collectors, valuing them based on whatever metrics they want, including talking to a “sponsored” Collector who serves only commercial results, any “local” Collectors for regional areas, and so on.
Competition is fierce when anyone can be a Broker for almost no cost other than relationships. So, a Collector has one job, provides relevant results, and has to compete with anyone else to do so. And, it can judge the relevancy via any algorithmic, human-reputation-based, or combination.
DS: Open source at the production end has always been a meritocracy. Seems to me the equivalent with Atlas to “show me the code” is “show me the results”. Or at least, “show me the relevancy”. No?
JM: The Factory is managing commodity access to the refined content, doing all the work of normalizing the Web. Search results are just merit: who has the best. And a Broker going to many sources, many Collectors (there can be lots of them) gains lots of merits in different forms. So foxmarks has a great database of deep links that are very important to people. That way Mitch can serve high-quality results but only for certain categories of queries. And Wikipedia can serve another category of queries with high relevancy—as can local yellow-page-style systems, as can social networks for people queries.
DS: I see implicit in this a respect for the snowballing nature of knowledge, both for individuals and for groups. To be human is to grow what one knows. Authority is the right we grant certain others to contribute to what we know—and to change us in the process. Knowing more makes me different. As has been said elsewhere, “we are all authors of each other”.
JM: Yes, the fundamental unit of Atlas is a “Knugget”, a Knowledge Nugget essentially, a search result. A Factory adds value based on what it knows about the content; a Collector adds value based on what it knows about keywords and ranking Knuggets; and a Broker adds value based on what Collectors it knows and what value they provide in aggregate.
DS: Wikipedia, in growing its own relevance, is an interesting example. I've been looking at radio stations and Webcasters. Wikipedia on the whole is a great source of info, but it's far from complete. And, it needs a better way to stay complete than just relying on narrow subject obsessives to stay on top of the current narrative. Search results that feed into better Knuggets that turn into better Wikipedia entries should be a Good Thing, no?
DS: Is a Knugget “something somebody wants to know”? I like the word. How about if it's a combination of keywords that may change over time?
JM: A Knugget is one unit of context, as I define it. It may be a title and a link; it may be a sentence saying something about a noun; it might be a row from a table of things. It's human-defined and, therefore, very fuzzy by its very nature. It's “What would a human recognize and make some sense of, out of context of anything other than what's contained inside of it?” The Web is human, not machine, and Atlas reflects that.
DS: I like contextuality. The summit of Mt. Everest can be an elevation, a sum of climbers, a single fact (such as, it is marine limestone).
JM: Yes. The very nature of Atlas is to demand that a Factory produces the best Knuggets, that a Factory “understands” the content as best as it can. It is a model that rewards human understanding and value first. All derivative knowledge is built atop that foundation, and we can reduce all of these inflated DB/schema disasters, which all serve the machine first. Atlas works in the same way that the Internet served people first, and servers/data centers second—and the Web content serves people first and software second. Search-as-Internet-infrastructure must serve people first.
DS: Yes. I think there is a Static (traditional Google) vs. Live (human, now, evolving) distinction here. Allow me to quote myself at a bit of length: “And as for live feeds of Knuggets as they are produced, any content provider can get the most value by generating these feeds of Knuggets into Collectors they trust, search results can be instantly rewarded, a Searcher can find that breaking-news article immediately.”
One becomes an Atlas Broker just by being involved, I would gather.
JM: Just by searching, and knowing whom to ask for results.
DS: How do you want to seed this thing? Where does actual use start first?
JM: An Atlas Broker should be living here in my IM client, using all this great context, LOCALLY, to search my IM history as well as provide the best relevant search results from any content store I want it to.
DS: Who is the Broker?
JM: The Broker is a human. The Broker is just a service that uses as much context as you, the Searcher, wants to give it, and it talks to many Collectors to merge/provide the best results.
DS: So, how are you seeding this thing?
JM: I'm already starting to set up Atlas Factories. We may announce a contest to build open-source Collectors that can do different/cool things with, say, the Internet Archive. Once Atlas starts to breathe on its own, and can publish that archive either openly or gated (attribution-based or paid), or if it can rank Web results better for a certain class of queries, it can use that data as a Collector and offer that again, openly or gated. But that's speculative and premature. It's all wide open at this stage.
DS: Outstanding. I know some investors have leaned on Technorati to blow away the archives because that's not what they search. Yet.
I guess I need to get a sense of what a search might look like, and how it would come up with stuff that's different from Google's static search and Technorati's live (chrono) search. Would I go to atlas.wikia.org to search? Or to...where? Or is that question too static or site-based?
JM: I don't have a “static” presence for Atlas yet. Kind of refusing even to do that for whatever reasons. Just a link to the mailing list for now is all there is. Someday there will be a presence, but the discussion is more important right now. I like it that all people can do at the moment is discuss.
DS: I'm just looking at how to help people conceive What It Is. Is it a site? A service that other sites, or even IM systems, or cell-phone apps, can use?
JM: Atlas is an idea, a model and, ultimately, a communication system between two people: the one that wants to learn and the one that wants to teach/share. It's just another communication platform, but the people talking don't know each other yet. Like all good Internet systems, it will live under the hood, behind text input boxes everywhere.
DS: In that respect, it's more like the Jabber “platform” than the AIM or Skype “platforms”.
JM: Yep, in that Google, Yahoo and Ask are the IM silos, and Atlas is the distributed/open Jabber model.
DS: Good. I get that. Is Atlas code that sits somewhere and is given a bunch of stuff to look at? If so, where does it live? Is there a drawing we can use? Some kind of simple whiteboarding?
JM: Where stuff is almost doesn't matter. As far as visualizations go, the “logical” one is really Factory→Collector→Broker, but the technical one is more of a triangle, as the Broker talks to the Factory in the end to get the Knuggets...but basically no, there's nothing visual yet. A Factory is going to “look” like a pile of search results ordered based on the content source alone. A Collector will aggregate/order them. And a Broker will aggregate from lots of Collectors the ordered results, merging them, getting the snippets from the Factories they came from, and presenting them. By the way, someone made the first visualization of the model (Figure 1).
DS: Where would they live? The actors...Factory, Collector, Broker...or the code for it all?
JM: Oh, anyone can run any of them. There will be open-source projects to provide each of them. But, there should be thousands of each, running everywhere around the world. Just like Jabber, e-mail and Web servers. They just talk to each other with a protocol. There can be locale-based specific instances, ones for different languages, ones for types of content (images, videos), ones for topics (gaming, finance)—whatever people want to do/specialize in. People can run them for whatever reasons they want. A Broker is really the endpoint, so the nature of the search has the Broker engaging the relevant Collectors. A Broker is doing what the word really means, brokering your search to lots of sources for the best results. A Broker is what would likely be built in to your browser or whatever is driving an input box anywhere.
DS: I like the way it maps to the real world.
JM: It doesn't force an operational model, and it just goes with whatever motives people have to run it. The good part is that it will work only if people find it valuable enough to run it; those are the best kinds of systems. So is all of this clear as mud?
DS: Good mud!
For more, look up Jeremie and Atlas on Google.
Until you can do the reverse.
Doc Searls is Senior Editor of Linux Journal. He is also a Visiting Scholar at the University of California at Santa Barbara and a Fellow with the Berkman Center for Internet and Society at Harvard University.
Doc Searls is Senior Editor of Linux Journal
- Is the Private Cloud a Real Cloud?
- Give new life to old phones and tablets with these tips!
- Readers' Choice Awards 2014 Poll
- rsync, Part II
- Tech Tip: Really Simple HTTP Server with Python
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- Memory Ordering in Modern Microprocessors, Part I