Linux for Suits - Unusual Suspects
Building infrastructure for a search engine like Technorati's, which scours and archives the “live” Web of RSS and similar feeds, is like building a series of barns, each after the last one burns down. When recounting the brief history of his company, David Sifry usually says, “Well, after the first infrastructure fell down...and after the second infrastructure fell down...and after the third infrastructure fell down....” No wonder he also says, “Scaling is everything.” Instructive failure is a less-glamorous description of trailblazing.
Technorati's trail started in David's basement, on a server that Penguin Computing loaned to David and me for a feature on blogs for the March 2003 issue of Linux Journal. See “Building with Blogs” and “The Technorati Story” in the on-line Resources. The service instantly was popular and has continued to grow steadily. By the time you read this, Technorati will be watching nearly four-million syndicated sources, mostly Weblogs, and well over a half-billion links. (Disclaimer: I'm on Technorati's advisory board.) Think of Technorati as a search engine for stuff that's too new for Google—plus a platform for free and paid services based on the company's growing archives and countless potential forms of derived data.
One can look at Technorati as another fast-scaling LAMP hack. But when I talk to Technorati's techies, it turns out some of the most interesting applications are from open-source suspects outside the familiar LAMP list of Linux, Apache, MySQL, PHP, Perl, Python and PostgreSQL. I recently asked the company's new VP of Engineering, Adam Hertz, who came over from Ofoto in July 2004, to tell me more about one or two of those useful but low-box-office tools. He immediately pointed to Memcached and handed the explaining over to Ian Kallen, senior architect on his staff.
“A lot of what we're doing has never been done before”, Ian told me. “But a lot of the scaling problems we hit along the way have been experienced before. And the result is open-source freeware, often very robust, that eases exactly the pain we're experiencing.” One such result is Memcached, a distributed caching system developed by Danga Interactive, the folks behind LiveJournal, which accounts for a very large percentage of the blogs Technorati follows. Ian explains:
Lookup queries in a large data repository can be expensive. If lookups are repeated at a rate near or less than the rate of change, it makes sense to spare the application repeated round trips to fetch the data. An obvious solution is to have the application process keep lookup results in a cache. The downside is that caching is typically resource consumptive; in a multiprocess application such as a Web service, having each process (or thread) keep its own cache reduces the cache efficiency and bloats the application resource footprint. The solution to this is a shared cache.
High application availability requirements are typically met with server redundancy. Now if each server instance has a cache of its own, the cache efficiency reduction becomes as ridiculous as having individual processes share a cache on a single host. The solution there is a distributed cache and that's where Memcached comes in.
Listening on a network socket and utilizing very simple parameters to instrument memory allocation, Memcached maintains an in-memory dictionary of keys that is dynamically populated with values. Technorati uses search parameters as keys and search results as values. The payback comes when search results that are duplicative of previously run queries return; they come back at least an order of magnitude faster.
In addition to its speed and simplicity, Memcached's other principal strength is its flexibility. It can store objects in language-native serialization formats such as Perl's Storable or Java's Serializable. Technorati uses PHP's native serialization to store cache results. However, Memcached can store any bytes that can be used interoperably; one cache client implemented in Java can access and read a cached XML document stored by another client written in Perl, Python or PHP. This can be a huge win for heterogeneous development environments.
Application stabilization and performance optimization is a critical concern for burgeoning data repositories such as Technorati's. Caching isn't the only thing in the software toolbox. Application architectures can still benefit or suffer from the quality of other design decisions. However, the integration challenge posed by Memcached is so low as to make it a primary consideration for data retrieval acceleration.
When I asked Adam again about other less-known tools that might be in his box, he said the company also was about to use the Spread Toolkit—a language-independent messaging system that allows updates, events and information to flow through distributed systems. He explains:
When a blog pings us, think of that as an event. Our spider responds and goes and gets the new content. It then puts the update on the message bus, so any application can see the message as it goes by. Each subscriber—a Technorati application or service—gets a chance to see every message that goes by. It can either pick it up or pass on it. An application can say, “hey, am I interested in this update?” This way, we reduce the internal query load on the database, and we also make the applications more real-time. This is called multicast. It's a service similar to what TIBCO has been for financial systems.
For example, a blog post about a political book might be interesting to Book Talk, News Talk and Politics Today. Rather than having all three applications querying the database to find relevant updates in the last five minutes, that blog post would travel across the system and be picked up by all three applications.
This not only speeds up performance for users and applications, but opens lots of new service opportunities for users, outside applications and services using the Technorati API.
Jonathan Stanton, one of Spread's creators at Johns Hopkins University, added this:
Spread (also) provides fault-tolerant messaging. Meaning if some machines that are part of a Spread group crash or are partitioned from the others, Spread guarantees that all the machines that are still connected will see the same set of messages. This strong semantic group messaging can be used to build replicated databases (one open-source project building a replicated database using Spread is Postgres-R) and guarantee consistency in clustered caching systems (like Technorati's).
Five interesting things here. First, note where the Technorati doesn't look for answers. The old industrial model assumes that you obtain the expertise you need internally (from a responsible position on the company org chart), or you buy products and expertise from outside vendors and consultants. Here, we have a CIO and a lead engineer looking outside the company for infrastructural building materials, plus experts and expertise that are both in the marketplace yet not for sale. In other words, tools like Memcached and Spread are part of a larger conversation, and a larger set of relationships, than a corporate customer like Technorati would get from a commercial vendor. This isn't a knock on vendors, but it reminds us how markets defined in terms of vendor relationships, and of sales volumes in product categories, fail to include a large and growing part of the market ecosystem.
Second, the advantages of using open-source tools, and of participating in development projects like Memcached and Spread, are likely to be lost on traditional IT shops that discourage trailblazing DIY work. Adam Hertz explains:
Two themes: BigCoIT is all about standardization and isolationism.
Standardization, so the story goes, reduces risks and costs. It certainly reduces complexity, but it can take a huge toll on flexibility and responsiveness. Standardization often involves using one multipurpose tool or platform to accomplish lots of different purposes. This often involves customization, which is done by in-house experts or professional services firms. Great examples are Siebel, Lotus Notes and so on.
DIY shops tend to be cynical, or even downright frightened, of systems like that, because they're so inflexible and unhackable.
Another form of standardization is what people are allowed to have on their desktop PCs. In a lot of big shops, everyone has the same disk image, with all applications pre-installed. There's a huge suspicion of anything that comes from the outside world, especially open source. It's regarded as flaky, virus-laden, unscalable and so on. This produces isolationism, which means that there are major barriers merely to try something.
In more open environments, there's a permeable membrane between the corporate IT environment and the Net. People tend to get new tools from the Net, usually open source, and just give 'em a spin. Culturally, this keeps the organization open to innovation and new approaches. It builds bonds between the employees and the development community at large.
Standardized, isolationist shops miss out on all of this. The maintain control, but they inevitably fall behind.
Third, Technorati isn't only solving problems, although problem solving does soak up a lot of IT cycles. It's also looking to tools like Memcached and Spread to open new business and other opportunities. When I asked Adam about using Spread to expand Technorati's Web service offerings, he said the opportunities, already wide open, only get wider:
It would also be very useful to implement customized watchlists, for example. Suppose you had some weird-assed query...or more accurately, filter. Like, very specific sorts of posts you want to find out about—posts that mention you, Linux Journal and open source, for example. We could set that up as a subscriber to updates. It could watch all the updates come by, discard the 99.99% of updates that aren't relevant, and when it gets a relevant one, it could send you mail or something. Contrast that with our current watchlist implementation that queries the database. Then and imagine if—or when! —we have 100,000 watchlists.
Fourth, Adam is quick to defer to his engineers' expertise, among which is finding useful and free open-source tools in the marketplace. The sounds Adam makes when he talks about his work remind me of those Linus Torvalds makes when he talks about kernel maintainers. Linus said:
It is very hard to find people who don't flame and are calm and rational—and have good taste. I mean it's like...give me one honest man. It doesn't happen...too much. And at the same time, when it happens, it matters a lot. Just a few of these people make a huge difference.
Fifth, it becomes clear that trailblazing work by tasteful flameproof engineers is what gives the world the infrastructure it needs to supports real business, in addition to maintaining the mundane internal operations of enterprises.
Just in the last few days, while I was writing this column, two other open-source tools came to my attention. They're interesting because they represent two extremes in the marketplace: one external, one internal.
The external one is the Gallery Project, which appears to be establishing itself rapidly as the Apache of photo gallery software. Within a week, Gallery went from something I barely knew to something it seemed everybody was telling me about. It's an extraordinarily rich and deep system that surely will be far more rich and deep by the time you read this. It shows that Apple, Adobe and Microsoft aren't the only ones who care to provide users and photographers with useful tools—and that the tools that matter most aren't those that live on clients or vendor hosting services, but rather on photographers' own servers, or those of their friends, relatives and other parties, including businesses. Gallery is pure DIY infrastructure and another dramatic example of how the Net and open source allow the demand side to supply itself.
I also think it's significant that Galley scratches an itch that isn't only technical. Although it's a bit of a hack to install, it's what marketers call a consumer-facing service. It will be interesting to watch the market effects over the coming year or two.
The internal tool is Yum (Yellowdog Updater Modified), an automatic updater and package installer/remover for RPM-based distributions, developed by the mathematics department at Duke University—specifically by Seth Vidal, the Senior Systems Programmer there.
Yum is public code, with public source repositories and a GPL license. Yet it was developed by Duke for its own internal requirements. Richard Hain, chair of the Math department, recently told me more than 100 Linux machines run in that department alone, with a serious commitment to running Linux on the desktop, as well as on larger machines for serious number crunching. Most office staff are running Linux desktops. Microsoft Windows users are running on Win4Lin. Yum is a necessity in their environment, Hain explains. Frequent hardware updates alone require a “departmental distro that runs Yum every day and maintains packages”.
So, it seems to me, the unusual suspects that matter most in open-source markets aren't only the tools, but the independent spirits that create and use them.
Resources for this article: www.linuxjournal.com/article/7806.