Plug and Crunch: WhiteCross' Linux Story


by Doc Searls

on March 5, 2002

This brings us to the Personalization Summit in New York (April 2-4, 2001), where I was on a panel about “The Future of Personalization” (briefly, I said there wasn't much hope, but that's another story). Sitting near me in the audience was John K. Thompson, VP Worldwide Marketing for a company called WhiteCross.

“You know”, he said, “we have the world's fastest Linux analytical platform. Interested?”

Well, yeah. So a couple hours later we met again, and an interview followed.

Doc: Where did WhiteCross come from?

John: We were founded in 1992 by guys who left Teradata. From the beginning we were in the business of business intelligence: crunching huge amounts of analytical data. So we built a platform that was about analytical speed, all-out speed. To do that we started by building a massively parallel computational system based on Lynx (now LynuxWorks). The hardware platform was built from pure commodity products: AMD 333MHz chips, IBM SSA drives, Ethernet interconnects.

Then we started to work on the next generation, called Lightning. It will scale up to 1.2GHz chips with bigger drives, but it's still rack-mounted and still fifty times the performance of the NCR and Sun Starfire boxes.

Doc: And this is where Linux came in?

John: Right. We decided the best way to do the new platform was to port the existing platform over to Linux. Which we did. We now run our ASP in our data centers in San Francisco and Bracknell, outside of London, and sell services on top of those. We offer outsourced turnkey analytical solutions.

Doc: Give me some more technical detail here.

John: The way this is written is all about analytics. SQL interface via ODBC. The data exploration server is a two-tier architectured system. The first layer is a Sun UNIX platform running Solaris. These are our communications processors that interface with external systems and take in data feeds—loads, streams. We can run these in parallel. Whether ASCII, EBCDIC or whatever, we convert the data into binary and load it into the massively parallel part of the system, the DES (data exploration server).
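
As an editorial aside, the conversion step John describes can be pictured with a minimal sketch. This is not WhiteCross code: the comma-separated record layout, the field names and the `to_binary` helper are all assumptions made for illustration, and `cp037` is just one common EBCDIC code page.

```python
import struct

# Hypothetical fixed-width layout: customer id (unsigned 32-bit)
# followed by an amount in cents (signed 64-bit), little-endian.
RECORD = struct.Struct("<Iq")

def to_binary(raw: bytes, encoding: str) -> bytes:
    """Decode one comma-separated text record and pack it as binary."""
    text = raw.decode(encoding).strip()
    customer_id, amount_cents = text.split(",")
    return RECORD.pack(int(customer_id), int(amount_cents))

# The same logical record arriving in two different text encodings.
ascii_feed = "1001,2599\n".encode("ascii")
ebcdic_feed = "1001,2599\n".encode("cp037")

ascii_packed = to_binary(ascii_feed, "ascii")
ebcdic_packed = to_binary(ebcdic_feed, "cp037")
```

Whatever the source encoding, both feeds end up as the same 12-byte record, which is the point of normalizing to binary before the parallel tier sees the data.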

Doc: Which runs on Linux.

John: Right. This is what used to run on Lynx and now runs on Linux. We can set these up so they're always running in parallel.

Doc: What are the performance improvements over the old system?

John: At Freeserve, a large ISP in the UK, it took 28 hours to do their log files for daily reporting. We came in and set up the load process on a one-rack system. It loaded in 50 minutes.
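
For readers who want the ratio behind those two figures, here is the arithmetic, using only the numbers quoted above:

```python
old_minutes = 28 * 60   # 28 hours for the original daily log processing
new_minutes = 50        # load time quoted for the one-rack system

speedup = old_minutes / new_minutes
print(f"{speedup:.1f}x faster")  # prints "33.6x faster"
```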

Doc: How much better does Linux perform than the old system?

John: We have a table where we compared the new and the old OSes. It lists about 105 operations. When we averaged them all out, Linux came out 20 times faster.

Doc: Not too sloppy.

John: What we do is all about performance. We wouldn't have made the change if the difference wasn't so high. Between 85 and 95 operations were faster on Linux. Access to memory, disk reads, swapping cache, disk writes...Linux kicked butt on nearly all of them.

Doc: Tell me more about how your system works.

John: Look at it as a two-tiered system that is massively parallel and can grow with any client requirements. One of the technical benefits is that a database made of homogeneous data is as easy to maintain and operate as one made of fifty different data sources. We're taking the entire database, bringing it to binary and analyzing the entire base every time we run against it. So rather than optimization, indexes and other performance enhancing techniques that DBAs (database administrators) use, we just put in the data and run against all of it all of the time. Just snap in more disk and processor cards. Grow incrementally. The total cost of operation is marginal compared to a Sun or an Oracle implementation. We're not worried about user queries, usage patterns or load times, which are where DBAs spend most of their time. Their question is, “How do you load two terabytes of data in the batch window?” We answer that question. We have an implementation in the US where we're loading 85GB a day in real time, and the system isn't even straining.

From a customer perspective, people want to look at divergent sets of data: product, customer, usage, whatever. What's traditionally done is make them build different databases. That fragments the view of the organization and the customers. While we're loading so much data per day, we're taking a census of the activity of the customer, the services, the products they're using—with this unified data ensemble. We're enabling them to look at all the views of the business in one data set. They can look at customers, products, services, network capacity, pricing analysis, strategic planning on market entry, whatever—all on one data set.
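
The "run against all of it all of the time" approach can be sketched roughly as follows. This is an illustration of the general idea only, under stated assumptions: the partitions, the `(region, amount)` row shape and both function names are hypothetical, and Python threads stand in for what would be separate processor cards (they show the structure, not real parallel speed).

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows, predicate):
    """Scan every row in one partition; no index is consulted."""
    return sum(amount for region, amount in rows if predicate(region))

def parallel_total(partitions, predicate):
    """Run the same full scan on each partition and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda rows: scan_partition(rows, predicate), partitions)
        return sum(partials)

# Three hypothetical partitions, one per disk/processor pair.
partitions = [
    [("UK", 10), ("US", 5)],
    [("UK", 7), ("DE", 3)],
    [("US", 2), ("UK", 1)],
]
uk_total = parallel_total(partitions, lambda region: region == "UK")
```

Because every query touches every row, adding a partition adds both storage and scanning capacity at once, and there is no index to design or rebuild when the questions change.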

Doc: Is this a standalone system?

John: This goes in as a complement to other platforms. It adds scaling of data and analytical processing. It doesn't displace the data warehouse. We're talking about building a facility that allows simple reporting, multidimensional analysis, data mining and data exploration in a very active platform. If you look at most data warehouses, they're pretty static. We're offering them a dynamic facility. This dynamism is allowed—at virtually no cost whatsoever—by Linux.

Doc: So why Linux?

John: It's so flexible. People usually hear about Linux as a file, print or web server. We're asking it to do a similar thing. Swapping memory, moving data on and off disk, parsing an SQL statement and a ton of computation. The OS does everything in the box. But the key advantage with Linux is that we can expand just by adding more racks. Plug and crunch.

Doc: It should be easier to manage as well.

John: It's very easy to monitor and administer. We monitor client systems constantly, sending alerts saying, for example, “You're going to go critical on disk.” This lets them give us the signal to add more cards. Then it's just a hardware cost. The system reconfigures itself and away it goes. It senses the presence of another card, decides if it's a processor or disk card, addresses it and carries on.
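
A capacity alert of the kind John mentions might look something like this minimal sketch. The `disk_alert` function and the 90 percent threshold are assumptions for illustration, not details of the WhiteCross monitoring system.

```python
def disk_alert(used_bytes, total_bytes, threshold=0.90):
    """Return a warning string once usage reaches the threshold, else None."""
    fraction = used_bytes / total_bytes
    if fraction >= threshold:
        return f"You're going to go critical on disk ({fraction:.0%} used)."
    return None

quiet = disk_alert(50, 100)    # below the threshold, so no alert
warning = disk_alert(95, 100)  # at 95% used, an alert is produced
```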

Doc: Essentially the OS falls away both as a cost and an issue.

John: Exactly. The only issue is hardware, and managing it is easy and routine. A lot of people think there's nothing new in hardware, but this is a new and simple way to add functionality by adding components that work together in a way nobody thought they would.

Doc: Can't beat the cost on the OS side, either.

John: Right. Go get a hot dog and some Linux. But the openness is a huge benefit too. What we're doing is an open systems approach engineered in a unique way. People accept Sun's Starfire and NCR's WorldMark servers as open, but it's not true. Linux is truly open. It's so ready, any time you want. When you've got commodity chips strapped together with commodity drives hooked together with fast Ethernet interconnects, then you want a commodity OS, and Linux is it.

Doc: Are you using a Linux distribution, or something from one of the embedded toolchain companies?

John: The Linux is all ours. We took the basic kernel and extended it with our own drivers, just like we had done with Lynx.

Doc Searls is senior editor of Linux Journal and a coauthor of The Cluetrain Manifesto.
