Plug and Crunch: WhiteCross' Linux Story
This brings us to the Personalization Summit in New York (April 2-4, 2001), where I was on a panel about “The Future of Personalization” (briefly, I said there wasn't much hope, but that's another story). Sitting near me in the audience was John K. Thompson, VP Worldwide Marketing for a company called WhiteCross.
“You know,” he said, “we have the world's fastest Linux analytical platform. Interested?”
Well, yeah. So a couple hours later we met again, and an interview followed.
Doc: Where did WhiteCross come from?
John: We were founded in 1992 by guys who left Teradata. From the beginning we were in the business of business intelligence: crunching huge amounts of analytical data. So we built a platform that was about analytical speed, all-out speed. To do that we started by building a massively parallel computational system based on Lynx (now LynuxWorks). The hardware platform was built from pure commodity products: AMD 333MHz chips, IBM SSA drives, Ethernet interconnects.
Then we started to work on the next generation, called Lightning. It will scale up to 1.2GHz chips with bigger drives, but it's still rack-mounted and still fifty times the performance of the NCR and Sun Starfire boxes.
Doc: And this is where Linux came in?
John: Right. We decided the best way to do the new platform was to port the existing platform over to Linux. Which we did. We now run our ASP in our data centers in San Francisco and Bracknell, outside of London, and sell services on top of those. We offer outsourced turnkey analytical solutions.
Doc: Give me some more technical detail here.
John: The whole thing is written for analytics, with a SQL interface via ODBC. The data exploration server is a two-tier system. The first tier is a Sun UNIX platform running Solaris. These are our communications processors, which interface with external systems and take in data feeds: loads, streams. We can run these in parallel. Whether the data is ASCII, EBCDIC or whatever, we convert it into binary and load it into the massively parallel part of the system, the DES (data exploration server).
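(Editor's aside: the conversion step John describes can be pictured with a minimal Python sketch. It is not WhiteCross code. The fixed-width record layout, the field names and the choice of the `cp037` EBCDIC code page are all assumptions made for illustration.)

```python
import struct

# Hypothetical packed layout: 8 raw bytes of customer id + unsigned 32-bit count.
RECORD = struct.Struct("<8sI")

def to_binary(raw: bytes, encoding: str) -> bytes:
    """Decode one fixed-width text record (ASCII or EBCDIC) and pack it into binary."""
    text = raw.decode(encoding)
    customer_id = text[:8].encode("ascii")   # first 8 characters: id (assumed layout)
    byte_count = int(text[8:14])             # next 6 characters: a decimal count
    return RECORD.pack(customer_id, byte_count)

ascii_record = b"CUST0001001024"
ebcdic_record = "CUST0001001024".encode("cp037")  # cp037 is one common EBCDIC code page

# Both feeds end up as the same binary record, whatever encoding they arrived in.
same = to_binary(ascii_record, "ascii") == to_binary(ebcdic_record, "cp037")
```

Once every feed is normalized this way, the downstream tier only ever deals with one record format.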
Doc: Which runs on Linux.
John: Right. This is what used to run on Lynx and now runs on Linux. We can set these up so they're always running in parallel.
Doc: What are the performance improvements over the old system?
John: At Freeserve, a large ISP in the UK, it took 28 hours to do their log files for daily reporting. We came in and set up the load process on a one-rack system. It loaded in 50 minutes.
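(Editor's aside: taking the two quoted figures at face value, the implied speedup is simple arithmetic. The numbers below come straight from John's answer and nothing else is measured.)

```python
old_minutes = 28 * 60   # 28 hours for the old daily log processing, as quoted
new_minutes = 50        # load time on the one-rack system, as quoted

speedup = old_minutes / new_minutes
print(f"{speedup:.1f}x faster")   # prints "33.6x faster"
```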
Doc: How much better does Linux perform than the old system?
John: We have a table where we compared the new and the old OSes. It lists about 105 operations. When we averaged them all out, Linux came out 20 times faster.
Doc: Not too sloppy.
John: What we do is all about performance. We wouldn't have made the change if the difference wasn't so high. Between 85 and 95 operations were faster on Linux. Access to memory, disk reads, swapping cache, disk writes...Linux kicked butt on nearly all of them.
Doc: Tell me more about how your system works.
John: Look at it as a two-tiered system that is massively parallel and can grow with any client requirements. One of the technical benefits is that a database made of homogeneous data is as easy to maintain and operate as one made of fifty different data sources. We're taking the entire database, bringing it to binary and analyzing the entire base every time we run against it. So rather than optimization, indexes and other performance enhancing techniques that DBAs (database administrators) use, we just put in the data and run against all of it all of the time. Just snap in more disk and processor cards. Grow incrementally. The total cost of operation is marginal compared to a Sun or an Oracle implementation. We're not worried about user queries, usage patterns or load times, which are where DBAs spend most of their time. Their question is, “How do you load two terabytes of data in the batch window?” We answer that question. We have an implementation in the US where we're loading 85GB a day in real time, and the system isn't even straining.
From a customer perspective, people want to look at divergent sets of data: product, customer, usage, whatever. What's traditionally done is make them build different databases. That fragments the view of the organization and the customers. While we're loading so much data per day, we're taking a census of the activity of the customer, the services, the products they're using—with this unified data ensemble. We're enabling them to look at all the views of the business in one data set. They can look at customers, products, services, network capacity, pricing analysis, strategic planning on market entry, whatever—all on one data set.
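(Editor's aside: the "run against all of it all of the time" approach amounts to a parallel full scan. Every partition is read in full and the partial results are combined, with no index to build or maintain. The Python sketch below shows only the shape of the idea. The rows, the three partitions and the function names are hypothetical, and Python threads stand in for what would be separate processor cards.)

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (customer, product, minutes_used), split across three partitions.
PARTITIONS = [
    [("ann", "dsl", 120), ("bob", "dialup", 30)],
    [("cat", "dsl", 45), ("dan", "dsl", 300)],
    [("eve", "dialup", 10)],
]

def scan_partition(rows, predicate):
    """Read every row in one partition; no index is consulted."""
    return sum(minutes for _, product, minutes in rows if predicate(product))

def total_minutes(partitions, predicate):
    """Scan all partitions concurrently and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda rows: scan_partition(rows, predicate), partitions)
        return sum(partials)

dsl_minutes = total_minutes(PARTITIONS, lambda product: product == "dsl")
print(dsl_minutes)   # prints 465
```

Adding capacity in this model means adding another partition and another worker. The query itself does not change, which is the property John is pointing at.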
Doc: Is this a standalone system?
John: This goes in as a complement to other platforms. It adds scaling of data and analytical processing. It doesn't displace the data warehouse. We're talking about building a facility that allows simple reporting, multidimensional analysis, data mining and data exploration in a very active platform. If you look at most data warehouses, they're pretty static. We're offering them a dynamic facility. This dynamism is allowed—at virtually no cost whatsoever—by Linux.
Doc: So why Linux?
John: It's so flexible. People usually hear about Linux as a file, print or web server. We're asking it to do a similar thing. Swapping memory, moving data on and off disk, parsing an SQL statement and a ton of computation. The OS does everything in the box. But the key advantage with Linux is that we can expand just by adding more racks. Plug and crunch.
Doc: It should be easier to manage as well.
John: It's very easy to monitor and administer. We monitor client systems constantly, sending alerts saying, for example, “You're going to go critical on disk.” This lets them give us the signal to add more cards. Then it's just a hardware cost. The system reconfigures itself and away it goes. It senses the presence of another card, decides if it's a processor or disk card, addresses it and carries on.
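(Editor's aside: a disk alert of the kind John mentions can be as simple as a threshold check. This sketch is illustrative only. The 90% threshold, the `disk_alert` name and the "/" mount point are assumptions, not details of WhiteCross' monitoring.)

```python
import shutil

def disk_alert(used_bytes, total_bytes, threshold=0.90):
    """Return an alert string when usage crosses the threshold, else None."""
    fraction = used_bytes / total_bytes
    if fraction >= threshold:
        return f"Disk {fraction:.0%} full: you're going to go critical on disk."
    return None

usage = shutil.disk_usage("/")                  # standard-library call; "/" is assumed
message = disk_alert(usage.used, usage.total)   # None unless the disk is nearly full
```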
Doc: Essentially the OS falls away both as a cost and an issue.
John: Exactly. The only issue is hardware, and managing it is easy and routine. A lot of people think there's nothing new in hardware, but this is a new and simple way to add functionality by adding components that work together in a way nobody thought they would.
Doc: Can't beat the cost on the OS side, either.
John: Right. Go get a hot dog and some Linux. But the openness is a huge benefit too. What we're doing is an open systems approach engineered in a unique way. People accept Sun's Starfire and NCR's WorldMark servers as open, but it's not true. Linux is truly open. It's so ready, any time you want. When you've got commodity chips strapped together with commodity drives hooked together with fast Ethernet interconnects, then you want a commodity OS, and Linux is it.
Doc: Are you using a Linux distribution, or something from one of the embedded toolchain companies?
John: The Linux is all ours. We took the basic kernel and extended it with our own drivers, just like we had done with Lynx.
Doc Searls is senior editor of Linux Journal and a coauthor of The Cluetrain Manifesto.