Plug and Crunch: WhiteCross' Linux Story
This brings us to the Personalization Summit in New York (April 2-4, 2001), where I was on a panel about “The Future of Personalization” (briefly, I said there wasn't much hope, but that's another story). Sitting near me in the audience was John K. Thompson, VP Worldwide Marketing for a company called WhiteCross.
“You know,” he said, “we have the world's fastest Linux analytical platform. Interested?”
Well, yeah. So a couple hours later we met again, and an interview followed.
Doc: Where did WhiteCross come from?
John: We were founded in 1992 by guys who left Teradata. From the beginning we were in the business of business intelligence: crunching huge amounts of analytical data. So we built a platform that was about analytical speed, all-out speed. To do that we started by building a massively parallel computational system based on Lynx (now LynuxWorks). The hardware platform was built from pure commodity products: AMD 333MHz chips, IBM SSA drives, Ethernet interconnects.
Then we started to work on the next generation, called Lightning. It will scale up to 1.2GHz chips with bigger drives, but it's still rack-mounted and still fifty times the performance of the NCR and Sun Starfire boxes.
Doc: And this is where Linux came in?
John: Right. We decided the best way to do the new platform was to port the existing platform over to Linux. Which we did. We now run our ASP in our data centers in San Francisco and Bracknell, outside of London, and sell services on top of those. We offer outsourced turnkey analytical solutions.
Doc: Give me some more technical detail here.
John: The way this is written is all about analytics. There's an SQL interface via ODBC. The data exploration server is a two-tier system. The first layer is a Sun UNIX platform running Solaris. These are our communications processors that interface with external systems and take in data feeds (loads, streams). We can run these in parallel. Whether ASCII, EBCDIC or whatever, we convert the data into binary and load it into the massively parallel part of the system, the DES (data exploration server).
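As a rough illustration of the conversion step John describes, the sketch below decodes the same text feed from ASCII and from EBCDIC and packs it into fixed-width binary records. The record layout (a 4-byte customer ID plus an 8-byte float), the comma-separated input, and the choice of EBCDIC code page 037 are all assumptions made for the example; the interview does not describe WhiteCross's actual format.

```python
import struct

# Hypothetical record layout: unsigned 32-bit ID + 64-bit float, big-endian.
RECORD = struct.Struct(">Id")

def feed_to_binary(raw: bytes, encoding: str) -> bytes:
    """Decode a comma-separated text feed and pack each line into binary."""
    text = raw.decode(encoding)
    out = bytearray()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cust_id, amount = line.split(",")
        out += RECORD.pack(int(cust_id), float(amount))
    return bytes(out)

# The same two rows arriving as ASCII and as EBCDIC (code page 037, assumed).
rows = "1001,19.95\n1002,5.00\n"
ascii_feed = rows.encode("ascii")
ebcdic_feed = rows.encode("cp037")

binary_from_ascii = feed_to_binary(ascii_feed, "ascii")
binary_from_ebcdic = feed_to_binary(ebcdic_feed, "cp037")
```

Whatever the source encoding, the loader ends up with identical binary records, so everything downstream works with a single format.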
Doc: Which runs on Linux.
John: Right. This is what used to run on Lynx and now runs on Linux. We can set these up so they're always running in parallel.
Doc: What are the performance improvements over the old system?
John: At Freeserve, a large ISP in the UK, it took 28 hours to do their log files for daily reporting. We came in and set up the load process on a one-rack system. It loaded in 50 minutes.
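For scale, here is a quick check of the arithmetic behind those two figures. It uses only the numbers quoted in the interview and assumes they measure comparable work.

```python
# Figures quoted in the interview.
old_minutes = 28 * 60   # 28 hours for the original log processing
new_minutes = 50        # load time on the one-rack system

speedup = old_minutes / new_minutes
print(f"{speedup:.1f}x faster")  # 1680 / 50 = 33.6
```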
Doc: How much better does Linux perform than the old system?
John: We have a table where we compared the new and the old OSes. It lists about 105 operations. When we averaged them all out, Linux came out 20 times faster.
Doc: Not too sloppy.
John: What we do is all about performance. We wouldn't have made the change if the difference weren't so high. Between 85 and 95 of those operations were faster on Linux. Access to memory, disk reads, swapping cache, disk writes... Linux kicked butt on nearly all of them.
Doc: Tell me more about how your system works.
John: Look at it as a two-tiered system that is massively parallel and can grow with a client's requirements. One of the technical benefits is that a database made of homogeneous data is as easy to maintain and operate as one made of fifty different data sources. We're taking the entire database, bringing it to binary and analyzing the entire base every time we run against it. So rather than optimization, indexes and other performance-enhancing techniques that DBAs (database administrators) use, we just put in the data and run against all of it all of the time. Just snap in more disk and processor cards. Grow incrementally. The total cost of operation is marginal compared to a Sun or an Oracle implementation. We're not worried about user queries, usage patterns or load times, which are where DBAs spend most of their time. Their question is, “How do you load two terabytes of data in the batch window?” We answer that question. We have an implementation in the US where we're loading 85GB a day in real time, and the system isn't even straining.
From a customer perspective, people want to look at divergent sets of data: product, customer, usage, whatever. What's traditionally done is make them build different databases. That fragments the view of the organization and the customers. While we're loading so much data per day, we're taking a census of the activity of the customer, the services, the products they're using—with this unified data ensemble. We're enabling them to look at all the views of the business in one data set. They can look at customers, products, services, network capacity, pricing analysis, strategic planning on market entry, whatever—all on one data set.
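The "run against all of it all of the time" approach John describes can be pictured with a toy sketch: the data is split into partitions, every query scans every partition in full with no index, and the partial results are combined. The partition contents, field names, and function names below are hypothetical, and Python threads stand in for what would be separate processor cards in a real system.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows held by one disk card.
partitions = [
    [("alice", "dsl", 30), ("bob", "dialup", 12)],
    [("carol", "dsl", 45), ("dave", "dialup", 8)],
    [("erin", "dsl", 25)],
]

def scan_partition(rows, service):
    """Read every row in one partition; no index, just a full pass."""
    return sum(usage for _, svc, usage in rows if svc == service)

def total_usage(service):
    """Scan all partitions concurrently and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda rows: scan_partition(rows, service), partitions)
        return sum(partials)

dsl_total = total_usage("dsl")        # 30 + 45 + 25
dialup_total = total_usage("dialup")  # 12 + 8
```

In this picture, adding capacity means appending another partition and another worker. No query has to be rewritten and no index rebuilt.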
Doc: Is this a standalone system?
John: This goes in as a complement to other platforms. It adds scaling of data and analytical processing. It doesn't displace the data warehouse. We're talking about building a facility that allows simple reporting, multidimensional analysis, data mining and data exploration in a very active platform. If you look at most data warehouses, they're pretty static. We're offering them a dynamic facility. This dynamism is allowed—at virtually no cost whatsoever—by Linux.
Doc: So why Linux?
John: It's so flexible. People usually hear about Linux as a file, print or web server. We're asking it to do a similar thing: swapping memory, moving data on and off disk, parsing an SQL statement and a ton of computation. The OS does everything in the box. But the key advantage with Linux is that we can expand just by adding more racks. Plug and crunch.
Doc: It should be easier to manage as well.
John: It's very easy to monitor and administer. We monitor client systems constantly, sending alerts saying, for example, “You're going to go critical on disk.” This lets them give us the signal to add more cards. Then it's just a hardware cost. The system reconfigures itself and away it goes. It senses the presence of another card, decides if it's a processor or disk card, addresses it and carries on.
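A minimal sketch of the kind of disk alert John mentions might look like the following. The 90% threshold and the function name are assumptions for illustration, and this is not WhiteCross's monitoring code; `shutil.disk_usage` is the standard-library call for reading filesystem usage.

```python
import shutil

CRITICAL_FRACTION = 0.90  # assumed threshold; the interview does not give one

def disk_alert(used: int, total: int, threshold: float = CRITICAL_FRACTION):
    """Return a warning string when usage crosses the threshold, else None."""
    fraction = used / total
    if fraction >= threshold:
        return f"You're going to go critical on disk ({fraction:.0%} used)."
    return None

# Check the root filesystem of the machine running this script.
usage = shutil.disk_usage("/")
message = disk_alert(usage.used, usage.total)
if message:
    print(message)
```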
Doc: Essentially the OS falls away both as a cost and an issue.
John: Exactly. The only issue is hardware, and managing it is easy and routine. A lot of people think there's nothing new in hardware, but this is a new and simple way to add functionality by adding components that work together in a way nobody thought they would.
Doc: Can't beat the cost on the OS side, either.
John: Right. Go get a hot dog and some Linux. But the openness is a huge benefit too. What we're doing is an open systems approach engineered in a unique way. People accept Sun's Starfire and NCR's WorldMark servers as open, but it's not true. Linux is truly open. It's so ready, any time you want. When you've got commodity chips strapped together with commodity drives hooked together with fast Ethernet interconnects, then you want a commodity OS, and Linux is it.
Doc: Are you using a Linux distribution, or something from one of the embedded toolchain companies?
John: The Linux is all ours. We took the basic kernel and extended it with our own drivers, just like we had done with Lynx.
Doc Searls is senior editor of Linux Journal and a coauthor of The Cluetrain Manifesto.