System Administration of the IBM Watson Supercomputer

Configuration Management of the Watson Cluster

CSM is IBM's proprietary Cluster Systems Management software (http://www-03.ibm.com/systems/software/csm). It is intended to simplify administration of a cluster and includes parallel execution capability for high-volume pushes:

[CSM is] designed for simple, low-cost management of distributed and clustered IBM Power Systems in technical and commercial computing environments. CSM, included with the IBM Power Systems high-performance computer solutions, dramatically simplifies administration of a cluster by providing management from a single point-of-control....In addition to providing all the key functions for administration and maintenance of typical distributed systems, CSM is designed to deliver the parallel execution required to manage clustered computing environments effectively.

xCAT also originated at IBM and was open-sourced in 2007. The name stands for "Extreme Cloud Administration Toolkit", and the project's logo is a cat skull and crossbones. It now lives at http://xcat.sourceforge.net, which describes it as follows:

  • Provision operating systems on physical or virtual machines: SLES10 SP2 and higher, SLES 11 (incl. SP1), RHEL5.x, RHEL 6, CentOS4.x, CentOS5.x, SL 5.5, Fedora 8-14, AIX 6.1, 7.1 (all available technology levels), Windows 2008, Windows 7, VMware, KVM, PowerVM and zVM.

  • Scripted install, stateless, satellite, iSCSI or cloning.

  • Remotely manage systems: integrated lights-out management, remote console, and distributed shell support.

  • Quickly set up and control management node services: DNS, HTTP, DHCP and TFTP.

xCAT offers complete and ideal management for HPC clusters, render farms, grids, WebFarms, on-line gaming infrastructure, clouds, data centers, and whatever tomorrow's buzzwords may be. It is agile, extendible and based on years of system administration best practices and experience.

xCAT grew out of a need to rapidly provision IBM x86-based machines and has been actively developed since 1999. xCAT is now ten years old and continues to evolve.

AT: xCAT sounds like an installation system rather than a change management system. Did you use an SSH-based "push" model to push out changes to your systems?

EE: xCAT has very powerful push features, including a multithreaded push that interacts with different machines in parallel. It handles OS patches, upgrades and more.
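
To make the idea of a multithreaded push concrete, here is a minimal sketch in Python. It is not xCAT's code or its command-line interface; the function names (build_ssh_command, run_over_ssh, parallel_push) and the worker count are assumptions made for illustration. It simply runs one command on many hosts at once over SSH, using a thread pool.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, remote_command):
    """Return the argv for running remote_command on host over SSH."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_over_ssh(host, remote_command):
    """Run one command on one host; return (exit_code, combined output)."""
    result = subprocess.run(
        build_ssh_command(host, remote_command),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


def parallel_push(hosts, remote_command, runner=run_over_ssh, max_workers=16):
    """Run remote_command on every host concurrently; return {host: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every host first so the commands overlap in time.
        futures = {
            host: pool.submit(runner, host, remote_command) for host in hosts
        }
        # Then wait for each result in turn.
        return {host: future.result() for host, future in futures.items()}
```

Calling parallel_push(["node01", "node02"], "uptime") would return one (exit code, output) pair per host; the host names here are placeholders. The runner parameter exists so the fan-out logic can be tried without a real cluster by passing a stand-in function.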

AT: What monitoring tool did you use and why? Did you have any cool visual models of Watson's physical or logical activity?

EE: The project used a home-grown cluster management system for development activities, which had its own monitor. It also incorporated Ganglia. This tool was the basis for managing about 1,500 cores.

The Watson game-playing system used UIMA-AS with a simple SSH-based process launcher. The emphasis there was on measuring every aspect of runtime performance in order to reduce the overall latency. Visualization of performance data was then done after the fact. UIMA-AS managed the work on thousands of cores.

What Is UIMA-AS?

UIMA (Unstructured Information Management Architecture) is the open-source framework that underlies Watson. It is a framework for analyzing a sea of data to discover vital facts: software takes unstructured data as input, turns it into structured data, and then analyzes and works with the structured data to produce useful results.

The analysis is "multi-modal", which means that many algorithms, of many different kinds, are employed. For example, Watson had one group of algorithms for generating hypotheses, using geo-spatial reasoning, temporal reasoning (drawing on its historical database), a pun engine and so on, and another group of algorithms for scoring and pruning those hypotheses to find the most likely answer.
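
The generate-then-score-and-prune pattern can be sketched in a few lines of Python. This is a toy illustration, not Watson's or UIMA's code: the function names, the letter-valued candidates and the choice to combine scores by a plain average are all assumptions, since the article does not say how Watson combined its scorers.

```python
def generate_candidates(clue, generators):
    """Collect candidate answers from every generator, skipping duplicates."""
    candidates = []
    for generate in generators:
        for candidate in generate(clue):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


def rank_candidates(clue, candidates, scorers, keep=2):
    """Average all scorers per candidate and keep only the best few."""
    scored = []
    for candidate in candidates:
        scores = [score(clue, candidate) for score in scorers]
        scored.append((sum(scores) / len(scores), candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]  # pruning: drop everything below the top `keep`


# Toy stand-ins for real hypothesis generators and evidence scorers.
def from_keywords(clue):
    return ["A", "B"]


def from_dates(clue):
    return ["B", "C"]


def first_scorer(clue, candidate):
    return {"A": 0.5, "B": 1.0, "C": 0.0}[candidate]


def second_scorer(clue, candidate):
    return {"A": 0.5, "B": 0.5, "C": 0.0}[candidate]


candidates = generate_candidates("toy clue", [from_keywords, from_dates])
print(candidates)  # ['A', 'B', 'C']
print(rank_candidates("toy clue", candidates, [first_scorer, second_scorer]))
# [(0.75, 'B'), (0.5, 'A')]
```

Because each generator and each scorer is independent of the others, the two loops are natural places to spread work across many cores.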

In a nutshell, this is Massively Parallel Probabilistic Evidence-Based Architecture. (The evidence comes from Watson's 400TB corpus of data.)

The "AS" stands for Asynchronous Scaleout. It is a scaling framework for UIMA: a way to run UIMA across many cores in parallel and so benefit from continuing advances in hardware. UIMA brings "thinking computers" a giant step closer.

To understand unstructured information, first let's look at structured information. Computers speak with each other using structured information. Sticking to structured information makes it easier to extract meaning from data. HTML and XML are examples of structured information. So is a CSV file. Structured information standards are maintained by OASIS at http://www.sgmlopen.org.

Unstructured information is much more fluid and free-form. Human communication uses unstructured information. Before UIMA, computers were unable to make sense of unstructured information. Examples of unstructured information include audio (music), e-mails, medical records, technical reports, blogs, books and speech.
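
As a very small illustration of turning unstructured text into structured data, the following Python sketch scans free-form text and emits records with a type and character offsets. It does not use UIMA, and real language analysis is far deeper than pattern matching; the PATTERNS table, the annotate function and the sample sentence are invented for this example.

```python
import re

# Hypothetical annotator table: annotation type -> pattern that finds it.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def annotate(text):
    """Return structured records for the spans each pattern finds in text."""
    annotations = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            annotations.append({
                "type": kind,
                "begin": match.start(),  # offset of the first character
                "end": match.end(),      # offset just past the last character
                "text": match.group(),
            })
    # Order the records by where they occur in the text.
    return sorted(annotations, key=lambda a: a["begin"])


note = "Send the report to ops@example.com before 2011-02-14 please"
for record in annotate(note):
    print(record)
```

The input is an ordinary sentence; the output is a list of typed records that a program can filter, count or join with other data, which is the sense in which the text has become structured.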

UIMA was originally an internal IBM Research project. It is a framework for creating applications that do deep analysis of natural human language text and speech.

In Watson, UIMA managed the work on nearly 3,000 cores. Incidentally, Watson could run on a single core, but it would take six hours to answer a question. With 3,000 cores, that time is cut to 2–6 seconds. Watson really takes advantage of massively parallel architecture to speed up its processing.
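
A quick back-of-the-envelope check of those figures, written as a short Python sketch. The only inputs are the round numbers quoted above (six hours, 3,000 cores, 2–6 seconds); the function names are made up for the example, and perfectly even division of work is an idealizing assumption.

```python
def ideal_parallel_seconds(single_core_seconds, cores):
    """Time per question if the work divided perfectly evenly across cores."""
    return single_core_seconds / cores


def implied_speedup(single_core_seconds, parallel_seconds):
    """How many times faster the parallel run is than the single-core run."""
    return single_core_seconds / parallel_seconds


SINGLE_CORE_SECONDS = 6 * 60 * 60  # "six hours" from the text = 21,600 s
CORES = 3000                       # "3,000 cores" from the text

print(ideal_parallel_seconds(SINGLE_CORE_SECONDS, CORES))  # 7.2
print(implied_speedup(SINGLE_CORE_SECONDS, 6))             # 3600.0
print(implied_speedup(SINGLE_CORE_SECONDS, 2))             # 10800.0
```

Perfect scaling would give roughly seven seconds per question, the same order of magnitude as the quoted 2–6 seconds. The implied speedups are larger than the core count, which suggests the quoted figures are rounded, so this is best read as a sanity check rather than a measurement.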

AT: What were the most useful system administration tools for you in handling Watson and why?

EE: clusterSSH (http://sourceforge.net/apps/mediawiki/clusterssh) was quite useful. That and simple shell scripts with SSH did most of the work.

AT: How did you handle upgrading Watson software? SSH in, shut down the service, update the package, start the service? Or?

EE: Right, the Watson application is just restarted to pick up changes.

AT: How did you handle packaging of Watson software?

EE: The Watson game player was never packaged up to be delivered elsewhere.

AT: How many sysadmins do you have handling how many servers? You mentioned there were hundreds of operating system instances—could you be more specific? (How many humans and how many servers?) Is there actually a dedicated system administration staff, or do some of the researchers wear the system administrator hat along with their researcher duties?

EE: We have on the order of 800 OS instances. After four years we finally hired a sysadmin; before that, it was a part-time job for each of three researchers with root access.

AT: Regarding your monitoring system, how did you output the system status?

EE: We are not a production shop. If the cluster has a problem, only our colleagues complain.

What's Next?

IBM wants to make DeepQA useful, not just entertaining. Possible fields of application include healthcare, life sciences, tech support, enterprise knowledge management and business intelligence, government, improved information sharing and security.

Resources

IBM's Watson Site—"What is Watson?", "Building Watson" and "Watson for a Smarter Planet": http://ibmwatson.com

IBM's DeepQA Project: http://www.research.ibm.com/deepqa/deepqa.shtml

Eddie Epstein's IBM Researcher Profile: http://researcher.ibm.com/view.php?person=us-eae

Wikipedia Article on Watson: http://en.wikipedia.org/wiki/Watson_%28computer%29

Apache UIMA: http://uima.apache.org

______________________

Aleksey Tsalolikhin has been a UNIX/Linux system administrator for 14 years.

