The Large Hadron Collider

The following article is featured in the November issue of Linux Journal (#199).

Muons and mesons and quarks—oh my!  Never fear, Dorothy, the Large Hadron Collider and open-source software will save the day.

What is at the heart of the Large Hadron Collider (LHC) experiments? It should not surprise you that open-source software is one of the things that powers the most complex scientific endeavor humankind has ever attempted. I hope to give you a glimpse into how scientific computing embraces open-source software and the open-source philosophy in one of the LHC experiments.

The Tiered Computing Model

The LHC near Geneva, Switzerland, is nearly 100 meters underground and provides the highest-energy subatomic particle beams ever produced. The goal of the LHC is to give physicists a window into the universe immediately after the big bang. However, when physicists calculated the level of computing power needed to peer through that window, it became clear that it would not be possible to do it with only the computers that could fit under one roof.

Even with the promise of Moore's Law, it was apparent that the experiments would have to adopt grid technology and decentralize their computing. Part of the decentralization plan was a tiered model of computing that creates large data storage and analysis centers around the world.

The Compact Muon Solenoid (CMS) experiment is one of the large collider experiments located at the LHC. The primary computing resource for CMS is located at the LHC laboratory and is called the Tier-0. The function of the Tier-0 is to record data as it comes off the detector, archive it and transfer it to the Tier-1 facilities around the globe. Ideally, every participating CMS nation has one Tier-1 facility. In the United States, the Tier-1 is located at Fermi National Accelerator Laboratory (FNAL) in Batavia, Illinois. Each Tier-1 facility is charged with additional archival storage, as well as physics reconstruction and analysis, and with transferring data to the Tier-2 centers. The Tier-2 centers serve as an analysis resource for physicists, funded by CMS. Individuals and universities are free to construct Tier-3 sites, which are not paid for through CMS.
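
To make that hierarchy concrete, here is a small Python sketch of the data flow; the site names, roles and fan-out are illustrative placeholders rather than real CMS configuration:

# A toy model of the tiered data flow described above. Site names and
# roles are illustrative placeholders, not real CMS configuration.
TIERS = {
    "Tier-0 (CERN)": {
        "role": "record, archive and export raw detector data",
        "feeds": ["Tier-1 (FNAL)"],
    },
    "Tier-1 (FNAL)": {
        "role": "archival storage, reconstruction, distribution to Tier-2s",
        "feeds": ["Tier-2 (Nebraska)"],
    },
    "Tier-2 (Nebraska)": {
        "role": "CMS-funded analysis resource for physicists",
        "feeds": ["Tier-3 (university groups)"],
    },
    "Tier-3 (university groups)": {
        "role": "locally funded analysis sites",
        "feeds": [],
    },
}

def print_flow(site, depth=0):
    """Walk the hierarchy and print how data fans out from a site."""
    entry = TIERS[site]
    print("  " * depth + f"{site}: {entry['role']}")
    for downstream in entry["feeds"]:
        print_flow(downstream, depth + 1)

print_flow("Tier-0 (CERN)")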

Currently, there are eight CMS Tier-2 centers in the US. Their locations at universities allow CMS to utilize the computing expertise at those institutions and contribute to the educational opportunities for their students. I work as a system administrator at the CMS Tier-2 facility at the University of Nebraska-Lincoln.

By most standards, the Tier-2 centers are large computing resources. Currently, the capabilities of the Tier-2 at Nebraska include approximately 300 servers with 1,500 CPU cores dedicated to computing, along with more than 800 terabytes of disk storage. We have 10-gigabit-per-second network connectivity to the Tier-1 at FNAL.
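
To put those numbers in perspective, here is a quick back-of-envelope calculation (my arithmetic, ignoring protocol overhead and competing traffic):

# What a 10-gigabit link means for an 800-terabyte site, using the
# figures quoted above and ignoring all real-world overhead.
link_gbps = 10                           # network link to FNAL, Gb/s
disk_tb = 800                            # local disk storage, TB

link_bytes_per_s = link_gbps * 1e9 / 8   # ~1.25 GB/s at theoretical line rate
seconds = disk_tb * 1e12 / link_bytes_per_s

print(f"Refilling {disk_tb} TB at line rate would take ~{seconds / 86400:.1f} days")
# -> roughly 7.4 days of continuous transfer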

LHC? CMS? ATLAS? I'm Confused.

It is easy to lose track of the entities in the high-energy physics world. The Large Hadron Collider (LHC) is the accelerator that provides the beams of high-energy particles, which are protons. It is located at CERN (Conseil Européen pour la Recherche Nucléaire). CERN is the laboratory, and the LHC is the machine. The Compact Muon Solenoid (CMS) is a large particle detector designed to record what particles are created by the collision of the beams of protons (a muon is an elementary particle similar to the electron). CMS is only one of the experiments at CERN. CMS also is used to refer to the large collaboration of scientists that analyze the data recorded from the CMS detector. Most American physicists participating in CMS are in an organization called USCMS. Other experiments at the LHC include ATLAS, ALICE, LHCb, TOTEM and LHCf. These experiments use their own analysis software but share some grid infrastructures with CMS.

Data Movement

One of the more technically difficult obstacles for CMS computing is managing the data. Data movement is managed using a custom framework called PhEDEx (Physics Experiment Data Export). PhEDEx does not actually move data but serves as a mechanism to initiate transfers between sites. PhEDEx agents running at each site interface with database servers located at CERN to determine what data are needed at that site. X.509 proxy certificates are used to authenticate transfers between GridFTP doors at the source and destination sites. The Tier-2 at Nebraska has 12 GridFTP doors and has sustained transfer rates of up to 800 megabytes per second.
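
PhEDEx itself is a large system, but the flavor of transfer its agents trigger can be sketched. The listing below is only an illustration, not PhEDEx: it assumes the Globus GridFTP client (globus-url-copy) and a valid grid proxy are available, and the host names, paths and stream count are invented.

# A minimal sketch of a proxy-authenticated GridFTP copy of the kind a
# PhEDEx agent initiates. This is not PhEDEx; the endpoints are invented.
import os
import subprocess

# Globus tools locate the X.509 proxy certificate via X509_USER_PROXY.
os.environ.setdefault("X509_USER_PROXY", "/tmp/x509up_u1000")

src = "gsiftp://gridftp.tier1.example.gov/store/data/example.root"
dst = "gsiftp://gridftp.tier2.example.edu/store/data/example.root"

# globus-url-copy is the standard GridFTP client; -p asks for several
# parallel TCP streams, which helps throughput on long, fat WAN links.
subprocess.run(["globus-url-copy", "-p", "4", src, dst], check=True)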

It should be noted that the word data can mean a few different things to a physicist. It can refer to the digitized readouts of a detector, the outputs of Monte Carlo simulations, or the bits and bytes stored on hard drives and tapes.

The network demands made by the Nebraska Tier-2 site have generated interesting research in computer network engineering. Nebraska was the first university to demonstrate large data movement over a dynamically allocated IP path. When Nebraska's Tier-2 is pulling a large amount of data from the Tier-1 at FNAL, a separate IP path automatically is constructed to prevent traffic from adversely affecting the university's general Internet usage.

Since data transfer and management are so crucial to the success of CMS, development of the underlying systems has been ongoing for years. The transfer volume of Monte Carlo samples and real physics data already has surpassed 54 petabytes worldwide. Nebraska alone has downloaded 900 terabytes during the past calendar year. All this data movement has been done with commodity servers running open-source software.
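
To get a feel for what 900 terabytes in a calendar year means as a sustained rate, here is a rough calculation (my own arithmetic, not a quoted figure):

# Average inbound rate implied by 900 TB downloaded in one year.
tb_downloaded = 900
seconds_per_year = 365 * 24 * 3600            # ~3.15e7 seconds

avg_mb_per_s = tb_downloaded * 1e12 / seconds_per_year / 1e6
print(f"Average inbound rate: ~{avg_mb_per_s:.0f} MB/s, around the clock")
# -> about 29 MB/s sustained, with bursts up to the 800 MB/s peak above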

______________________

Comments

Great article!

drokmed:

Thank you for that article. I love reading about the LHC, and to cover Linux too? Cool!

Ah, the LHC is so cool. This review wouldn't be complete if we didn't mention the LHC music video:

The Large Hadron Rap:

http://www.youtube.com/watch?v=j50ZssEojtM

The LHC rocks!

graphs

Bob Lounsbury:

Loved the article. Very interesting to see some hardcore usage of Linux.

I have a simple question ... What program was used to create the bar charts in the article?

re: graphs

clundst:

I did not generate the graphs by hand; they are generated by a service called 'gratia'. Gratia is the accounting system used within OSG. IIRC, the bulk of the graph making is done by 'graphtool'.

http://t2.unl.edu/documentation/graphtool.

Carl

Great read...

Anonymous:

However, this does raise the question: with all this FOSS software being used and modified to suit their needs, are they also contributing back their modifications for the good of all? I quote the following:

"Both CERN and FNAL have their own Linux distributions but add improvements and customizations into the Scientific Linux distribution. "

Apart from that, it is great that they are taking advantage of the best software available to them.

ROOT

nced:

On the data analysis end of things, CERN / Fermilab have developed ROOT, a C++-based data analysis framework. Whilst it mainly caters for the needs of particle physics data analysis, it is a very powerful framework, and I've heard of people in the insurance and finance industries using it for data mining. See root.cern.ch
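
For a taste of what ROOT code looks like, here's a minimal PyROOT sketch (assuming ROOT is built with its Python bindings; the histogram and numbers are purely illustrative):

# Fill and draw a toy histogram with PyROOT; values are illustrative.
import ROOT

# 1-D histogram: name, "title;x-axis;y-axis", bins, low edge, high edge.
h = ROOT.TH1F("mass", "Toy mass distribution;mass [GeV];events", 100, 0.0, 10.0)

# Fill with Gaussian-distributed toy data.
for _ in range(10000):
    h.Fill(ROOT.gRandom.Gaus(5.0, 1.0))

# Draw to a canvas and save an image.
c = ROOT.TCanvas("c", "toy", 800, 600)
h.Draw()
c.SaveAs("toy_mass.png")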

Re: Great Read....

clundst:

"are they also contributing back their modifications for the good of all..?"

A good question, but one I didn't think to address more clearly in the article. We certainly try to give back where we can. The experiments employ numerous developers, and they, when allowed by their campus/lab MOU, pass patches back upstream. I also know of numerous patches Nebraska has submitted to Hadoop, and Caltech has worked hard on packaging HDFS as an RPM (although I don't personally know if the Hadoop project adopts them). (IIRC, my counterpart at Caltech is on the Fedora RPM team. Don't quote me on that.)

The network infrastructure that's being built and researched for the LHC almost has to benefit everyone. The connectivity between North America and Europe has improved at a much faster rate due to the demands of the LHC.

Our community tries its best to give back to the code bases we use, in the form of patches and improvements, as much as we can. I don't know if we could/should do more, but I'm personally satisfied that we aren't exploiting FOSS.

Cheers.

CERN

Anonymous:

I used to work at Brookhaven National Laboratory and can explain the situation further. First of all, most Linux machines in the LHC and ATLAS clusters run Scientific Linux.
http://en.wikipedia.org/wiki/Scientific_linux
It's a CentOS-based derivative created by CERN. There are MANY, MANY open-source tools in use at almost ALL labs. Argonne actually created a tool called big config (http://en.wikipedia.org/wiki/Bcfg2), and JLAB has also written and open sourced much software. Condor is open-source queue-management software used for cluster job submission. It is also used here, and patches are submitted upstream.

Actually, Scientific Linux is

Anonymous:

Actually, Scientific Linux is Red Hat-based and is created/supported jointly by Fermilab and CERN.

Well, they invented the world

Anonymous:

Well, they invented the World Wide Web, for one ;-) http://en.wikipedia.org/wiki/Www#History

Excellent Article!

Dan Haworth:

Thoroughly enjoyed reading through this; thank you for the glimpse into the goings-on at the LHC and its experiments. Nice to see OSS being used in such an important and exciting experiment :)

open HPC software stack

Gavin:

Great read! I'm excited to see the tools used and the choices made on such a data-intensive project. I'm wrestling with many of these same issues on a smaller scale, so it's nice to see just how far these tools will scale.

I'll be honest with you...

JShuford:

After reading this, I have come to a decision: I believe that man has begun to think too much!

...I'm not just a "troll", but also a subscriber!

think too much

Mr. Crankypants:

Hey J, we're all Devo!
