The Large Hadron Collider
The following article is featured in the November issue of Linux Journal (#199).
Muons and mesons and quarks—oh my! Never fear, Dorothy, the Large Hadron Collider and open-source software will save the day.
What is at the heart of the Large Hadron Collider (LHC) experiments? It should not surprise you that open-source software is one of the things that powers the most complex scientific human endeavor ever attempted. I hope to give you a glimpse into how scientific computing embraces open-source software and the open-source philosophy in one of the LHC experiments.
The Tiered Computing Model
The LHC near Geneva, Switzerland, is nearly 100 meters underground and provides the highest-energy subatomic particle beams ever produced. The goal of the LHC is to give physicists a window into the universe immediately after the big bang. However, when physicists calculated the level of computing power needed to peer through that window, it became clear that it would not be possible to do it with only the computers that could fit under one roof.
Even with the promise of Moore's Law, it was apparent that the experiments would have to include a grid technology and decentralize the computing. Part of the decentralization plans included adoption of a tiered model of computing that creates large data storage and analysis centers around the world.
The Compact Muon Solenoid (CMS) experiment is one of the large collider experiments located at the LHC. The primary computing resource for CMS is located at the LHC laboratory and is called the Tier-0. The function of Tier-0 is to record data as it comes off the detector, archive it and transfer it to the Tier-1 facilities around the globe. Ideally, every participating CMS nation has one Tier-1 facility. In the United States, the Tier-1 is located at Fermi National Accelerator Laboratory (FNAL) in Batavia, Illinois. Each Tier-1 facility is charged with additional archival storage, physics reconstruction and analysis, and transferring data to the Tier-2 centers. The Tier-2 centers serve as an analysis resource funded by CMS for physicists. Individuals and universities are free to construct Tier-3 sites, which are not paid for through CMS.
Currently, there are eight CMS Tier-2 centers in the US. Their locations at universities allow CMS to utilize the computing expertise at those institutions and contribute to the educational opportunities for their students. I work as a system administrator at the CMS Tier-2 facility at the University of Nebraska-Lincoln.
By most standards, the Tier-2 centers are large computing resources. Currently, the capabilities of the Tier-2 at Nebraska include approximately 300 servers with 1,500 CPU cores dedicated to computing along with more than 800 terabytes of disk storage. We have network connectivity to the Tier-1 at FNAL of 10 gigabits per second.
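The tier hierarchy described in this section can be sketched as a small tree. This is only an illustration of the data path from Tier-0 to a Tier-2, not software used by CMS: the Site class and the transfer_path function are hypothetical names, and only the three sites named in the article are included.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Site:
    """One computing site in the tiered model (illustrative, not CMS code)."""
    name: str
    tier: int
    children: list[Site] = field(default_factory=list)


def transfer_path(root: Site, target: str) -> list[str]:
    """Return the chain of site names from root down to target, or [] if absent."""
    if root.name == target:
        return [root.name]
    for child in root.children:
        sub_path = transfer_path(child, target)
        if sub_path:
            return [root.name] + sub_path
    return []


# Only the sites named in the article; the real grid has many more.
nebraska = Site("Nebraska Tier-2", tier=2)
fnal = Site("FNAL Tier-1", tier=1, children=[nebraska])
cern = Site("CERN Tier-0", tier=0, children=[fnal])

print(" -> ".join(transfer_path(cern, "Nebraska Tier-2")))
```

The printed chain mirrors the flow in the text: data recorded at the Tier-0 is transferred to a Tier-1, which in turn transfers data to its Tier-2 centers.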
LHC? CMS? ATLAS? I'm Confused.
It is easy to lose track of the entities in the high-energy physics world. The Large Hadron Collider (LHC) is the accelerator that provides the beams of high-energy particles, which are protons. It is located at CERN (Conseil Européen pour la Recherche Nucléaire). CERN is the laboratory, and the LHC is the machine. The Compact Muon Solenoid (CMS) is a large particle detector designed to record the particles created by the collision of the beams of protons (a muon is an elementary particle similar to the electron). CMS is only one of the experiments at CERN. CMS also is used to refer to the large collaboration of scientists who analyze the data recorded from the CMS detector. Most American physicists participating in CMS are in an organization called USCMS. Other experiments at the LHC include ATLAS, ALICE, LHCb, TOTEM and LHCf. These experiments use their own analysis software but share some grid infrastructures with CMS.
Data Movement
One of the more technically difficult obstacles for CMS computing is managing the data. Data movement is managed using a custom framework called PhEDEx (Physics Experiment Data Export). PhEDEx does not actually move data but serves as a mechanism to initiate transfers between sites. PhEDEx agents running at each site interface with database servers located at CERN to determine what data are needed at that site. X.509 proxy certificates are used to authenticate transfers between GridFTP doors at the source and destination sites. The Tier-2 at Nebraska has 12 GridFTP doors and has sustained transfer rates up to 800 megabytes per second.
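The division of labor described above, where an agent works out what a site is missing and something else does the copying, can be sketched in a few lines. This is not PhEDEx code and does not use its interfaces: the plan_transfers function, the dataset names and the "first listed source" rule are all assumptions made for illustration.

```python
def plan_transfers(subscribed, on_disk, replicas):
    """Decide which datasets a site still needs and where to fetch each one.

    subscribed: set of dataset names the central database assigns to this site
    on_disk:    set of dataset names already stored locally
    replicas:   dict mapping dataset name -> list of sites holding a copy
    """
    plan = {}
    for dataset in sorted(subscribed - on_disk):
        sources = replicas.get(dataset, [])
        if sources:
            # Naive choice for the sketch: take the first listed source.
            plan[dataset] = sources[0]
    return plan


# Hypothetical dataset names, for illustration only.
wanted = {"/Example/RunA", "/Example/RunB", "/Example/MonteCarloC"}
local = {"/Example/RunA"}
known_replicas = {
    "/Example/RunB": ["FNAL"],
    "/Example/MonteCarloC": ["FNAL", "CERN"],
}

for dataset, source in plan_transfers(wanted, local, known_replicas).items():
    print(f"request {dataset} from {source}")
```

The function only produces a plan. In the arrangement the article describes, the actual bytes would then move between GridFTP doors, authenticated with X.509 proxy certificates.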
It should be noted that the word data can mean a few different things to a physicist. It can refer to the digitized readouts of a detector, the outputs of Monte Carlo simulations, or the bits and bytes stored on hard drives and tapes.
The network demands made by the Nebraska Tier-2 site have generated interesting research in computer network engineering. Nebraska was the first university to demonstrate large data movement over a dynamically allocated IP path. When Nebraska's Tier-2 is pulling a large amount of data from the Tier-1 at FNAL, a separate IP path automatically is constructed to prevent traffic from adversely affecting the university's general Internet usage.
Because data transfer and management are such crucial elements for the success of CMS, development of the underlying system has been ongoing for years. The transfer volume of Monte Carlo samples and real physics data already has surpassed 54 petabytes worldwide. Nebraska alone has downloaded 900 terabytes during the past calendar year. All this data movement has been done via commodity servers running open-source software.
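To put the figures quoted in this article side by side, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes), a 365-day year, and that the 800 megabytes per second peak is being compared against the 10 gigabit per second link to FNAL. The function names are made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def average_rate_mb_per_s(total_terabytes, seconds):
    """Average transfer rate in megabytes per second (decimal units)."""
    return total_terabytes * 1e12 / seconds / 1e6


def days_to_transfer(total_terabytes, rate_mb_per_s):
    """Days needed to move a volume at a constant rate."""
    return total_terabytes * 1e12 / (rate_mb_per_s * 1e6) / SECONDS_PER_DAY


def link_fraction(rate_mb_per_s, link_gbit_per_s):
    """Fraction of a link's capacity used by a given byte rate."""
    return rate_mb_per_s * 8 / (link_gbit_per_s * 1000)


yearly_average = average_rate_mb_per_s(900, SECONDS_PER_YEAR)
days_at_peak = days_to_transfer(900, 800)
peak_share = link_fraction(800, 10)

print(f"Average rate over a year: {yearly_average:.1f} MB/s")
print(f"900 TB at 800 MB/s:       {days_at_peak:.1f} days")
print(f"800 MB/s on a 10 Gb/s link: {peak_share:.0%} of capacity")
```

Under those assumptions, 900 terabytes in a year averages out to under 30 megabytes per second, while the sustained peak of 800 megabytes per second would move the same volume in about two weeks and fill roughly two-thirds of a 10 gigabit link. The gap between the two suggests that transfers arrive in large bursts rather than as a steady trickle.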
Comments
Great article!
Thank you for that article, I love reading about the LHC, and to cover linux too? Cool!
Ah, the LHC is so cool, this review wouldn't be complete if we didn't mention the LHC music video:
The Large Hadron Rap:
http://www.youtube.com/watch?v=j50ZssEojtM
The LHC rocks!
graphs
Loved the article. Very interesting to see some hardcore usage of Linux.
I have a simple question ... What program was used to create the bar charts in the article?
re: graphs
I did not generate the graphs by hand; they are generated by a service called 'gratia'. Gratia is the accounting system used within OSG. IIRC, the bulk of the graph making is done by 'graphtool'.
http://t2.unl.edu/documentation/graphtool.
Carl
Great read...
however.. this does beg the question.. with all this FOSS software being used.. and modified to suit their needs.. are they also contributing back their modifications for the good of all..? i quote the following..
"Both CERN and FNAL have their own Linux distributions but add improvements and customizations into the Scientific Linux distribution. "
apart from that.. it is great that they are taking advantage of the best software available to them..
ROOT
On the data analysis end of things, CERN / Fermilab have developed ROOT, a C++ based data analysis framework. Whilst it mainly caters for the needs of particle physics data analysis, it is a very powerful framework and I've heard of people in insurance and finance industries using it for data mining. See root.cern.ch
Re: Great Read....
"are they also contributing back their modifications for the good of all..?"
A good question, but one I didn't think to address more clearly in the article. We certainly try to give back where we can. The experiments employ numerous developers and they, when allowed by their campus/lab MOU, pass patches back upstream. I also know of numerous patches Nebraska has submitted to Hadoop, and Caltech has worked hard on packaging (RPM) HDFS (although I don't personally know if the Hadoop project adopts them). (IIRC, my counterpart at Caltech is on the Fedora RPM team. Don't quote me on that.)
The network infrastructure that's being built and researched for the LHC almost has to benefit everyone. The connectivity between North America and Europe has improved at a much faster rate due to the demands of the LHC.
Our community tries its best to give back to the code bases we use in the form of patches and improvements as much as we can. I don't know if we could/should do more, but I'm personally satisfied that we aren't exploiting FOSS.
Cheers.
CERN
I used to work at Brookhaven National Laboratory and can explain the situation further. First of all most linux machines in the LHC and ATLAS clusters run scientific linux.
http://en.wikipedia.org/wiki/Scientific_linux
It's a centos based derivative created by CERN. There are MANY MANY open source tools in use at almost ALL labs. Argonne actually created a tool called big config (http://en.wikipedia.org/wiki/Bcfg2) and JLAB has also written and open-sourced much software. Condor is open source queue management software used for cluster job submission. This is also used and patches submitted upstream.
Actually, Scientific Linux is
Actually, Scientific Linux is Redhat-based and is created/supported jointly by Fermilab and CERN.
Well, they invented the world
Well, they invented the world wide web for one ;-) http://en.wikipedia.org/wiki/Www#History
Excellent Article!
Thoroughly enjoyed reading through this, thank you for the glimpse into the goings on at the LHC and its experiments. Nice to see oss being used in such an important and exciting experiment :)
open HPC software stack
Great read! I'm excited to see the tools used and the choices made on such a data intensive project. Wrestling with many of these same issues on a smaller scale, it is nice to see just how far these tools will scale.
I'll be honest with you...
After reading this, I have come to a decision: I believe that man has begun to think too much!
...I'm not just a "troll", but also a subscriber!
think too much
Hey J, we're all Devo!