The Large Hadron Collider
Job Management
Once the decision was made to decentralize the analysis resources, a crucial question needed to be answered: how does a physicist in Europe run a job using data stored at Nebraska? In 2004, the computing model for CMS was finalized and embraced the emerging grid technology. At that time, the technical implementation was left flexible to allow sites to adopt any grid middleware that might emerge. Analysis sites in Europe adopted the Worldwide LHC Computing Grid (WLCG) software stack to facilitate analysis. Sites in the US chose the Open Science Grid (OSG) to provide the software to deploy jobs remotely. The two solutions are interoperable.
The OSG's (http://www.opensciencegrid.org) mission is to help in the sharing of computing resources. Virtual Organizations (VOs) can participate in the OSG by providing computing resources and utilizing the extra computing resources provided by other VOs. During the past year, the OSG has provided 280 million hours of computing time to participating VOs. Figure 1 shows the breakdown of those hours by VO during the past year. (The Internet search for the meaning of the VO acronyms is an exercise left to the reader.) Forty million of those hours were provided to VOs not associated with particle physics. Participation in the OSG allows Nebraska to share any idle CPU cycles with other scientists. Furthermore, the CMS operational model for all US Tier-2 sites is that 20% of our average computing is set aside for use by non-CMS VOs. This gives non-CMS VOs an incentive to join the OSG. Non-CMS VO participation increases support for and development of the OSG software, which allows CMS to benefit from improvements made by other users. The OSG's model should serve as an example for similar collaborative efforts.
Figure 1. A week-by-week accounting of Open Science Grid usage by user VO for the past year.
OSG provides centralized packaging and support for open-source grid middleware. The OSG also gives administrators easy installation of certificate authority credentials. Certification and authentication management is one of the OSG's most useful toolsets. Further, the OSG monitors sites and manages a ticketing system to alert administrators of problems. Full accounting of site utilization is made available by OSG so that funding agencies and top-level management have the tools they need to argue for further expenditures. See Figure 2 for CPU hours provided to the OSG by some of the major facilities.
Figure 2. A week-by-week view of CPU hours provided to the Open Science Grid by computing facilities. Both Nebraska and Firefly are resources provided by the University of Nebraska.
In short, what SETI@home does with people's desktops, OSG does for research using university computing centers.
Distributed Filesystems
The CMS experiment will generate more than one terabyte of recorded data every day. Every Tier-2 site is expected to store hundreds of terabytes on-site for analysis. How do you effectively store hundreds of terabytes and allow for analysis from grid-submitted jobs?
When we started building the CMS Tier-2 at Nebraska, the answer was a software package called dCache, written at DESY, a high-energy physics (HEP) laboratory in Germany. dCache, or Disk Cache, was a distributed filesystem created by physicists to act as a front end to a large tape storage. This model fit well with the established practices of high-energy physicists. The HEP community had been using tapes to store data for decades. We are experts at utilizing tape. dCache was designed to stage data from slow tapes to fast disks without users having to know anything about tape access. Until recently, dCache used software called PNFS (Perfectly Normal File System, not to be confused with Parallel NFS) to present the dCache filesystem in a POSIX-like way but not quite in a POSIX-compliant way. Data stored in dCache had to be accessed using dCache-specific protocols or via grid interfaces. Because file access and control were not truly POSIX-compliant, management of the system could be problematic for non-dCache experts.
dCache storage is file-based. All files stored on disk correspond to files in the PNFS namespace. Resilience is managed via a replica manager that attempts to store a single file on multiple storage pools. Although a file-based distributed storage system was easy to manage and manually repair for non-experts using dCache, the architecture could lead to highly unbalanced loads on storage servers. If a large number of jobs were requesting the same file, a single storage server easily could become overworked while the remaining servers were relatively idle.
Our internal studies with dCache found that we were having a better overall experience when using large disk vaults rather than when using hard drives in our cluster worker nodes. This created a problem meeting our storage requirements within our budget. It is much cheaper to purchase hard drives and deploy them in worker nodes than to buy large disk vaults. The CMS computing model does not allow funding for large tape storage at the Tier-2 sites. Data archives are maintained at the Tier-0 and Tier-1 levels. This means the real strength of dCache is not being exploited at the Tier-2 sites.
The problems of scalability and budgeting prompted Nebraska to look to the Open Source world for a different solution. We found Hadoop and HDFS.
Hadoop (http://hadoop.apache.org) is a software framework for distributed computing. It is a top-level Apache project and actively supported by many commercial interests. We were not interested in the computational packages in Hadoop, but we were very interested in HDFS, which is the distributed filesystem that Hadoop provides. HDFS allowed us to utilize the available hard drive slots in the worker nodes in our cluster easily. The initial installation of HDFS cost us nothing more than the hard drives themselves. HDFS also has proven to be easy to manage and maintain.
The only development needed on our end to make HDFS suitable for our needs was to extend the gridftp software to be HDFS-aware. Analysis jobs are able to access data in HDFS via FUSE mounts. There is continued development on the analysis software to make it HDFS-aware and further remove unnecessary overhead.
HDFS is a block-based distributed filesystem. This means any file that is stored in HDFS is broken into data blocks of a configurable size. These individual blocks then can be stored on any HDFS storage node. The probability of having a hot data server that is serving data to the entire cluster starts to approach zero as the files become distributed over all the worker nodes. HDFS also recognizes when data required by the current node is located on that node and it does not initiate a network transfer to itself.
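To make the hot-server argument concrete, here is a minimal Python sketch comparing the busiest server under file-based and block-based storage. It is not HDFS code: the 64 MB block size, the 20-node cluster, the 2 GB file, the 100 readers and the round-robin placement policy are all assumptions chosen for illustration, and real HDFS placement also accounts for replication and rack layout.

```python
BLOCK_SIZE = 64 * 1024**2  # assumed block size (64 MB); HDFS makes this configurable


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_round_robin(num_blocks, nodes):
    """Stand-in placement policy: block i goes to node i modulo the node count."""
    return [nodes[i % len(nodes)] for i in range(num_blocks)]


def bytes_served(block_sizes, placement, readers):
    """Total bytes each node must serve when every reader reads the whole file."""
    load = {}
    for size, node in zip(block_sizes, placement):
        load[node] = load.get(node, 0) + size * readers
    return load


nodes = [f"node{i:02d}" for i in range(20)]  # hypothetical 20-node cluster
file_size = 2 * 1024**3                       # one hypothetical 2 GB data file
readers = 100                                 # hypothetical concurrent jobs

blocks = split_into_blocks(file_size)
block_load = bytes_served(blocks, place_round_robin(len(blocks), nodes), readers)

# File-based storage: the whole file, and therefore all the traffic, sits on one server.
file_load = {nodes[0]: file_size * readers}

print("blocks per file:", len(blocks))
print("busiest node, block-based (GB):", max(block_load.values()) / 1024**3)
print("busiest node, file-based (GB):", max(file_load.values()) / 1024**3)
```

The total traffic is identical in both cases. Only its distribution changes, which is the point of the block-based design.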
The block replication mechanisms in HDFS are very mature. HDFS gives us excellent data resiliency. Block replication levels are easily configured at the filesystem level, but also can be specified at the user level. This allows us to tweak replication levels in an intelligent way to ensure simulated data that is created at Nebraska enjoys higher fault tolerance than data we can readily retransfer from other sites. This maximizes our available storage space while maintaining high availability.
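The storage trade-off can be sketched with a little arithmetic. The Python below is only an illustration: the dataset names, sizes and replication factors are invented, not Nebraska's actual numbers. It compares the raw disk consumed when everything is replicated three times against a tiered policy that keeps three copies of locally simulated data and two copies of data that can be retransferred.

```python
def raw_space_used(datasets):
    """Sum logical size times replication factor over (name, size_tb, replication) tuples."""
    return sum(size_tb * replication for _name, size_tb, replication in datasets)


# Hypothetical holdings: (name, logical size in TB, replication factor)
tiered = [
    ("locally_simulated", 50, 3),  # hard to regenerate, so keep more copies
    ("retransferable", 200, 2),    # can be fetched again from other sites
]

# The same data with one uniform replication factor for comparison
uniform = [(name, size_tb, 3) for name, size_tb, _replication in tiered]

tiered_raw = raw_space_used(tiered)
uniform_raw = raw_space_used(uniform)

print("raw TB used, tiered policy:", tiered_raw)
print("raw TB used, uniform policy:", uniform_raw)
print("raw TB saved:", uniform_raw - tiered_raw)
```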
HDFS was a perfect fit for our Tier-2.
A student at the University of Nebraska-Lincoln, Derek Weitzel, completed a student project that shows the real-time transfers of data in our HDFS system. Called HadoopViz, this visualization shows all packet transfers in the HDFS system as raindrops arcing from one server to the other. The figure below shows a still shot.
Screens from left to right: Condor View of Jobs; PhEDEx Transfer quality; Hadoop Status Page; MyOSG Site Status; CMS Dashboard Job Status; Nagios Monitoring of Nebraska Cluster; CMS Event Display of November 7 Beam Scrape Event; OSG Resource Verification Monitoring of US CMS Tier-2 Sites; HadoopViz Visualization of Packet Movement
Data Analysis
Once the data is stored at a Tier-2, physicists need to be able to analyze it to make their discoveries. The platform for this task is Linux. For the sake of standardization, most of the development occurs on Red Hat Enterprise-based distributions. Both CERN and FNAL have their own Linux distributions but add improvements and customizations into the Scientific Linux distribution. The Tier-2 at Nebraska runs CentOS as its primary platform.
With data files constructed to be about 2GB in size and data sets currently hovering in the low terabyte range, full data set analysis on a typical desktop is problematic. A typical physics analysis will start with coding and debugging taking place on a single workstation or small Tier-3 cluster. Once the coding and debugging phase is completed, the analysis is run over the entire data set, most likely at a Tier-2 site. Submitting an analysis to a grid computing site is not easy, so the process has been automated with software developed by CMS called CRAB (CMS Remote Analysis Builder).
To create a user's jobs, CRAB queries the CMS database at CERN that contains the locations where the data is stored globally. CRAB constructs the grid submission scripts. Users then can submit the entire analysis to an appropriate grid resource. CRAB allows users to query the progress of their jobs and request the output to be downloaded to their personal workstations.
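The general idea of turning a data-location lookup into grid jobs can be sketched in a few lines of Python. This is not CRAB's code or API. The catalog dictionary, the file and site labels, the "first site alphabetically" selection rule and the `build_jobs` helper are all hypothetical stand-ins, meant only to show how files might be grouped into jobs that run where the data already lives.

```python
from collections import defaultdict


def build_jobs(file_locations, files_per_job):
    """Group files by a hosting site, then cut each site's list into fixed-size jobs."""
    by_site = defaultdict(list)
    for name, sites in sorted(file_locations.items()):
        # Stand-in for real site selection: take the first hosting site alphabetically.
        by_site[sorted(sites)[0]].append(name)

    jobs = []
    for site in sorted(by_site):
        files = by_site[site]
        for start in range(0, len(files), files_per_job):
            jobs.append({"site": site, "files": files[start:start + files_per_job]})
    return jobs


# Hypothetical catalog: which sites hold a copy of each file
catalog = {
    "f1.root": ["site_b"],
    "f2.root": ["site_b"],
    "f3.root": ["site_b"],
    "f4.root": ["site_a", "site_b"],
    "f5.root": ["site_a"],
}

jobs = build_jobs(catalog, files_per_job=2)
for job in jobs:
    print(job["site"], job["files"])
```

A real submission tool would also weigh site load and availability when a file is hosted in more than one place. The sketch ignores that and simply picks one site deterministically.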
CRAB can direct output to the Tier-2 storage itself. Each CMS user is allowed 1 terabyte of space on each Tier-2 site for the non-archival storage of each user's analysis output. Policing the storage used by scientists is a task left to the Tier-2 sites. HDFS's quota functionality gives the Nebraska Tier-2 administrators an easily updated tool to limit the use of analysis space automatically.
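One detail worth keeping in mind when sizing such quotas: as I understand the HDFS quota documentation, a space quota counts every replica of a block, so a 1-terabyte logical allowance needs a larger raw quota. The Python sketch below shows only that arithmetic and a simple over-quota check. The replication factor of 2, the user names and the usage figures are assumptions for illustration, and in practice HDFS enforces the limit itself.

```python
TB = 1024**4


def space_quota_bytes(logical_allowance_bytes, replication):
    """Raw space quota needed so a user can store the logical allowance at this replication."""
    return logical_allowance_bytes * replication


def over_quota(raw_usage_bytes, quota_bytes):
    """Return the sorted names of users whose raw usage exceeds the quota."""
    return sorted(user for user, used in raw_usage_bytes.items() if used > quota_bytes)


quota = space_quota_bytes(1 * TB, replication=2)  # assumed replication factor of 2

# Hypothetical raw usage per user directory, replicas included
usage = {"alice": 3 * TB // 2, "bob": 5 * TB // 2}

print("raw quota in TB:", quota // TB)
print("users over quota:", over_quota(usage, quota))
```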
Figure 3 shows a simulated event seen through CMS, and Figure 4 shows an actual recorded event.
Figure 3. How a physicist sees CMS—this is the event display of a single simulated event.
Figure 4. An actual recorded event from CMS—this event shows radiation and charged particles spilling into the detector from the beam colliding with material in the beam pipe.
A Grateful Conclusion
The LHC will enable physicists to investigate the inner workings of the universe. The accelerator and experiments have been decades in design and construction. The lab is setting new benchmarks for energetic particle beams. Everyone I talk to about our work seems to get glassy-eyed and complain that it is just too complex to comprehend. What I want to do with this quick overview of the computing involved in the LHC is tell the Linux community that the science being done at the LHC owes a great deal to the contributors and developers in the Open Source community. Even if you don't know your quark from your meson, your contributions to open-source software are helping physicists at the LHC and around the world.
Comments
Great article!
Thank you for that article, I love reading about the LHC, and to cover linux too? Cool!
Ah, the LHC is so cool, this review wouldn't be complete if we didn't mention the LHC music video:
The Large Hadron Rap:
http://www.youtube.com/watch?v=j50ZssEojtM
The LHC rocks!
graphs
Loved the article. Very interesting to see some hardcore usage of Linux.
I have a simple question ... What program was used to create the bar charts in the article?
re: graphs
I did not generate the graphs by hand, the graphs are generated by a service called 'gratia'. Gratia is the accounting done within OSG. IIRC, the bulk of the graph making is done by 'graphtool'.
http://t2.unl.edu/documentation/graphtool.
Carl
Great read...
however.. this does beg the question.. with all this FOSS software being used.. and modified to suit their needs.. are they also contributing back their modifications for the good of all..? i quote the following..
"Both CERN and FNAL have their own Linux distributions but add improvements and customizations into the Scientific Linux distribution. "
apart for that.. it is great that they are taking advantage of the best software available to them..
ROOT
On the data analysis end of things, CERN / Fermilab have developed ROOT, a C++ based data analysis framework. Whilst it mainly caters for the needs of particle physics data analysis, it is a very powerful framework and I've heard of people in insurance and finance industries using it for data mining. See root.cern.ch
Re: Great Read....
"are they also contributing back their modifications for the good of all..?"
A good question, but one I didn't think to address more clearly in the article. We certainly try to give back where we can. The experiments employ numerous developers and they, when allowed by their campus/lab MOU pass patches back upstream. I also know of numerous patches Nebraska has submitted to Hadoop and Caltech has worked hard on packaging (RPM) HDFS (although I don't personally know if the Hadoop project adopts them). (IIRC, my counterpart at Caltech is on the Fedora RPM team. Don't quote me on that.)
The network infrastructure that's being built and researched for the LHC almost has to benefit everyone. The connectivity between North America and Europe has improved at a much faster rate due to the demands of the LHC.
Our community tries its best to give back to the code bases we use in the form of patches and improvements as much as we can. I don't know if we could/should do more, but I'm personally satisfied that we aren't exploiting FOSS.
Cheers.
CERN
I used to work at Brookhaven National Laboratory and can explain the situation further. First of all most linux machines in the LHC and ATLAS clusters run scientific linux.
http://en.wikipedia.org/wiki/Scientific_linux
It's a centos based derivative created by CERN. There's MANY MANY open source tools in use at almost ALL labs. Argonne actually created a tool called big config (http://en.wikipedia.org/wiki/Bcfg2) and JLAB has also written and open source much software. Condor is an open source queue management software used for cluster job submission. This is also used and patches submitted upstream.
Actually, Scientific Linux is Redhat-based and is created/supported jointly by Fermilab and CERN.
Well, they invented the world wide web for one ;-) http://en.wikipedia.org/wiki/Www#History
Excellent Article!
Thoroughly enjoyed reading through this, thank you for the glimpse into the goings on at the LHC and its experiments. Nice to see oss being used in such an important and exciting experiment :)
open HPC software stack
Great read! I'm excited to see the tools used and the choices made on such a data intensive project. Wrestling with many of these same issues on a smaller scale, it is nice to see just how far these tools will scale.
I'll be honest with you...
After reading this, I have come to a decision: I believe that man has begun to think too much!
...I'm not just a "troll", but also a subscriber!
think too much
Hey J, we're all Devo!