Ken Bloom | USLHC | USA

Shutdown? What shutdown?

I must apologize for being a bad blogger; it has been too long since I have found the time to write. Sometimes it is hard to understand where the time goes, but I know that I have been busy with helping to get results out for the ski conferences, preparing for various reviews (of both my department and the US CMS operations program), and of course the usual day-to-day activities like teaching.

The LHC has been shut down for about two months now, but that really hasn’t made anyone less busy. It is true that we don’t have to run the detector now, but the CMS operations crew is now busy taking it apart for various refurbishing and maintenance tasks. There is a detailed schedule for what needs to be done in the next two years, and it has to be observed pretty carefully; there is a lot of coordination required to make sure that the necessary parts of the detector are accessible as needed, and of course to make sure that everyone is working in a safe environment (always our top priority).

A lot of my effort on CMS goes into computing, and over in that sector things in many ways aren’t all that different from how they were during the run. We still have to keep the computing facilities operating all the time. Data analysis continues, and we continue to set records for the level of activity from physicists who are preparing measurements and searches for new phenomena. We are also in the midst of a major reprocessing of all the data that we recorded during 2012, making use of our best knowledge of the detector and how it responds to particle collisions. This started shortly after the LHC run finished, and will probably take another couple of months.

There is also some data that we are processing for the very first time. Knowing that we had a two-year shutdown ahead of us, we recorded extra events last year that we didn’t have the computing capacity to process in real time, but could save for later analysis during the shutdown. This ended up essentially doubling the number of events we recorded during the last few months of 2012, which gives us a lot to do. Fortunately, we caught a break on this — our friends at the San Diego Supercomputer Center offered us some time on their facility. We had to scramble a bit to figure out how to integrate it into the CMS computing system, but now things are happily churning away with 5000 processors in use.

The shutdown also gives us a chance to make relatively invasive changes to how we organize the computing without potentially disrupting critical operations. Our big goal during this period is to make all of the computing facilities more flexible and generic. For the past few years, particular tasks have often been bound to particular facilities, especially those that host large tape archives. But that can lead to inefficiencies; you don’t want computers sitting idle at one site while another site is backlogged because it has features that are in demand. For instance, since we are reprocessing all of the data events from 2012, we also need to reprocess all of the simulated events, so that they match the real data. This has typically been done at the Tier-1 centers, where the simulated events are archived on tape. But recently we have shifted this work to the Tier-2 centers; the input datasets are still at the Tier 1’s, but we read them over the Internet using the “Any Data, Anytime, Anywhere” technology that I’ve discussed before. That lets us use the Tier 2’s effectively when they might have been otherwise idle.
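The idea behind this kind of remote access can be sketched in a few lines: a job asks for a file by its logical name, gets a local path if the file happens to be hosted at the site where the job runs, and otherwise falls back to reading it over the network through a redirector. This is only an illustrative sketch, not actual CMS software; the catalog entries and the redirector hostname are placeholders.

```python
# Illustrative sketch (not real CMS code): resolve a logical file name to a
# local path when the file is hosted at this site, otherwise fall back to a
# remote read through a global redirector, in the spirit of
# "Any Data, Anytime, Anywhere". All names below are made up.

LOCAL_CATALOG = {
    "/store/data/Run2012D/evts_001.root": "/hdfs/store/data/Run2012D/evts_001.root",
}
REDIRECTOR = "root://global-redirector.example//"  # placeholder hostname

def resolve(lfn):
    """Return a URL the job can open: local if present, remote otherwise."""
    if lfn in LOCAL_CATALOG:
        return "file://" + LOCAL_CATALOG[lfn]
    return REDIRECTOR + lfn.lstrip("/")

# A Tier-2 job can then process simulation whose input lives at a Tier-1:
print(resolve("/store/mc/Summer12/sim_042.root"))
```

The point of the design is that the job itself doesn't care where the bytes live; the same code runs whether the data is on local disk or streamed from another continent.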

Indeed, we’re trying to figure out how to use any available computing resource out there effectively. Some of these resources may only be available to us on an opportunistic basis, and taken away from us quickly when they are needed by their owner, on the timescale of perhaps a few minutes. This is different from our usual paradigm, in which we assume that we will be able to compute for many hours at a time. Making use of short-lived resources requires figuring out how to break up our computing work into smaller chunks that can be easily cleaned up when we have to evacuate a site.
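The chunking idea above can be made concrete with a small sketch. This is an assumed design for illustration, not our actual workflow tools: a long task is split into short, independent event ranges, so that a worker evicted after a few minutes loses only the chunk it was working on, and only the unfinished chunks need to be resubmitted.

```python
# Sketch (assumed design): split a long processing task into short chunks so
# that an opportunistic worker evicted on a few minutes' notice loses only
# one chunk rather than hours of work.

def make_chunks(first_event, last_event, chunk_size):
    """Yield (start, end) event ranges small enough to finish in minutes."""
    start = first_event
    while start <= last_event:
        end = min(start + chunk_size - 1, last_event)
        yield (start, end)
        start = end + 1

done = []
for chunk in make_chunks(1, 1000, 250):
    # process(chunk) would run here; after an eviction we resubmit only the
    # chunks not yet in `done`, rather than redoing all 1000 events.
    done.append(chunk)

print(done)  # four 250-event ranges
```

The trade-off is overhead: smaller chunks waste less work on eviction but cost more in job start-up and bookkeeping, so the chunk size has to be tuned to how quickly a site can take its resources back.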

But computing resources include both processors and disks, and we’re trying to find ways to use our disk space more efficiently too. This problem is a bit harder — with a processor, when a computing job is done with it, the processor is freed up for someone else to use, but with disk space, someone needs to actively go and delete files that aren’t being used anymore. And people are paranoid about cleaning up their files, for fear of deleting something they might need at an arbitrary time in the future! We’re going to be trying to convince people that many files on disk aren’t getting accessed, and it’s in our interest to automatically clean them up to make room for data that is of greater interest, with the understanding that the deleted data can be restored if necessary.
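A minimal sketch of such a cleanup policy, under the assumptions described above (files can always be re-staged from the tape archive, so deletion is safe): flag anything whose last access is older than some threshold. The catalog, file names, and 90-day threshold here are hypothetical.

```python
# Hypothetical sketch of the automated-cleanup policy: flag files whose last
# access is older than a threshold, on the understanding that anything
# deleted can be restored from the tape archive later.

import time

def stale_files(catalog, now, max_idle_days=90):
    """catalog maps file name -> last-access timestamp in seconds."""
    cutoff = now - max_idle_days * 86400
    return sorted(name for name, atime in catalog.items() if atime < cutoff)

now = time.time()
catalog = {
    "higgs_candidates.root": now - 5 * 86400,    # touched last week: keep
    "old_calibration.root":  now - 200 * 86400,  # idle half a year: flag
}
print(stale_files(catalog, now))  # ['old_calibration.root']
```

In practice the hard part isn't the policy but the sociology: convincing physicists that an automated system deleting their files is a feature, not a threat.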

In short, there is a lot to do in computing before the LHC starts running again in 24 months, especially if you consider that we really want to have it done in 12 months, so that we have time to fully commission new systems and let people get used to them. Just like the detector, the computing has to be ready to make discoveries on the first day of the run!

  • LHC might want to see if, in the interests of scientific discovery, Amazon and Google might be willing to donate some of their expertise to what you’re doing, particularly large data set storage and retrieval.

    Both have world-class skills in just that. Last year I heard one of their server gurus speak in Seattle, and he said that Amazon now adds more capacity each week than its total capacity in 2000. Both are also skilled at keeping operating costs down.

    And given the sheer size of your data sets, they might also benefit from the experience of learning to manage them.

  • Ken Bloom

    In fact, both US CMS and US ATLAS have had interactions with Amazon to see how easy it would be to use their resources. On the CMS side, we have experimented with making it look like Amazon cloud resources are attached to an existing CMS site, and letting jobs run on those CPU’s while reading the data from somewhere else. ATLAS received a small grant from Amazon to use a larger amount of cloud resources, and they have successfully run 5000 jobs at a time there. (I don’t work on ATLAS, so I don’t know all the details, and thus I didn’t mention it in the original post.) People at Amazon have of course been tracking our progress with this, and we appreciate their interest.