Seth Zenz | Imperial College London | UK


Wrestling with the Grid

This being my first entry, I suppose I ought to start with what I do and put it into context—you’ll have to bear with me, because this will take a minute. First, the preliminaries: I’m a fourth-year graduate student at the University of California, Berkeley, working on ATLAS and currently based at CERN in Geneva, Switzerland. I work primarily on testing the offline software.

If you’re a regular reader of the US/LHC Blogs, then that last paragraph made sense to you, except maybe for the last two words. Offline software is the part of the experiment you hear the least about, probably because it’s the hardest to explain. It’s the collection of programs and tools that connects the information that comes from the detector to the physics we’re really interested in, so you have to know about both those things to see what it’s all about. Fortunately, from what I can tell, if you’re a regular reader here then you’re in pretty good hands already as far as the physics and the detector go. So let me give an extremely brief survey of the software challenges faced by ATLAS, and then connect it to my own work at the end.

The first and most daunting computing challenge faced by the ATLAS detector is the vast discrepancy between the 40,000,000 potential collisions per second and the 100 or so events that can be stored permanently during that second; this job is handled by the trigger system, which looks for the collisions that will be most interesting for the physics that we want to do. The first part of this system is entirely hardware-based, but the higher levels run on Linux farms. The data for all this processing has already passed through hardware, firmware, and low-level software, both on and off the detector. So there’s a lot of software involved just in reading out the signals from the detector and storing them.
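The scale of that rejection is worth pausing on. Using the round numbers above (which are from this post, not official figures), a quick back-of-the-envelope calculation shows what the trigger is up against:

```python
# Back-of-the-envelope trigger rejection, using the round numbers
# quoted above; these are illustrative, not official ATLAS figures.
collision_rate_hz = 40_000_000   # potential collisions per second
stored_rate_hz = 100             # events written to permanent storage per second

rejection_factor = collision_rate_hz // stored_rate_hz
print(rejection_factor)          # prints 400000
```

In other words, for every event kept, roughly 400,000 must be thrown away, and the decision has to be made in real time.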

None of that is the offline software, though—that comes in after the data has already been stored on tape, and is less urgent in the sense that it will be done within days rather than within seconds. One of the major tasks of the offline software is reconstruction, which is the conversion of stored information from the detector into the real particles that most likely created those signals. (This has already been done, quickly, by the trigger, but is now done with more precision.) For example, the software might combine a series of “hits” in the Inner Detector to make the likely track of a charged particle, then combine this with energy deposited in the electromagnetic calorimeter to identify a possible electron. (Monica has written more on how information is combined to identify various particles in this entry.) Offline software is also used to simulate the physics of the detector, which is useful now so that we can “practice” our analyses for when the data is ready, and will be useful later in comparing what we actually see in the detector to what we would expect if the Standard Model were exactly right.
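To make the electron example a bit more concrete, here is a toy sketch of the matching idea: pair a charged-particle track with a nearby energy deposit in the calorimeter. Every name, class, and threshold here is invented for illustration; real ATLAS reconstruction is enormously more sophisticated.

```python
# Toy illustration of matching a track to a calorimeter energy deposit
# to flag a possible electron. All names and the 0.1 matching threshold
# are invented for illustration only.
from dataclasses import dataclass

@dataclass
class Track:
    eta: float   # direction of the track in the Inner Detector
    phi: float

@dataclass
class Cluster:
    eta: float   # position of an energy deposit in the EM calorimeter
    phi: float
    energy_gev: float

def electron_candidates(tracks, clusters, max_dr=0.1):
    """Pair each track with any sufficiently close calorimeter cluster."""
    candidates = []
    for t in tracks:
        for c in clusters:
            dr = ((t.eta - c.eta) ** 2 + (t.phi - c.phi) ** 2) ** 0.5
            if dr < max_dr:
                candidates.append((t, c))
    return candidates

tracks = [Track(eta=0.50, phi=1.20)]
clusters = [Cluster(eta=0.52, phi=1.18, energy_gev=45.0)]
print(len(electron_candidates(tracks, clusters)))  # prints 1
```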

The ATLAS detector is going to record a lot of data, and reconstruction and other offline software tasks take a lot of computer time. Where are all these computers? Well, it turns out that no laboratory in the world has anywhere near enough computing power to do the job, so we link them all together in something called the Grid. This collection of computing sites, spread throughout the world, will have the data recorded by the experiment divided between them; when a physicist wants to look at the data, the job is sent to where the data is, which is much more efficient than copying the data to the physicist’s computer (if she even had enough space, which she probably doesn’t). Of course, using this complicated system presents new challenges; a big one is that the job you run could be sent anywhere, so it’s a lot harder to call tech support if the job fails for some reason. In order to deal with this problem, the ATLAS offline software includes job transforms, which are essentially wrappers for our regular software; whereas normally our jobs are configured by python scripts, the transforms take a very limited number of inputs. This lets us be sure that we’re running the job in a “standard” configuration that can be expected to work, so that the Grid’s computer resources can be used efficiently.
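The job-transform idea above can be sketched in a few lines: a wrapper that exposes only a handful of knobs and pins everything else to known-good defaults. The function and option names here are invented for illustration; the real ATLAS transforms are much richer.

```python
# A minimal sketch of a "job transform": accept only a small, fixed set
# of inputs and expand them into a complete job configuration. All
# names and defaults here are hypothetical.
def reco_transform(input_file: str, output_file: str, max_events: int = -1) -> dict:
    """Build a complete, 'standard' job configuration from a few knobs."""
    if max_events == 0:
        raise ValueError("max_events must be -1 (meaning all events) or positive")
    # Everything not exposed as an argument is pinned to a known-good
    # default, so every Grid site runs the job the same way.
    return {
        "input": input_file,
        "output": output_file,
        "max_events": max_events,
        "geometry": "standard-v1",     # fixed, not user-configurable
        "conditions": "latest-good",   # fixed, not user-configurable
    }

config = reco_transform("hits.root", "reco.root", max_events=100)
print(config["geometry"])  # prints standard-v1
```

The point of the limited interface is exactly what the paragraph above says: if users can only choose from a few inputs, every job that reaches the Grid is in a configuration that has actually been tested.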

Of course, things can still go wrong, and this—at last!—is where I start to come into the picture. Although the experiment’s software developers always test their changes against the latest version of the code, there are several kinds of bugs they can’t catch, including: 1) bugs that only happen in very large jobs, 2) bugs arising because two developers have made incompatible changes at the same time, and 3) bugs that appear only when multiple stages of data processing (e.g. simulation, then reconstruction) are run. This means that we might produce a software release in which one of the “standard” job transform configurations, which should work, actually doesn’t; if we send such jobs to hundreds of machines around the world, and they all crash in parallel, that’s a big waste of time and money! One of the tools we have to guard against this is the Full Chain Test, which I have written and maintained over the last year or so. This is a set of scripts which send a series of large jobs to a few dedicated four-processor machines here at CERN, to make sure as well as possible that everything is working the way we expect before we send things off into the Grid.
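The structure of such a test can be sketched simply: run the processing stages in order, feed each stage’s output to the next, and stop at the first failure. The stage names and runner interface below are invented stand-ins; the real Full Chain Test drives actual ATLAS jobs on dedicated machines.

```python
# A minimal sketch in the spirit of a full-chain test: chain several
# processing stages, passing each stage's output to the next, and
# report the first failure. All stage names are dummies.
def run_chain(stages, initial_input):
    """stages: list of (name, function) pairs; each function takes an
    input name and returns (exit_code, output_name)."""
    data = initial_input
    for name, stage in stages:
        code, data = stage(data)
        if code != 0:
            return f"FAILED at {name} (exit code {code})"
    return f"OK, final output: {data}"

# Dummy stages standing in for real simulation/digitization/reconstruction jobs.
def simulate(inp):    return (0, "simulated.root")
def digitize(inp):    return (0, "digits.root")
def reconstruct(inp): return (0, "reco.root")

result = run_chain(
    [("simulation", simulate), ("digitization", digitize), ("reconstruction", reconstruct)],
    "events.txt",
)
print(result)  # prints: OK, final output: reco.root
```

Running the whole chain, rather than each stage in isolation, is what catches the third class of bug mentioned above: incompatibilities that only appear when one stage’s output is fed to the next.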

So the short version is this: I write programs to run other programs and make sure they work. Or, as I often tell my friends and family, I sit in front of the computer all day.

Seriously, although this is not the most glamorous work, I’m very happy with the project, because:

  1. It’s important. It’s used routinely by the experiment’s software management to ensure that our releases are good, and it actually finds problems that save us time—which means that I’m helping make sure that our offline software is ready to run when the detector is.
  2. It’s self-contained. I have a specific set of things to be tested, but the details of the implementation have been mostly up to me, so I’ve learned quite a bit.
  3. It’s done, except for a bit of documentation, and hopefully I’ll be able to pass on the routine maintenance to someone else.

That last item is especially important, because as a student I have a lot to learn and only so much time—and a wise professor once told me that once I get good at something, that means it’s time to move on. I have a thesis project to prepare for, and I’m hoping to work more directly with the detector once the commissioning of the Pixel Detector starts in earnest. But more on those things later.
