Posts Tagged ‘computing’

This past Monday we had our annual US CMS Tier-2 computing workshop. Once again, we held our workshop as part of the Open Science Grid All-Hands Meeting. Those of you who have been reading the blog for more than a year will remember that last year this meeting was held at the totally neat LIGO facility in Louisiana. This year the meeting was at totally neat…Fermilab! OK, I’ve been to Fermilab before, so no travelogue this time, but as usual it was good to meet so many collaborators face to face.

I don’t want to jinx ourselves, but I’m feeling pretty good about the state of the computing for the experiment right now. As we reviewed the status of the seven CMS Tier-2 sites in the United States and two in Brazil, we generally saw that everyone is operating pretty stably and happily. A year ago, there was a lot of discontent with existing large-scale disk storage systems. But since then we’ve developed and implemented some new systems, and there have been a lot of improvements in the existing systems, so it all just looks a lot better.

That being said, this is all just a dress rehearsal — we’ll see how it really goes when thousands of physicists start using the system to do hundreds of data analyses. Now that the LHC running schedule has been defined for the coming three years, we have a much better handle on the needed computing resources for this period. Overall, we’re going to be running at lower collision rates than previously anticipated, but with pretty much the same livetime. This means that we’ll be recording the same number of events we would have at higher collision rates, implying that the density of interesting physics will be smaller. It creates a more challenging situation for the computing, but at least we now know what has to be done, and have a reasonably good idea of how to get there.

As for the second half of the title — the real excitement was on my trip home. I had an 8:10 AM flight out of O’Hare, which would arrive in Lincoln around 9:40, giving me plenty of time to be ready for my 12:30 PM class. But there was fog in Chicago, and an aircraft was late, and then the crew was swapped, and then the aircraft was sent to Peoria instead while we waited for the crew, and in the end we didn’t leave until around 10:45. The plane touched down on the runway in Lincoln at 11:57. And I was in my classroom just on time. Ah, lovely Lincoln, where the airport is small, you park right next to the airport, and you can drive to campus in minutes!

Local news

Friday, November 27th, 2009

Admittedly, it is a little harder to follow all the LHC excitement if you are here in the US rather than at CERN.  The announcement of first collisions on Monday came while I was teaching my class, and I’ve been trying to piece together the whole story by talking to our people over there and reading the slides from various meetings.  Of note was a public meeting at CERN yesterday (yes, Thanksgiving Day, another impediment if you are in the US) with presentations from Steve Myers, CERN’s director for accelerators, and the four LHC experiments.  See the slides and video here.  As everyone else has been saying, the past week has been a thrill (or at least a vicarious one!) for the LHC, the four experiments on the ring, and really all of HEP.  Check out Myers’s slides in particular, where he documents just how far we have come in the past fourteen months.  The experiments have turned around information from these first few collisions very quickly; some detectors are already able to reconstruct decays of the neutral pion, for instance.  We have huge expectations for the next set of collisions and then for the increases in collision energy that will follow.

My particular contribution to CMS has been in computing, and I’m happy to say that all of that has gone quite smoothly so far.  The prompt reconstruction of events went off without a hitch, and data was flowing very quickly out of CERN to the Tier-1 and Tier-2 sites.  We soon lost track of how many sites had copies of the collision data, and now we’re seeing plenty of people use the distributed computing system to analyze it.  When the next round of collisions comes, we’ll be ready to do it all again.

So while it’s hard to follow the news up to the minute, I’m still connected to the start of a great particle physics adventure.  I’m trying to drag the rest of Nebraska along with me — we managed to get a release placed in the local paper, and if you read this post soon enough, you can hear me at 8:30 AM Central time on Saturday 11/28 on KZUM, Lincoln’s community radio station.  I’ve already taped the interview; let’s hope I didn’t sound incoherent!  (At least when I type the blog posts, there is a backspace key….)

October, exercised

Friday, October 9th, 2009

Here at CMS, we are in the midst of something that, I guess for lack of a better name, has been dubbed the “October exercise.” For the past week and the week to come, we have been trying to get as many people as possible to use the distributed computing system just as they would if they were doing a real analysis with real data. A new set of simulations has been released, and people are trying to work them through the system and their data analyses as quickly as possible, to demonstrate the turnaround time and the scale at which we will be hammering the computing clusters that are distributed around the world.

Halfway through, I would have to consider this at least something of a success. I don’t have anything resembling an accurate count of how many people have gotten involved, but it seems that we are seeing lots of people who had just been doing their data-analysis work on local computing clusters now trying to use the grid for the first time. Tens of individual exercises have been designed by the dozen-ish CMS physics groups, each with multiple steps involving processing, writing and transferring data. As someone who has been working on the distributed computing for some years now, it is encouraging to see so many new people try out the system, and be successful more often than not.

On the other hand, it’s not as if everything has gone perfectly. A number of new tools and rules were developed just in advance of the exercise, and running these things out of the box at scale has been a bit bumpy. We were certainly aware of the weaknesses in the system, but now they are on full display. One thing that has proved particularly challenging is the “staging out” of outputs made by users in their processing jobs. In CMS computing, different datasets get distributed to different computing sites, and physicists who want to run on those datasets send their jobs to those sites. But everyone has a “home” site, and the output of the jobs has to be returned to the home site. This means that the data must be transferred from a somewhat random site X to the user’s site Y, and not every site Y can handle the volume of transfers that might be coming in. We’re keeping an eye on this and thinking about how we can improve it in the future.
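The stage-out problem described above can be illustrated with a small sketch. It assumes a hypothetical job record of the form (site where the job ran, user's home site) and a made-up per-site limit on incoming transfers; neither comes from any real CMS tool.

```python
from collections import Counter

def overloaded_home_sites(jobs, capacity):
    """Find home sites receiving more stage-out transfers than they can absorb.

    jobs: iterable of (run_site, home_site) pairs, one per finished job (hypothetical format).
    capacity: dict mapping home_site to the number of incoming transfers it can handle (assumed).
    """
    # Only jobs that ran away from home need a transfer back to the home site.
    incoming = Counter(home for run, home in jobs if run != home)
    return {site: n for site, n in incoming.items() if n > capacity.get(site, 0)}

# Toy data: three jobs at site X stage out to site Y, which can only take two.
jobs = [("X", "Y"), ("X", "Y"), ("X", "Y"), ("Z", "Z"), ("X", "Z")]
capacity = {"Y": 2, "Z": 5}
print(overloaded_home_sites(jobs, capacity))
```

In this toy case only site Y is flagged, since the job that ran at Z and belongs to Z never needs a transfer.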

After a week of this, I’d have to say that it’s somewhat exhausting to try to keep up with all that’s going on. And we don’t even have data yet — how exhausted will I be then? But on the flip side, I’m glad that we’re learning all of these lessons now, rather than a month or two from now.

Not a day at the beach

Saturday, April 25th, 2009

Only two weeks left until the end of the academic year! This is always a very busy period, which is my excuse for not writing anything recently. Very little academic business gets done around the university during the summer, so all sorts of things need to get wrapped up before we get to the end of the term, and there are always so many year-end events for our students too. And of course I still have my class to teach; this is going fairly smoothly, but I will probably need every last minute in the next two weeks (or at least until I have prepared the final exam) to bring it to a happy ending.

As it happens, I also have a cluster of research-related travel right now — not helpful for getting my teaching done, but it gives me something to write about. I spent some of this week in San Diego, where those of us working on CMS software and computing gathered to discuss the state of the world. These meetings are more typically at CERN, but someone (I’m not even sure who, actually) came up with the brilliant idea of doing them next to an ocean this time instead. That’s great for me — not the ocean part, so much, but it’s always a challenge for me to get to CERN, what with the long distance and the fact that it’s hard to go for less than a week. For these meetings, I was able to teach on Tuesday morning and catch a flight here that night, and still attend most of the workshop.

As has been true for some time, the question we have been struggling with is whether we are ready for the start of the LHC and, if not, what we have to do to get there. I think that the greatest value of this meeting (heck, any meeting, I suppose) was to bring together groups of people who don’t usually talk. It turns out that there were cases of people working on different aspects of particular problems who had very different understandings of some of the issues. For instance, there was a dispute over whether “24 hours” actually meant 24 hours, or something more like 48 hours. And in some cases, one group of people didn’t know about work that another group was doing that could in fact be very useful to the first group. In short, there’s nothing like actually getting people in the same room to explain themselves to each other.

But once again, I was struck by just how complicated this experiment will be. The challenge from the computing perspective is how interconnected everything is. We want to make sure that a user can’t do anything that could essentially knock over a site (or possibly the whole distributed computing system) by accident. Certainly there were times in the meetings when someone would ask, “why do we have to make it so hard?” but honestly, sometimes it just is that hard.

Anyhow, next week I’ll be in Denver for the April general meeting of the American Physical Society. I’ll write about it then…much more physics content, I promise!

Why is computing interesting?

Friday, July 11th, 2008

Given the tedium of what I need to deal with day to day on the computing, what is it that makes computing interesting?  Let me make a comparison with what is going on in the collision halls.  My colleagues underground at CERN are working very hard as we head towards LHC startup.  There are some very tight time constraints at this point, and they are working with very complex systems that are pushing the limits of their technologies.  And as we head into these final weeks, the separate systems that have been under development for years must be integrated into one large experiment.  It’s a tremendous task, and I don’t want to take anything away from what they are doing.

However, they are starting to get out of the woods.  The door to the collision hall will be shut at some point, and very little can be changed after that.  And the number of people who will interact directly with those systems is relatively small; a team of experts, who will continue to make a lot of effort to make their hardware work and keep it running happily.  Most of their work will be hidden to the world; physicists will be happy to see lots of silicon hits on tracks, but they will only have a vague idea of how much labor went into that.  (I’ll say again, the hardware guys are under-appreciated!)

In contrast, just about everyone on CMS will interact with the computing in some way, which means that my problems are just beginning.  Everyone will want to know where the datasets are.  Everyone will be trying to submit jobs.  Everyone will be trying to make plots.  Performance will be documented and updated regularly on Web pages.  This means that everyone will have an opinion on what works well and what doesn’t, and they won’t hesitate to voice it.  And all the computers are above ground, and software can be modified with a few keystrokes; we can tweak things endlessly, and we might well be called upon to do so.

So in fact this is a very human enterprise — we are building a system that 2000 motivated, smart and creative people will be using every day.  We need to make it work for each of them as individuals, while also making sure that the group as a whole is not harmed.  And while ultimately we have to build good systems, there is a lot of psychology and sociology involved too.  Everyone needs to actually buy in to the idea of distributed computing for it to work, which might be hard while we still work through all the kinks, and everyone will need to trust that they are being treated fairly.  One of my mentors said to me once, “If all of our problems were physics problems, this job would be easy.”  She was of course referring to the fact that we must work with people every step of the way.  Physics equations and plots are interesting, but the human aspect of the work adds an extra dimension.

It is on my mind today because I have been corresponding with some users who are having trouble running jobs on our site.  It sounds like there could be any number of things going on…many of which may have nothing to do with the performance of the cluster here.  But it doesn’t matter; I’m invested in getting the entire chain working, because we have to build confidence.  More to come, I’m sure.

Not a cabling exercise

Wednesday, June 11th, 2008

Despite all of our high-minded talk about new physics discoveries, lots of nitty-gritty work has to be done in the trenches. The day to day work of particle physics can typically be characterized either as a cabling exercise (building the detectors) or a bookkeeping exercise (analyzing the data). Since most of the other bloggers are writing about their cabling exercises, I’ll write a bit about a bookkeeping exercise.

The Tier-2 computing site that I help to manage currently hosts about 150 TB of data, in about 200 datasets, which in turn consist of about 135,000 individual files. These files are in turn spread over many different disks in about 20 different storage servers. When someone submits a job to our site, they specify which files they want to analyze. We have to then locate those files and make sure the jobs can actually work on them. Bookkeeping, indeed.

When we copy datasets here from Tier-1 sites, we use a tool that looks up a database of what files exist and where they exist, and then tries to transfer all of those files here.  This transfer tool, called PhEDEx (a name both beautiful and horrible), then records which files are at our site.  It also writes this information into a different record that a tool called DBS can look at.  When users submit jobs, they check against DBS to locate datasets.  So, this means that PhEDEx needs to be in synch with DBS (and vice-versa), or else user jobs will come here and look for data files that we don’t have.  And of course what is actually on the disk needs to be consistent with what PhEDEx thinks we have on disk.

This is what I have been chasing around for the last week or so.  We did some careful checking and discovered that PhEDEx thought we had about 2% more files on disk than we actually had.  That might not sound so terrible, but it is 4 TB, not a small amount.  That we got fixed by telling PhEDEx that we didn’t actually have those files; it then went and re-transferred them for us, without incident.  However, PhEDEx and DBS disagree with each other too, I suspect by an even larger amount.  We’re still trying to figure that one out.
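The kind of cross-check described above boils down to comparing sets of file names. Here is a minimal sketch, assuming each view (the disk, PhEDEx, DBS) has already been dumped to a plain set of names; the function and the toy file names are hypothetical and do not use any real PhEDEx or DBS interface.

```python
def compare_catalogs(on_disk, in_phedex, in_dbs):
    """Report the files that the three views disagree about.

    Each argument is a set of file names (assumed to be exported beforehand).
    """
    return {
        "phedex_not_on_disk": in_phedex - on_disk,      # catalog says we have it, disk says no
        "on_disk_not_in_phedex": on_disk - in_phedex,   # orphan files taking up space
        "phedex_not_in_dbs": in_phedex - in_dbs,        # users cannot find these
        "dbs_not_in_phedex": in_dbs - in_phedex,        # jobs would look for files we lack
    }

# Toy data standing in for the real listings.
on_disk = {"a.root", "b.root", "c.root"}
in_phedex = {"a.root", "b.root", "c.root", "d.root"}
in_dbs = {"a.root", "b.root", "e.root"}

for problem, files in compare_catalogs(on_disk, in_phedex, in_dbs).items():
    print(problem, sorted(files))
```

With sets, each comparison is a single difference, so the check stays cheap even for a listing on the order of the 135,000 files mentioned earlier.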

I decided that one way to reduce the number of inconsistencies is just to reduce the number of datasets that we host.  Heck, how did we end up with 200 datasets here anyway, and who exactly wanted to use them?  I trolled through a lot of old emails and other records, and now have a picture of how we got here.  What I still need is to be able to understand how actively each dataset is being used.  If no jobs are actually looking at a given dataset, I can delete it without anyone being offended.  (That’s how I got down to the 200 I’m quoting…I was able to knock off the easy ones.)  We need to develop some better tools for this.
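A tool for spotting idle datasets could start out as simply as the sketch below. It assumes a hypothetical job log that records one dataset name per job; the dataset names are invented for illustration.

```python
from collections import Counter

def unused_datasets(hosted, job_log):
    """Return the hosted datasets that no job in job_log has read, sorted by name.

    hosted: iterable of dataset names stored at the site.
    job_log: iterable of dataset names, one entry per job (hypothetical log format).
    """
    reads = Counter(job_log)
    return sorted(name for name in hosted if reads[name] == 0)

# Toy data: three hosted datasets, only two of which any job has touched.
hosted = ["/ttbar/skim", "/zmumu/skim", "/old/test"]
job_log = ["/ttbar/skim", "/ttbar/skim", "/zmumu/skim"]
print(unused_datasets(hosted, job_log))
```

A real version would also want a time window, since a dataset nobody has read this month may still be needed next month.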

All this is nothing compared to trying to understand how the funding that supports this and other work of our research group here actually flows to us.  That’s been another recent project.  It’s a bookkeeping exercise, for sure, but I wouldn’t call it particle physics.

Tiers on my pillow

Friday, May 23rd, 2008

And now, the long-promised explanation of the CMS distributed computing system. (I know, you have been on the edge of your seats all this time.)

Let’s start by considering boundary conditions. First, the LHC will produce a lot of data. Every year, the CMS detector will produce something like a petabyte of raw data. A petabyte is a million gigabytes; if I did the calculation right, that much data stored on a set of DVDs would make a stack twice as high as the Nebraska state capitol, a famously tall building (if you know your Nebraska). This data needs to be processed (which usually means adding more information to it, making it bigger), stored and analyzed. On top of that there is an even larger amount of simulated data — if you are looking for new physics, you have to simulate it first so you know exactly what detector signatures you are looking for. Thus, we are talking many petabytes of data per year that we must work with.
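The DVD comparison can be checked with a little arithmetic. The sketch below relies on figures that are not in the post and are assumptions here: a single-layer DVD holding 4.7 GB, a bare disc thickness of 1.2 mm, and a capitol height of roughly 122 m (about 400 ft).

```python
# Back-of-the-envelope check of the DVD stack. All constants are assumptions.
PETABYTE_GB = 1_000_000     # one petabyte expressed in gigabytes
DVD_CAPACITY_GB = 4.7       # assumed: single-layer DVD
DVD_THICKNESS_M = 0.0012    # assumed: 1.2 mm per disc, no cases
CAPITOL_HEIGHT_M = 122.0    # assumed: roughly 400 ft

def dvd_stack_height_m(data_gb):
    """Height in metres of the stack of DVDs needed to hold data_gb gigabytes."""
    n_discs = data_gb / DVD_CAPACITY_GB
    return n_discs * DVD_THICKNESS_M

height = dvd_stack_height_m(PETABYTE_GB)
ratio = height / CAPITOL_HEIGHT_M
print(f"{height:.0f} m of DVDs, about {ratio:.1f} capitols")
```

Under those assumptions the stack comes out at about 255 m, a little over two capitols, which agrees with the estimate in the text.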

Second, you may not notice this while tapping on your laptop, but computers require a significant amount of power and cooling for their operation. This has become a constraint on operating data centers; last year I went to a conference on computing in high-energy physics, and the whole week ended up being about power and cooling. (Yes, I was able to stay awake.) No single site can deploy enough power and cooling to support all of the computing needed for CMS data processing and analysis.

So, our answer is to run a highly-distributed computing system, with centers distributed around the globe. Now, this does present significant organizational challenges, but it also allows us to make use of computing expertise in every CMS country, and also gives people a sense of ownership — my vice-chancellor for research was much more interested in helping to pay for computers in Nebraska than he would have been to send computers to Switzerland.

To keep the system manageable, we’ve imposed a tiered hierarchy on it. Different computing centers are given different responsibilities, and are designed to meet those responsibilities. (“Design” here means how much CPU or disk they have, and what sort of networking requirements, etc.) A too-cool-for-school graphic showing how the whole thing works can be found here. The Tier-0 facility at CERN receives data directly from the detector, and it reconstructs events and writes a copy of the output to tape. This may not sound like much, but it saturates the resources that are available at CERN.

Data is then transferred to Tier-1 centers. CMS has seven of these, in the US (at Fermilab), the UK, France, Spain, Italy, Germany and Taiwan. These centers store some fraction of the data that come from CERN, and as we gain a better understanding of our detector behavior and of how we want to reconstruct the data, they also re-reconstruct their fraction of the data every now and then. They also make “skims” of these events — a particular physics measurement typically relies on only a portion of all the collisions that we record, so we split the data into different subsamples that will each be enriched in certain kinds of events.

Note that in all this no one has yet made a plot that will appear in a journal publication! This starts to happen at Tier-2 sites; that’s where skims get placed for general users to analyze them. There are about forty of these sites spread over five continents, and they are also responsible for generating all of that simulated data mentioned earlier. This makes the Tier-2 sites very diverse and dynamic facilities — they are responsible to many different people trying to do many different things.
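As a compact summary of the division of labor described so far, here is a minimal sketch of the tier hierarchy as a lookup table. The structure and names are purely illustrative; the role descriptions paraphrase the text above and are not taken from any CMS software.

```python
# Which tier does what, paraphrased from the description above (illustrative only).
TIER_ROLES = {
    "Tier-0": ["receive data from the detector", "first reconstruction", "write to tape"],
    "Tier-1": ["store a fraction of the data", "re-reconstruction", "skimming"],
    "Tier-2": ["host skims for user analysis", "generate simulated data"],
}

def tiers_for(task):
    """Return the tiers responsible for a task, in hierarchy order."""
    return [tier for tier, roles in TIER_ROLES.items() if task in roles]

print(tiers_for("skimming"))
print(tiers_for("generate simulated data"))
```

Each task maps to exactly one tier here, which is the point of the hierarchy: every center knows what it is responsible for.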

I have surely rambled on enough for a single posting, so some other time I will write about some of the particular challenges we face in making this system work. Suffice it to say that I spend a lot of time thinking about it. I try not to let it keep me up at night, but sometimes the title turns out to be true. Sorry, I needed to come up with a title for this post, and while “Trail of tiers” was more appropriate, it also has negative connotations in Native American history.

Progress we take for granted

Monday, May 19th, 2008

Just for fun: One of the fun things about working at MIT is that you have a nice perch to observe the progress of technology. I was wandering around the MIT museum with the kids and came upon this: 256 kB!

This object is about 50 years old and approximately 1 cubic meter in size if I remember correctly. Anyone want to guess what it is?

Make no little plans?

Friday, May 16th, 2008

This week, US CMS held a “run-plan” workshop at Fermilab. The goal of the workshop was to really get a grip on what needs to be done when the LHC starts running and CMS starts taking data. Did we meet this goal? Do we actually have a plan now? Well, at the very least we have a better picture of what’s going on, and for someone like myself, who sits in Nebraska and spends most of his time thinking about computing, it is helpful to get the broader view. Here’s a sampler of some of the things going on:

  • As you can read from some of the other posts on this site, there is a tremendous amount of work going on with the detector. We recently completed several days of data-taking with as much of the detector as we can, but with no beam (of course!) and no magnetic field. Even that is a huge effort; getting all these pieces of the detector working at once is quite complicated. And this is not just an operational exercise — the data that were recorded are potentially quite useful. Yes, we recorded a whole lot of nothing, but if you analyze that, you ought to observe…nothing. If instead you see something, then there is some detector effect going on that can contaminate beam-collision data, such that you would see something when you ought to see nothing. And when you are looking for new physics, and you don’t quite know what it’s going to look like, then nothing that looks like something is going to be a lot of trouble. One thing we hope to do is superimpose these “empty” events on top of simulations of “real” events, and see how badly our simulations degrade as a result.
  • I spent most of my time in a working group focusing on computing issues. The most interesting presentation we had was from a student who has been busy using the computing system for several months. He of course has found ways to get his work done most efficiently…which were not necessarily the ways we imagined people using the system! It was great to hear what he and others find to be the most difficult things to do; we came up with some ideas for improvements that can be made. On balance, though, the system is working pretty well, even if we still have further to go.
  • No one said that they had too many people working on a project. Everything still needs more effort. It’s encouraging in that any help that is offered will be welcomed.

I gave a couple of presentations at the workshop, one on which tasks place demands on the resources of Tier-2 centers, and one on some of the issues we need to think about in analyses involving leptons plus jets in the final state. These went well enough. More importantly, by coming to the workshop I had a chance to see some of my friends and colleagues face to face. Video conferencing is OK, but you can learn a lot by chatting in the cafeteria. There are some physics things that I really do want to get going on, especially now that the summer is here, and I spoke to a few far-flung collaborators who want to launch similar efforts. We all agreed to phone and email and so forth. One colleague emphasized to me that we must really seize the day now. I knew this already, but it was reinforced — the next few years will be a unique time in my entire scientific career, which still has a few decades to go, so I should make the most of it.

The title of this posting comes from Daniel Burnham, who was the principal planner for the layout of the city of Chicago, one of our great American cities; he believed that every resident should be within walking distance of a park, and decreed that the lakefront should always be free and accessible to the public. “Make no little plans,” he said, “They have no magic to stir men’s blood and probably will not themselves be realized.” It worked out well for Chicago; let’s make it the same for the LHC.

A new record, alas

Friday, May 9th, 2008

I typically start my morning at home by scanning my overnight email while I eat breakfast. (This sets a horrible example for my daughter; I will have to stop when she becomes more cognizant of what I’m doing.) When you have 2000 collaborators, and most of them are seven time zones ahead of you, there is usually some amount of mail to get through, so I like to get a jump on it before heading to the office.

On Tuesday of this week, I believe I set a new record — there were 82 new mail messages waiting for me and my Cheerios. (They’re actually the generic store-brand cereal, not the name-brand.) Now, admittedly many of these were the skim-and-delete types. (All of you people out there who are having semi-private conversations and cc’ing everything to some mailing list — please stop. It just makes me cranky, and the O’s get increasingly soggy as I hit the delete key over and over.) But some of them needed further contemplation, which stretched well into the work day.

Why now? Part of it was the recovery from the May Day holiday weekend; as people came back to work in Europe, they had a lot to catch up on themselves. But a lot of it was the computing challenges that are now underway. These startup phases are always challenging; all sorts of technical things haven’t been tested at scale, and not all sites have completely gotten the message on what they are supposed to be doing, and sometimes there are policy issues that haven’t been thought out yet either. The good news, however, is that there has been a lot of good performance out there. We have about 30 Tier-2 computing sites (I know, I haven’t explained the tier system yet) participating — about as many as I could imagine — and by and large things are working. There are a number of sites that have definitely exceeded my expectation for how many jobs they could handle and how many of them would finish successfully. (I’m not going to name names, because I don’t want to embarrass sites that I had low expectations of!) The unfortunate exception has been my own cluster at Nebraska. It’s been a tough week for us, as we’ve been fighting multiple problems and arbitrating among various demands on the system. It took a while for the challenge jobs to start running, and when they did, 98% of them promptly failed. The important thing is that we understand why, so that we can be more successful the next time around, and it sounds like our admins are gaining on that. But at the same time, I feel like we just got caught with our pants down.

Through the wonders of the Internet, I am able to follow this (and annoy our admins with questions) while far from home. This weekend I find myself in Kalamazoo, MI, where my wife is attending the rather huge annual congress on medieval studies that Western Michigan University hosts. There are at least 3000 medievalists here, a bigger turnout than we typically get for the biggest particle physics conferences of the year. My job for the weekend is looking after my daughter, who is attending her first Kalamazoo meeting. Our hotel is one of the kinds with breakfast included. I’ve been leaving the computer in the room, to be polite. No O’s among the breakfast selections.