Despite all of our high-minded talk about new physics discoveries, lots of nitty-gritty work has to be done in the trenches. The day-to-day work of particle physics can typically be characterized as either a cabling exercise (building the detectors) or a bookkeeping exercise (analyzing the data). Since most of the other bloggers are writing about their cabling exercises, I’ll write a bit about a bookkeeping exercise.
The Tier-2 computing site that I help to manage currently hosts about 150 TB of data in about 200 datasets, which in turn consist of about 135,000 individual files. Those files are spread over many different disks in about 20 different storage servers. When someone submits a job to our site, they specify which files they want to analyze; we then have to locate those files and make sure the jobs can actually work on them. Bookkeeping, indeed.
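To give a flavor of the problem, here is a minimal sketch of the lookup involved. Every name in it (the file paths, the server hostnames) is made up for illustration; the real machinery is far more elaborate than a Python dictionary.

```python
# Toy version of the site's bookkeeping: map each logical file to the
# storage server that holds it, then resolve a job's requested file list.
file_locations = {
    "/store/data/run123/file001.root": "se03.example.edu",  # invented names
    "/store/data/run123/file002.root": "se11.example.edu",
    # ... about 135,000 entries in reality
}

def locate(requested_files):
    """Split a job's file list into (found, missing)."""
    found = {f: file_locations[f] for f in requested_files if f in file_locations}
    missing = [f for f in requested_files if f not in file_locations]
    return found, missing
```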
When we copy datasets here from Tier-1 sites, we use a tool that looks up, in a database, which files exist and where they live, and then tries to transfer all of those files here. This transfer tool, called PhEDEx (a name both beautiful and horrible), then records which files are at our site. It also writes this information into a separate record that a tool called DBS can read. When users submit jobs, the jobs check against DBS to locate datasets. This means that PhEDEx needs to be in sync with DBS (and vice versa), or else user jobs will come here looking for data files that we don’t have. And of course, what is actually on the disks needs to be consistent with what PhEDEx thinks we have on disk.
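Conceptually, the consistency check is just set arithmetic over three file lists. Here is a hedged sketch, assuming (hypothetically) that each catalog, and a walk of the storage servers, can be dumped to a plain text file with one logical file name per line; the dump file names below are invented.

```python
# Three-way consistency check: PhEDEx catalog vs. DBS catalog vs. disk.
# The dump files are hypothetical; in reality each list would come from
# its own service query or a crawl of the storage servers.

def load_file_list(path):
    """Read one logical file name per line into a set."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

phedex = load_file_list("phedex_dump.txt")  # what PhEDEx says we have
dbs    = load_file_list("dbs_dump.txt")     # what DBS says we have
disk   = load_file_list("disk_dump.txt")    # what is actually on disk

print(f"In PhEDEx but not on disk: {len(phedex - disk)}")
print(f"On disk but not in PhEDEx: {len(disk - phedex)}")
print(f"PhEDEx/DBS disagreements:  {len(phedex ^ dbs)}")
```

The hard part in practice is not the set arithmetic; it is getting trustworthy dumps of all three lists at more or less the same moment.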
This is what I have been chasing around for the last week or so. We did some careful checking and discovered that PhEDEx thought we had about 2% more files on disk than we actually had. That might not sound so terrible, but it comes to 4 TB, which is not a small amount. We fixed that by telling PhEDEx that we didn’t actually have those files; it then went and re-transferred them for us, without incident. However, PhEDEx and DBS disagree with each other too, I suspect by an even larger amount. We’re still trying to figure that one out.
I decided that one way to reduce the number of inconsistencies is simply to reduce the number of datasets that we host. Heck, how did we end up with 200 datasets here anyway, and who exactly wanted to use them? I trawled through a lot of old emails and other records, and now have a picture of how we got here. What I still need is a way to understand how actively each dataset is being used. If no jobs are actually looking at a given dataset, I can delete it without anyone being offended. (That’s how I got down to the 200 I’m quoting…I was able to knock off the easy ones.) We need to develop some better tools for this.
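Whatever tool we end up with, its core will be a usage audit something like the one sketched here. Everything in this snippet is assumed for illustration: a job-history log with one line per job in the form “timestamp user dataset”, and a text file listing our hosted datasets. The real accounting would have to come from the batch system or the CMS job-monitoring tools.

```python
from collections import Counter

def dataset_usage(log_path):
    """Count jobs per dataset in a hypothetical 'timestamp user dataset' log."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts

usage = dataset_usage("job_history.log")  # invented log file
with open("hosted_datasets.txt") as f:    # invented list of our ~200 datasets
    hosted = {line.strip() for line in f if line.strip()}

# Datasets that no job has touched are candidates for deletion.
idle = sorted(d for d in hosted if usage[d] == 0)
print(f"{len(idle)} of {len(hosted)} datasets have no recorded jobs")
```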
All this is nothing compared to trying to understand how the funding that supports this and the other work of our research group actually flows to us. That’s been another recent project. It’s a bookkeeping exercise, for sure, but I wouldn’t call it particle physics.