Are you addicted to YouTube? No, I wouldn’t say that about myself, but gosh, it’s rather amazing what you can find on there. At home with the kids lately, we’ve been looking at classic bits of The Electric Company, the 1970s Children’s Television Workshop educational show which spans the period of late Tom Lehrer to early Morgan Freeman. Part of what makes YouTube great is that it’s so easy to use. You put a phrase into the search window, and some computer somewhere (don’t ask me where) quickly finds the data that you are looking for. Then you just click a button and the videos come streaming onto your computer, without a whole lot of effort from you. You don’t have to know what computer disk the file resides on, or the directory structure of that computer. For all you know, the video might be coming from several different computers at once, with the source being adjusted in real time to give the best streaming performance.
Now, compare that to how we go about getting our data in particle physics experiments. Back in the day, you had to know the exact directory and exact file names of the dataset you wanted to analyze, and then carefully type them into your computer programs. A single typo could destroy hours or days of computing effort. We’ve largely gotten past that: we now have better technology for file catalogues, so you can just specify the name of a dataset and all the file names will be looked up for you. But we are still largely constrained by “data locality,” the requirement that your analysis program run on a computer in the same room as the computer holding the disk with your data. This constraint leads to a variety of optimization problems. What if a dataset gets popular all of a sudden? Are there enough processing resources in the right place to handle the demand? Can you get more copies out to the bigger processing centers quickly? Are you then under-using other centers and letting CPU cycles go idle? If you want to run on a given dataset, you might know which computing sites have it, but how do you know which one has the most available resources right now? And finally, what if the data at a site gets corrupted? Will all the jobs running in that computer room start failing? Needless to say, this doesn’t sound like YouTube at all.
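The catalogue lookup described above can be sketched roughly like this. To be clear, this is a toy model, not the real CMS bookkeeping system; the dataset name, file names, and site names are all invented for illustration:

```python
# Toy file catalogue: maps a dataset name to its file names and to the
# sites that host a copy. All names here are invented examples.
CATALOGUE = {
    "DoubleMu/Run2012A": {
        "files": ["file_001.root", "file_002.root"],
        "sites": ["T2_US_Nebraska", "T2_US_UCSD"],
    },
}


def lookup(dataset):
    """Return the file list for a dataset, so nobody has to type paths by hand."""
    entry = CATALOGUE.get(dataset)
    if entry is None:
        raise KeyError(f"unknown dataset: {dataset}")
    return entry["files"]
```

The point of the indirection is that a single typo no longer wastes a day of computing: either the dataset name resolves to the full, correct file list, or the lookup fails immediately.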
Some colleagues and I are working on a project that tries to change this. We’ve called it “Any Data, Anytime, Anywhere,” as our goal is to make it as easy to access LHC data as it is to access a YouTube video. At the heart of the system is a “redirector,” a service that acts as a giant index of the files residing at computing sites all over the country. A program asks the redirector for a file, the redirector finds an optimal source for it, and the program then reads the file from that source, without the user having to know where the file actually is. The source could be thousands of miles away, so remote reading is only practical if it is nearly as fast as reading from a disk in the same room; some effort has gone into making that happen.

Once you have removed the data locality requirement, all sorts of things are possible. If a file is corrupt at one site, we can introduce a fallback mechanism so that a failed read triggers an attempt to fetch the same file through the redirector instead. If a particular site gets overloaded with jobs, we could start to migrate them to a less busy site, even if that site doesn’t actually have the data the jobs want; it can be obtained through the redirector instead. That could lead to a better global balancing of supply and demand for resources. While we imagine that it’s computers at CMS institutions that will be reading the data, there’s nothing to stop any computer anywhere from reading it, even one that is not part of CMS. That could really fulfill the promise of grid computing: if we can borrow a computer for a few hours, we can use it to analyze CMS data even if that computer starts out knowing nothing about CMS. It also gives us a straightforward way to use cloud-computing resources, should that turn out to be cost-effective.
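The redirector and the fallback behavior described above might be sketched like this. Again, this is only an illustration under invented assumptions: the site names, the load figures, and the least-loaded selection rule are made up, not the actual policy of the production system:

```python
# Toy redirector index: file name -> sites known to hold a copy.
# Site names are invented examples.
INDEX = {
    "events.root": ["T2_US_Nebraska", "T2_US_UCSD", "T1_US_FNAL"],
}

# Invented per-site load figures (0.0 = idle, 1.0 = saturated).
LOAD = {"T2_US_Nebraska": 0.9, "T2_US_UCSD": 0.2, "T1_US_FNAL": 0.5}


def redirect(filename, exclude=()):
    """Pick the least-loaded site holding a copy, skipping sites known to be bad."""
    sources = [s for s in INDEX.get(filename, []) if s not in exclude]
    if not sources:
        raise IOError(f"no source available for {filename}")
    return min(sources, key=lambda s: LOAD[s])


def read_with_fallback(filename, read):
    """Read via the redirector; if a source fails (e.g. a corrupt copy),
    ask the redirector again, excluding the sites already tried."""
    tried = []
    while True:
        site = redirect(filename, exclude=tried)
        try:
            return read(site, filename)
        except IOError:
            tried.append(site)  # fall back to the next-best source
```

The user's program only ever names the file; which site actually serves it, and what happens when a copy turns out to be bad, is the redirector's problem rather than the physicist's.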
And on top of all that, why should this be limited to the LHC? Many disciplines have large datasets that need to be analyzed by distributed teams of scientists. In principle, they could use the same infrastructure. We’re hoping that this technology could eventually be used across the sciences and even in emerging fields like the digital humanities. If that were to happen, then researchers from all sorts of disciplines could consider themselves Easy Readers, at least as far as their data is concerned.