Ken Bloom | USLHC | USA

DNA in a haystack

While avoiding writing the final exam for my course (sorry, students, I’m now almost done with it), I stumbled on this article in The New York Times about the deluge of data in genomics. At this point, genomes can be sequenced much more quickly than they can be analyzed. Indeed, the article reports that there is enough sequencing capacity in the world to fill a stack of DVDs two miles high with data each year.

Sound familiar? (Including the DVD analogy?) Particle physics experiments face the same problem. At the LHC, particles collide 600 million times per second. The four LHC experiments would produce a petabyte of data (a million gigabytes) per second — if we were to keep every bit of data that the LHC produced. Obviously, we don’t do that; the data is heavily filtered by the experiments’ trigger systems, which reduce the rate to 300-400 events per second per experiment. Even so, that still adds up to something like 15-25 PB of data per year, and a stack of DVDs several times taller than the one holding the DNA sequences. So we have the same problem, if not a bigger one — and that is after we’ve kept only one in a million collisions! Particle physics has long been at the forefront of data-intensive computing.
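
As a rough cross-check of those numbers, here is a back-of-the-envelope calculation. The average event size (~1 MB) and the ~10^7 live seconds of running per year are my own assumptions for illustration, not figures from the post:

```python
# Back-of-the-envelope check of the LHC data-rate figures quoted above.
# Assumed (not from the post): ~1 MB per recorded event, ~1e7 live
# seconds of running per year.

EVENT_SIZE_MB = 1.0        # assumed average size of one recorded event
EVENTS_PER_SEC = 350       # middle of the 300-400 events/s per experiment
SECONDS_PER_YEAR = 1e7     # assumed accelerator "live" seconds per year
EXPERIMENTS = 4            # the four LHC experiments

per_experiment_pb = EVENTS_PER_SEC * EVENT_SIZE_MB * SECONDS_PER_YEAR / 1e9
total_pb = per_experiment_pb * EXPERIMENTS
print(f"~{per_experiment_pb:.1f} PB/year per experiment, ~{total_pb:.0f} PB/year total")
```

With those assumptions the total lands in the same ballpark as the 15-25 PB per year quoted above.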

I’m no biologist, and I won’t claim that I know any more about genomics than I do about soccer or wide-area networking. But it seems natural to ask what (if anything) genomics can learn from particle physics in terms of data management. I can think of two ideas. First, must they really keep all the data? We throw away essentially every collision that happens (all but that one in a million) and still learn a huge amount of physics. This works largely because we know what is interesting and what isn’t, and know how to throw away the boring stuff. For all I know, that might not be possible in genomics. The data you are throwing away might be the genomes of individual people, and if you later want to understand how one particular person works, you can’t. But if you are just looking for trends in the population, maybe you can.

Another idea: can they make the data any smaller? In particle physics experiments, we do a lot of “zero suppression” up front, throwing away the information from electronics channels that have nothing to say about a particular event before we even record the data to disk. Then, when we process the data to estimate the energies and momenta of the particles produced in a given collision, we typically store even less information. The samples we present to analyzers are very compact, essentially down to the momentum vectors, without all of the channel-by-channel information about each particle. I’ve read that a lot of our DNA is actually “junk” with no impact on how biological traits are expressed. How much of this can be identified in a given genome and then safely thrown away? Or, if you don’t quite feel safe throwing it all away, could you keep it in, say, 10% of the genomes as a safety measure? (For trigger aficionados, this would be a form of prescaling.)
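
To make the two tricks concrete, here is a toy sketch of zero suppression and prescaling. The channel readings and the threshold are invented for illustration; real readout and trigger systems are vastly more sophisticated:

```python
def zero_suppress(channels, threshold=0.0):
    """Drop channels whose reading is at or below threshold, keeping
    only (channel_id, value) pairs worth writing to disk."""
    return [(i, v) for i, v in enumerate(channels) if v > threshold]

def prescale(events, n):
    """Keep every n-th event in full and discard the rest — the
    'keep it in 10% of the genomes' idea corresponds to n=10."""
    return [e for i, e in enumerate(events) if i % n == 0]

readings = [0.0, 0.0, 4.2, 0.0, 1.7, 0.0]   # mostly-quiet electronics channels
print(zero_suppress(readings))               # only the channels with signal: [(2, 4.2), (4, 1.7)]
print(prescale(list(range(20)), 10))         # 1 event in 10 survives: [0, 10]
```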

I don’t know the answers to any of these questions, but perhaps the biologists would, or could at least use them to stimulate some new thinking. Just the other day, the boss was telling the intensity frontier workshop that particle physics is part of a fabric of sciences in which different fields make broad contributions to each other. Data-intensive computing could be considered one of the threads that holds that fabric together.

Thanks to Ruth Pordes, executive director of the Open Science Grid, for suggesting this as a blog topic.

  • Enceladus

    There are several differences between HEP and biology (I am a biologist!), one of them being the complexity of what we are studying. Of course, I do not mean that HEP is “simple”; I rather mean that there are a lot of unifying principles, and a lot of things can be derived from a relatively small number of parameters and equations. It is incredibly difficult, for example, to know for sure whether there is a Higgs boson, but in the end everything will be summarized in just a few bytes of information (“there are n Higgs bosons; their charges and masses are…”). Maybe I am wrong, but that is the feeling I have as a non-physicist interested in what happens in this field.
    In biology, things are different: there are no unifying principles, and the information is still huge after data processing. In a Darwinian competition, “breaking a rule” is often an excellent way to gain a decisive advantage. Nearly every dogma in biology has been broken: “DNA is produced from DNA” is wrong because of reverse transcriptase. “Genetic information is always stored as DNA” is wrong because of RNA viruses, and it is even worse if we consider prion elements. “Genes are always inherited from the parents or from the mother cells” is wrong because of horizontal transfers. And the “junk DNA” is maybe not junk after all; the last ten years have shown that this “junk DNA” is essential in several regulation processes and in the defense against viruses and transposons. Even if it looks like junk, that does not mean it is junk. It just means that, so far, we could not attribute any function to that piece of DNA. Maybe it has none, maybe it has one. Throwing away some information is dangerous.
    However, the biggest problem is not really data storage: as you said, it is huge but still significantly smaller than the amount of data produced at the LHC. The problem is rather data processing, which is extremely difficult. A genome is not really useful information by itself; it is just a way to store the instructions for producing proteins and RNAs, which are most of the time the useful elements. It is quite easy to predict the locations and sequences of proteins and RNAs. However, the “useful” information is: what is the shape of the RNAs/proteins, what are they doing and when, with which other proteins, RNAs, or molecules are they interacting, and what are the consequences. This “useful” information is present in the DNA sequence, but we are unable to predict it from the DNA alone. There are two steps we cannot yet compute: predicting the shape of a protein/RNA from its sequence, and predicting the function from the shape. We have to do that experimentally, which can require many people working for years on a single gene, and we have about 30,000 of them.
    So here is the problem: genetic data now accumulate quite fast, but we cannot do this analysis in silico, and the experiments are very slow. There are probably some very interesting genes in penguins, in Amazonian mushrooms, or in some unnamed bacterium swimming in the Indian Ocean, but we simply cannot process them. Of course, most penguin genes have a counterpart in humans, or at least in other birds, but you would expect some funny genes in an animal living in such a harsh environment.

  • Stephen Brooks

    Perhaps one of the few data compression methods that might work is “deduplication” across genomes from different humans or mammals. That is, identify common portions between individuals and only store them once. I often hear that “we share 90%+ of our genome with animal X”.

    There are a few distributed computing projects around that try to predict protein folding, I think.
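
A minimal sketch of the deduplication idea raised above: store one reference sequence in full and, for each individual, only the positions where it differs. (This toy is my own, loosely in the spirit of how real variant-storage formats work; it is not an actual genomics tool, and the sequences are made up.)

```python
def diff_against_reference(reference, genome):
    """Store only the (position, base) pairs where this genome
    differs from the shared reference sequence."""
    return [(i, b) for i, (r, b) in enumerate(zip(reference, genome)) if r != b]

def reconstruct(reference, diffs):
    """Rebuild the full genome from the reference plus its diff."""
    seq = list(reference)
    for i, b in diffs:
        seq[i] = b
    return "".join(seq)

ref    = "ACGTACGTAC"   # shared reference, stored once
person = "ACGTTCGTAA"   # an individual genome
d = diff_against_reference(ref, person)
print(d)                # [(4, 'T'), (9, 'A')] — two bases instead of ten
assert reconstruct(ref, d) == person
```

Since individual genomes differ from the reference in only a tiny fraction of positions, the per-individual storage shrinks dramatically while remaining losslessly reconstructible.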

  • Xezlec

    To the author: I think you may have gotten “genomes can be sequenced in much greater quantities than they can be stored” from the article, but I think that’s a little wrong. There’s really no storage difficulty. A dozen human genomes can fit on a modern USB thumb drive, and it wouldn’t be hard to give everyone a thumb drive containing their DNA. So, throwing away some pieces of the data does not really make sense, IMHO. We can keep it all and analyze it at our leisure. If all you’re saying is that researchers should prioritize and focus on collecting and analyzing the most important genes first, well… duh.

  • Ken Bloom

    Xezlec: Indeed, I’m just taking what I read from the article. I can surely believe that a small number of human genomes is a very manageable amount of data, but the researchers in the article seemed to be trying to deal with much more than that. They’re the ones who are saying that they can’t keep up with it!

  • It is the difference between “casting your nets” wisely or blindly. The wise fisherman knows where to cast. I have been casting my nets over DNA sequence for some decades – hopefully wisely. The results are contained in my books, which may be accessed by way of my homepage. My YouTube videos (see attached website) tell the story quite simply, even for those who know no more than they do about soccer or wide-area networking, whatever that means!

  • Kipp

    With the discovery of miRNAs, RNAi, etc., it’s pretty clear that a lot of what we used to think was “junk DNA” actually isn’t. Only a small fraction of the eukaryotic genome codes for RNA that eventually codes for proteins, sure – and we used to think this was the only important part. The difference is that genomics and systems biology are looking at things that involve higher-order complexity, which means that it’s perilous to disregard any of this data at all.

    As a previous commenter said, the problem isn’t even so much with data as it is with computation: these are processes which are exponentially more intensive to simulate than things in high energy physics. Look at how far protein folding, which is one of the more solvable biological computational issues, still has to go despite efforts over the last 15 years.