Ken Bloom | USLHC | USA

DNA in a haystack

While avoiding writing the final exam for my course (sorry, students, I’m now almost done with it), I stumbled on this article in The New York Times about the deluge of data in genomics. At this point, genomes can be sequenced much more quickly than they can be analyzed. Indeed, the article reports that there is enough sequencing capacity in the world to fill a stack of DVDs two miles high with data each year.

Sound familiar? (Including the DVD analogy?) Particle physics experiments face the same problem. At the LHC, particles collide 600 million times per second. The four LHC experiments would produce a petabyte of data (a million gigabytes) per second — if we were to keep every bit of data that the LHC produced. Obviously, we don’t do that; the data is heavily filtered by the experiments’ trigger systems, which reduce the rate to 300-400 events per second per experiment. Even so, that still adds up to something like 15-25 PB of data per year, and a stack of DVDs several times taller than the one holding the DNA sequences. So we have the same problem, if not a bigger one — and that is after we’ve kept only one in a million collisions! Particle physics has long been at the forefront of data-intensive computing.
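
As a rough cross-check of those numbers, here is a back-of-the-envelope calculation. The average event size (~1 MB) and the ~10^7 live seconds of running per year are my own assumptions for illustration, not figures from the post:

```python
# Back-of-the-envelope check of the LHC data-rate figures quoted above.
# Assumed (not from the post): ~1 MB per recorded event, ~1e7 live
# seconds of running per year.

EVENT_SIZE_MB = 1.0        # assumed average size of one recorded event
EVENTS_PER_SEC = 350       # middle of the 300-400 events/s per experiment
SECONDS_PER_YEAR = 1e7     # assumed accelerator "live" seconds per year
EXPERIMENTS = 4            # the four LHC experiments

per_experiment_pb = EVENTS_PER_SEC * EVENT_SIZE_MB * SECONDS_PER_YEAR / 1e9
total_pb = per_experiment_pb * EXPERIMENTS
print(f"~{per_experiment_pb:.1f} PB/year per experiment, ~{total_pb:.0f} PB/year total")
```

With those assumptions the total lands in the same ballpark as the 15-25 PB per year quoted above.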

I’m no biologist, and I won’t claim that I know any more about genomics than I do about soccer or wide-area networking. But it seems natural to ask what (if anything) genomics can learn from particle physics in terms of data management. I can think of two ideas. First, must they really keep all the data? We throw away essentially every collision that happens (all but that one in a million) and still learn a huge amount of physics. This works largely because we know what is interesting and what isn’t, and know how to throw away the boring stuff. For all I know, that might not be possible in genomics. The data you are throwing away might be the genomes of individual people, and if you later want to understand how one particular person works, you can’t. But if you are just looking for trends in the population, maybe you can.

Another idea: can they make the data any smaller? In particle physics experiments, we do a lot of “zero suppression” up front, throwing away the information from electronics channels that have nothing to say about a particular event before we even record the data to disk. Then, when we process the data to estimate the energies and momenta of the particles produced in a given collision, we typically store even less information. The samples we present to analyzers are very compact, essentially down to the momentum vectors, without all of the channel-by-channel information about each particle. I’ve read that a lot of our DNA is actually “junk” with no impact on how biological traits are expressed. How much of this can be identified in a given genome and then safely thrown away? Or, if you don’t quite feel safe throwing it all away, could you keep it in, say, 10% of the genomes as a safety measure? (For trigger aficionados, this would be a form of prescaling.)
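
To make the two tricks concrete, here is a toy sketch of zero suppression and prescaling. The channel readings and the threshold are invented for illustration; real readout and trigger systems are vastly more sophisticated:

```python
def zero_suppress(channels, threshold=0.0):
    """Drop channels whose reading is at or below threshold, keeping
    only (channel_id, value) pairs worth writing to disk."""
    return [(i, v) for i, v in enumerate(channels) if v > threshold]

def prescale(events, n):
    """Keep every n-th event in full and discard the rest — the
    'keep it in 10% of the genomes' idea corresponds to n=10."""
    return [e for i, e in enumerate(events) if i % n == 0]

readings = [0.0, 0.0, 4.2, 0.0, 1.7, 0.0]   # mostly-quiet electronics channels
print(zero_suppress(readings))               # only the channels with signal: [(2, 4.2), (4, 1.7)]
print(prescale(list(range(20)), 10))         # 1 event in 10 survives: [0, 10]
```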

I don’t know the answers to any of these questions, but perhaps the biologists would, or could at least use them to stimulate some new thinking. Just the other day, the boss was telling the intensity frontier workshop that particle physics is part of a fabric of sciences in which different fields make broad contributions to each other. Data-intensive computing could be considered one of the threads that holds that fabric together.

Thanks to Ruth Pordes, executive director of the Open Science Grid, for suggesting this as a blog topic.

  • Enceladus

    There are several differences between HEP and biology (I am a biologist!), one of them being the complexity of what we are studying. Of course, I do not mean that HEP is “simple”; I rather mean that there are a lot of unifying principles, and a lot of things can be derived from a relatively small number of parameters and equations. It is incredibly difficult, for example, to know for sure whether there is a Higgs boson, but in the end everything will be summarized in just a few bytes of information (“there are n Higgs bosons; their charges and masses are…”). Maybe I am wrong, but that is the feeling I have as a non-physicist interested in what happens in this field.
    In biology, things are different: there are no unifying principles, and the information is still huge after data processing. In a Darwinian competition, “breaking a rule” is often an excellent way to gain a decisive advantage. Nearly every dogma in biology has been broken: “DNA is produced from DNA” is wrong because of reverse transcriptase. “Genetic information is always stored as DNA” is wrong because of RNA viruses, and it is even worse if we consider prion elements. “Genes are always inherited from the parents or from the mother cells” is wrong because of horizontal transfers. And the “junk DNA” is maybe not junk after all; the last ten years have shown that this “junk DNA” is essential in several regulation processes and in the defense against viruses and transposons. Even if it looks like junk, that does not mean it is junk. It just means that, so far, we could not attribute any function to that piece of DNA. Maybe it has none, maybe it has one. Throwing away some information is dangerous.
    However, the biggest problem is not really data storage: as you said, it is huge but still significantly smaller than the amount of data produced at the LHC. The problem is rather data processing, which is extremely difficult. A genome is not really useful information by itself; it is just a way to store the instructions for producing proteins and RNAs, which are most of the time the useful elements. It is quite easy to predict the locations and sequences of proteins and RNAs. However, the “useful” information is: what is the shape of the RNAs/proteins, what are they doing and when, with which other proteins, RNAs, or molecules are they interacting, and what are the consequences. This “useful” information is present in the DNA sequence, but we are unable to predict it from the DNA alone. There are two steps we cannot yet compute: predicting the shape of a protein/RNA from its sequence, and predicting the function from the shape. We have to do that experimentally, which can require many people working for years on a single gene, and we have about 30,000 of them.
    So here is the problem: genetic data now accumulate quite fast, but we cannot do this analysis in silico, and the experiments are very slow. There are probably some very interesting genes in penguins, in Amazonian mushrooms, or in some unnamed bacterium swimming in the Indian Ocean, but we simply cannot process them. Of course, most penguin genes have a counterpart in humans, or at least in other birds, but you would expect some funny genes in an animal living in such a harsh environment.

  • Stephen Brooks

    Perhaps one of the few data compression methods that might work is “deduplication” across genomes from different humans or mammals. That is, identify common portions between individuals and only store them once. I often hear that “we share 90%+ of our genome with animal X”.

    There are a few distributed computing projects around that try to predict protein folding, I think.
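
A minimal sketch of the deduplication idea raised above: store one reference sequence in full and, for each individual, only the positions where it differs. (This toy is my own, loosely in the spirit of how real variant-storage formats work; it is not an actual genomics tool, and the sequences are made up.)

```python
def diff_against_reference(reference, genome):
    """Store only the (position, base) pairs where this genome
    differs from the shared reference sequence."""
    return [(i, b) for i, (r, b) in enumerate(zip(reference, genome)) if r != b]

def reconstruct(reference, diffs):
    """Rebuild the full genome from the reference plus its diff."""
    seq = list(reference)
    for i, b in diffs:
        seq[i] = b
    return "".join(seq)

ref    = "ACGTACGTAC"   # shared reference, stored once
person = "ACGTTCGTAA"   # an individual genome
d = diff_against_reference(ref, person)
print(d)                # [(4, 'T'), (9, 'A')] — two bases instead of ten
assert reconstruct(ref, d) == person
```

Since individual genomes differ from the reference in only a tiny fraction of positions, the per-individual storage shrinks dramatically while remaining losslessly reconstructible.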

  • Xezlec

    To the author: I think you may have gotten “genomes can be sequenced in much greater quantities than they can be stored” from the article, but I think that’s a little wrong. There’s really no storage difficulty. A dozen human genomes can fit on a modern USB thumb drive, and it wouldn’t be hard to give everyone a thumb drive containing their DNA. So, throwing away some pieces of the data does not really make sense, IMHO. We can keep it all and analyze it at our leisure. If all you’re saying is that researchers should prioritize and focus on collecting and analyzing the most important genes first, well… duh.

  • Ken Bloom

    Xezlec: Indeed, I’m just taking what I read from the article. I can surely believe that a small number of human genomes is a very manageable amount of data, but the researchers in the article seemed to be trying to deal with much more than that. They’re the ones who are saying that they can’t keep up with it!

  • It is the difference between “casting your nets” wisely or blindly. The wise fisherman knows where to cast. I have been casting my nets over DNA sequence for some decades – hopefully wisely. The results are contained in my books, which may be accessed by way of my homepage. My YouTube videos (see attached website) tell the story quite simply, even for those who know no more than they do about soccer or wide-area networking, whatever that means!

  • Kipp

    With the discovery of miRNAs, RNAi, etc., it’s pretty clear that a lot of what we used to think was “junk DNA” actually isn’t. Only a small fraction of the eukaryotic genome codes for RNA that eventually codes for proteins, sure – and we used to think this was the only important part. The difference is that genomics and systems biology are looking at things that involve higher-order complexity, which means that it’s perilous to disregard any of this data at all.

    As a previous commenter said, the problem isn’t even so much with data as it is with computation: these are processes which are exponentially more intensive to simulate than things in high energy physics. Look at how far protein folding, which is one of the more solvable biological computational issues, still has to go despite efforts over the last 15 years.