Saturday, September 08, 2012

ENCODE: Data, Junk and Hype

This week saw the publication of dozens of papers in Nature, Science and Genome Research that report an initial analysis of data from the Encyclopedia of DNA Elements (ENCODE) project on RNA, transcription initiation, transcription factor association, chromatin structure and histone modification.  The scale of these data is staggering, and they will change how human molecular genetics is done.  Imagine how climatology would change if climatologists suddenly had hundreds of years of complete weather data from thousands of weather stations.  This is comparable.
ENCODE data, visualized with the UCSC genome browser.
What ENCODE does not do is fundamentally change our view of what the genome looks like.

The third and fourth sentences of the main article in Nature are these:
These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation.
This "result" has been emphasized in the popular press.

Hype: This lead article in Thursday's copy of the Washington Post Express (a publication of the Washington Post distributed on DC's Metro) is typical of how the story was covered. 
In particular, the conclusion that this study "overturns theory of 'junk DNA' in the genome," which was the title of the article in The Guardian and which was echoed by many who should know better (e.g. Science), is, well, junk.  What the ENCODE project has done is locate the sites on human DNA that are represented in RNA, and the sites at which numerous factors bind.  The fact that 80% of the genome has some biochemical "function" of this sort does not mean that 80% of the genome has some effect on gene expression (although these data will help us immensely in the task of figuring out which noncoding nucleotides do indeed affect gene expression).  We can still be quite sure that most of that 80% does not have any biological function in the usual sense of the word, which is that if you delete it or alter it, something that matters biologically or medically will change.  We still know that most of the millions of single nucleotide polymorphisms that distinguish any two copies of the genome don't matter very much.  It is simply not the case that the vast majority of the human genome has some (biological) functional importance.

Conversely, we have known for a long time that a lot of noncoding DNA does have a function.  Most of the sequence that does matter is not coding.  One measure of that is conservation, and the earliest complete mammalian genomes, in 2005, showed that about 5.3% is conserved among mammals (vs. only about 1% that is coding).  A direct attempt to use ENCODE (and 1000 Genomes) data to estimate the fraction of the genome under purifying selection (Ward and Kellis, this week) finds "an additional 4% of the human genome subject to lineage-specific constraints."  While this is a big increase in the estimated fraction of the genome subject to purifying selection, the total is still only about 10%, leaving about 90% as neutral.
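For readers who want to check the arithmetic, here is a minimal sketch in Python using only the percentages quoted above.  It assumes the "additional 4%" does not overlap the 5.3% conserved among mammals, and that the roughly 1% that is coding falls entirely within the constrained total; the variable and function names are mine, for illustration only.

```python
# Percentages of the human genome, as quoted in the paragraph above.
CONSERVED_AMONG_MAMMALS = 5.3  # conserved among mammals (2005 estimate)
LINEAGE_SPECIFIC = 4.0         # "additional 4%" from Ward and Kellis
PROTEIN_CODING = 1.0           # approximate protein-coding fraction


def constrained_total(conserved, lineage_specific):
    """Add the two estimates, assuming (my assumption) that they do not overlap."""
    return conserved + lineage_specific


total = constrained_total(CONSERVED_AMONG_MAMMALS, LINEAGE_SPECIFIC)
remainder = 100.0 - total                     # not under detectable purifying selection
noncoding_constrained = total - PROTEIN_CODING  # assumes coding sequence is all constrained

print(f"Under purifying selection: {total:.1f}%")
print(f"Remainder (presumed neutral): {remainder:.1f}%")
print(f"Constrained but noncoding: {noncoding_constrained:.1f}%")
```

Under those assumptions, the constrained fraction comes out a little under 10%, and nearly all of it is noncoding, which is the point of the paragraph: most of what matters is not coding, and most of the genome is not under constraint.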

We have also known for a long time that most RNA transcripts do not result in cytoplasmic messenger RNAs (Salditt-Georgieff and Darnell JE Jr. published a paper in 1981 with the title "Further evidence that the majority of primary nuclear RNA transcripts in mammalian cells do not contribute to mRNA."), and specific transcripts in noncoding regions were described by the end of the 1980s.

The science blogosphere has been aflame for the last two days as scientists attempt to debunk this hype.  Those bloggers (many of whom are authors on the ENCODE papers) have provided excellent summaries of the issues surrounding the notion of junk DNA.  I have bookmarked several on delicious (tag: ongenetics/ENCODE) and some (mostly the same ones) are listed below.

To my mind, the biggest problem is that what is not news (that not all noncoding DNA is junk) has been allowed to eclipse what is news (that we have a vast trove of data that allows us to assess possible functions for all nucleotides).

The gateway to ENCODE data (through the UC Santa Cruz genome browser)
The ENCODE project web site.
This is Nature's gateway to the literature.  It's a little (OK, a lot) gimmicky, so you probably want to just visit the tables of contents: Nature, Science, Genome Research.

The Finch and the Pea: ENCODE Media Fail
This blog post by Mike White is a survey of media hype documenting numerous errors resulting from the hype (or misplaced focus).

Encode (2012) vs. Comings (1972)
This blog post by T. Ryan Gregory presents a serious review of the concept of "junk DNA."

ENCODE: My [Ewan Birney's] Own Thoughts
Ewan Birney on his own blog.

A Neutral Theory of Molecular Function
This blog post by Michael Eisen "wrestles" with the idea of junk DNA.
"I want to end by pointing out that there are lots of people (me and my group included) who have already been wrestling with this issue, with lots of interesting ideas and results already out there. From an intellectual standpoint I’d like to particularly point out the influence the writings of Mike Lynch have had on me – see especially this."
ENCODE: The Rough Guide to the Human Genome
Ed Yong's post (at Discover Magazine) has been revised in the last day or so to be more cautious about the hype.

Cryptogenomicon: ENCODE says what?
This post by Sean Eddy makes the points that "The human genome has a lot of junk DNA," that "Noncoding DNA is part junk, part regulatory, part unknown," that "ENCODE’s definition of 'functional' includes junk" and that "Evolution works on junk."  His post has dozens of comments, mostly from experts in the field.

Finally, a few screen shots from Twitter in the last few days:
Reaction to ENCODE media hype on Twitter ranged from blind propagation to harsh criticism.