Saturday, September 08, 2012

ENCODE: Data, Junk and Hype

This week saw the publication of dozens of papers in Nature, Science and Genome Research that report an initial analysis of data from the Encyclopedia of DNA Elements (ENCODE) project on RNA, transcription initiation, transcription factor association, chromatin structure and histone modification.  The scale of this data is staggering, and it will change how human molecular genetics is done.  Imagine how the field of climatology would be changed if they suddenly had hundreds of years of complete weather data from thousands of weather stations.  This is comparable.
ENCODE data, visualized with the UCSC genome browser.
What ENCODE does not do is fundamentally change our view of what the genome looks like.

The third and fourth sentences of the main article in Nature are these:
These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation.
This "result" has been emphasized in the popular press.


Hype: This lead article in Thursday's copy of the Washington Post Express (a publication of the Washington Post distributed on DC's Metro) is typical of how the story was covered. 
In particular, the conclusion that this study "overturns theory of 'junk DNA' in the genome," which was the title of the article in The Guardian and which was echoed by many who should know better (e.g. Science), is, well, junk. What the ENCODE project has done is locate the sites on human DNA that are represented in RNA, and the sites at which numerous factors bind.  The fact that 80% of the genome has some biochemical "function" of this sort does not mean that 80% of the genome has some effect on gene expression (although these data will help us immensely in the task of figuring out which noncoding nucleotides do indeed affect gene expression).  We can still be quite sure that most of that 80% does not have any biological function in the usual sense of the word, which is that if you delete it or alter it, something that matters biologically or medically will change.  We still know that most of the millions of single nucleotide polymorphisms that distinguish any two copies of the genome don't matter very much.  It is simply not the case that the vast majority of the human genome has some (biological) functional importance.

Conversely, we have known for a long time that a lot of noncoding DNA does have a function.  Most of the sequence that does matter is not coding.  One measure of that is conservation, and the earliest complete mammalian genomes, in 2005, showed that about 5.3% is conserved among mammals (vs. only about 1% that is coding).  A direct attempt to use ENCODE (and 1000 genomes) data to estimate the fraction of the genome under purifying selection (Ward and Kellis, this week) finds "an additional 4% of the human genome subject to lineage-specific constraints."  While this is a big increase in the estimated fraction of the genome subject to purifying selection, the total is still only about 10%, leaving 90% as neutral.

We have also known for a long time that most RNA transcripts do not result in cytoplasmic messenger RNAs (Salditt-Georgieff and Darnell JE Jr. published a paper in 1981 with the title "Further evidence that the majority of primary nuclear RNA transcripts in mammalian cells do not contribute to mRNA."), and specific transcripts in noncoding regions were described by the end of the 1980s.

The science blogosphere has been aflame for the last two days as scientists attempt to debunk this hype.  Those bloggers (many of whom are authors on the ENCODE papers) have provided excellent summaries of the issues surrounding the notion of junk DNA.  I have bookmarked several on delicious (tag: ongenetics/ENCODE) and some (mostly the same ones) are listed below.

To my mind, the biggest problem is that what is not news (that not all noncoding DNA is junk) has been allowed to eclipse what is news (that we have a vast trove of data that allows us to assess possible functions for all nucleotides).

Links:
http://genome.ucsc.edu/ENCODE/
The gateway to ENCODE data (through the UC Santa Cruz genome browser)

http://www.genome.gov/10005107
The ENCODE project web site.

http://www.nature.com/encode/
This is Nature's gateway to the literature.  It's a little (OK, a lot) gimmicky, so you probably want to just visit the tables of contents: Nature, Science, Genome Research.

The Finch and the Pea: ENCODE Media Fail
This blog post by Mike White is a survey of media hype documenting numerous errors resulting from the hype (or misplaced focus).

Encode (2012) vs. Comings (1972)
This blog post by T. Ryan Gregory presents a serious review of the concept of "junk DNA."

ENCODE: My [Ewan Birney's] Own Thoughts
Ewan Birney on his own blog.

A Neutral Theory of Molecular Function
This blog post by Michael Eisen "wrestles" with the idea of junk DNA.
I want to end by pointing out that there are lots of people (me and my group included) who have already been wrestling with this issue, with lots of interesting ideas and results already out there. From an intellectual standpoint I’d like to particularly point out the influence the writings of Mike Lynch have had on me – see especially this.
ENCODE: The Rough Guide to the Human Genome
Ed Yong's post (at Discover Magazine), has been revised in the last day or so to be more cautious about the hype.

Cryptogenomicon: ENCODE says what?
This post by Sean Eddy makes the points that "The human genome has a lot of junk DNA," that "Noncoding DNA is part junk, part regulatory, part unknown," that "ENCODE’s definition of 'functional' includes junk" and that "Evolution works on junk."  His post has dozens of comments, mostly from experts in the field.

Finally, a few screen shots from Twitter in the last few days:
Reaction to ENCODE media hype on Twitter ranged from blind propagation to harsh criticism.

Saturday, February 19, 2011

Genetic Genealogy and the Single Segment

Last year, my wife Janet and I sent our DNA off to 23andMe for analysis. Among the tools that they provide is a "Relative Finder," which lists other people on the site who share regions of DNA that appear to be identical by descent. In my case, there are 476 people listed, each sharing between 0.07% and 0.46% of my genome, almost always as a single segment (there are 18 people with whom I share two segments). These people are generally anonymous, but you have an opportunity to make contact and invite them to "share genomes," which means only that you can see which regions are shared.

There are a lot of people on 23andMe who are quite interested in this tool, and who use it for genetic genealogy. Many of these same people also use Family Tree DNA and ancestry.com. As a result of my interactions with these 23andMe relatives, and following the discussions on the 23andMe community forums, I have been thinking about, and researching, what it means to share one segment of DNA by descent with someone. In the process, I have realized some things that are not fully appreciated by most of the genealogy buffs on 23andMe.

I am presenting these insights here, and will consider them one at a time.
  • Distant relatives often share no genetic material at all.
  • It is possible to share a segment with very distant relatives.
  • Sometimes, more distant relationships are more likely.
  • Most of your relatives may be descended from a small fraction of your ancestors.
Distant relatives (fourth cousins and beyond) often share no genetic material.
The chance of not sharing any DNA at all becomes appreciable with fourth cousins and rises to approximately half with fifth cousins. This is based on my own simplified calculations and those of Donnelly (1983), who opines that "proof of descent from William Shakespeare does little to increase the probability that the claimant has genes in common with him." There are limits to what can be accomplished by genetic genealogy that are imposed by the real chance that you simply do not share any DNA at all with distant relatives. The more distant the relationship, the more likely it is that no DNA is shared.

On the other hand, you have to inherit your DNA from somebody, so there are some blocks of identity by descent that have been transmitted many generations.

It is possible to share a segment with very distant relatives.
"The probability that fourth cousins share at least one IBD [identical by descent] segment is 77%, and the expected length of this segment is 10 cM." Now consider the next step. There is a 50% chance that that one shared segment will not be transmitted at all, but a 90% chance that if it is transmitted it will be just as big as it was (the same 10 cM.). What this means for genealogy on 23andMe is that for two people sharing one segment identical by descent there is no way to reliably estimate how far back the common ancestor was. Furthermore, no improvement in software can possibly change that, because the limitation is imposed by the genetics itself.

No matter how far back you go, every nucleotide of one's genome is derived from some ancestor, and even going back 20 generations, the chance that the bit which has been inherited is part of a block 5 cM. or greater is still appreciable. In fact, even for 19th cousins, there is a real chance (13%) that any segment of DNA they have inherited in common will be 5 cM. or greater. This number is based on the term (1 - P(rec))^n, where P(rec) is the probability that the segment will be broken up by recombination in a single transmission (size/100, where size is in cM., so that 1 - P(rec) = 1 - size/100) and n is the number of transmissions. For 19th cousins sharing a single ancestor, n is 40 (2 x 19 + 2), which gives 0.95^40, or about 0.13.
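
As a quick check on that 13% figure, here is a minimal Python sketch of the term described above. The function names are made up for illustration, and the rule that a segment of a given size in cM. has a size/100 chance of being broken per transmission is the same simplification used in the text, not a full recombination model.

```python
def prob_unbroken(size_cm: float, transmissions: int) -> float:
    """Chance that a segment survives every transmission without being broken.

    Assumes, as in the text, P(rec) = size_cm / 100 per transmission.
    """
    p_rec = size_cm / 100.0
    return (1.0 - p_rec) ** transmissions


def transmissions_between_cousins(degree: int) -> int:
    """Transmissions separating two nth cousins through one shared ancestor: 2n + 2."""
    return 2 * degree + 2


if __name__ == "__main__":
    n = transmissions_between_cousins(19)   # 40 for 19th cousins
    print(n, prob_unbroken(5.0, n))          # roughly 0.13 for a 5 cM. segment
```

Note that this is only the chance that a transmitted segment stays intact; it says nothing about whether the segment is transmitted at all.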

Of course, as mentioned above, there is very little chance that two 19th cousins will share any IBD segments at all, but this is offset if one has many 19th cousins, which is often the case.

Sometimes, more distant relationships are more likely.
23andMe reports a "predicted relationship" (e.g. "4th cousin") and a "relationship range" (e.g. "3rd to 7th cousin"). However, these ranges are likely to be wildly inaccurate, because the likely distance to a common ancestor, given only the information that two people share a single IBD segment, can vary enormously, based largely on how many relatives one has.

Here is my estimate of these values. You can skip this paragraph if you're not interested in the details.
The probability that a segment, if transmitted, will not be broken up by recombination is 1 minus the probability of recombination, which is 5% for a 5 cM. segment, 10% for a 10 cM. segment and so on. (If you are moving up a pedigree, this is the probability the segment was transmitted rather than created by recombination, but the value is the same.)
The probability that a segment is transmitted at all is one-half per generation.
Thus, for an nth cousin sharing a single ancestor, the probability is ((1-P(rec))/2)^(2n+2).
For an nth cousin sharing two ancestors (the usual case), the probability is
2(((1-P(rec))/2)^(2n+2)). For example, the probability of two 4th cousins sharing a specific 5 cM. segment is 2 × (0.95/2)^10 = 0.00117. If one has more than 855 4th cousins, then the expected number of 4th cousins sharing this segment will be greater than 1. Because every 4th cousin has the same chance of inheriting the segment, the expected number of 4th cousins who do share the segment will be directly proportional to the number of 4th cousins one has. In the case of 5th cousins, the probability of sharing a specific segment is 2 × (0.95/2)^12 = 0.00026, which would require 3,790 cousins for the expected number sharing the segment to exceed 1.0. In general, the number of cousins of a specific degree who should be expected to share a segment is given by

2(((1-P(rec))/2)^(2n+2)) × N

where N is the number of relatives of that degree. For a 5 cM. segment, if the number of cousins of degree n+1 that you have is 4.43 times the number of cousins of degree n that you have, then you expect more cousins of degree n+1 than cousins of degree n to share the segment. For a 10 cM. segment, this ratio is 4.94.

[Figure: world population growth]
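
The arithmetic in this paragraph can be reproduced with a short Python sketch. The function names are hypothetical, and the model is just the simplified one given above (one-half chance of transmission and 1 - P(rec) chance of staying intact per generation, with P(rec) = size/100).

```python
def p_share(size_cm: float, degree: int, shared_ancestors: int = 2) -> float:
    """Probability that two nth cousins both carry one specific segment.

    Per transmission the segment must be passed on (1/2) and stay intact
    (1 - size_cm/100); nth cousins are separated by 2n + 2 transmissions.
    """
    per_transmission = (1.0 - size_cm / 100.0) / 2.0
    return shared_ancestors * per_transmission ** (2 * degree + 2)


def cousins_needed(size_cm: float, degree: int) -> float:
    """Number of nth cousins at which the expected number sharing reaches 1."""
    return 1.0 / p_share(size_cm, degree)


def breakeven_ratio(size_cm: float) -> float:
    """Factor by which degree n+1 cousins must outnumber degree n cousins."""
    return p_share(size_cm, 4) / p_share(size_cm, 5)  # same for any adjacent degrees


if __name__ == "__main__":
    print(p_share(5, 4), cousins_needed(5, 4))   # about 0.00117 and about 855
    print(p_share(5, 5), cousins_needed(5, 5))   # about 0.00026 and about 3,790
    print(breakeven_ratio(5), breakeven_ratio(10))  # about 4.43 and about 4.94
```

The break-even ratio depends only on segment size because each extra cousin degree adds exactly two transmissions.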

Thus, if you have many more distant cousins, as would be expected if your ancestors had large families, then someone who shares a single IBD segment is more likely to be a distant cousin, because you have so many more distant cousins. The point where the increase in the number of cousins outweighs the loss of shared segments is five children per family. This is not extremely uncommon.

As an alternative to the math, consider the case of my (hypothetical) great-great-great-grandfather Joe. Let’s say that I have inherited a 5 cM. segment of DNA from him. (It’s likely that I have inherited at least one segment from him.) Our concern is whether a distant relative that shares this segment is more likely to be a fourth cousin also descended from Joe or a fifth cousin descended from Joe’s father Jacob. The chance that the 5 cM. segment was inherited by Joe, from Jacob, is slightly less than half (because of the possibility of recombination in that generation). Jacob had 12 children, so I can expect to have 12 times as many fifth cousins descended from Jacob as fourth cousins descended from Joe. That fact ends up being more significant than the chance of recombination, so I will share the segment in question with more fifth cousins than fourth cousins. This same logic applies to fifth vs. sixth cousins and so on.

Thus, my 23andMe relatives sharing one IBD segment might be fourth cousins, as predicted, or they might be distant cousins connected by prolific ancestors. There is no way to know.

The world population has increased perhaps 20-fold in the last millennium, but that works out to significantly less growth than the sustained doubling required to predict distant ancestry for people who share one IBD segment. Nevertheless, there are well-documented cases of rapid demographic expansion.
Most of your relatives may be descended from a small fraction of your ancestors.
Given that family size varies a great deal, it is no doubt common to have some ancestors who have left many more descendants than others. We all have 64 great-great-great-great-grandparents, typically in 32 couples. If one family among the 32 had five children and their descendants did as well, while the other families reproduced at replacement rates (two children per family), then your more prolific ancestors (the parents of just one of your 32 great-great-great-grandparents) would account for over 3/4 of your fifth cousins.
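
One way to see how lopsided this can get is the toy model below, written in Python. It is only a sketch under assumptions that are not spelled out in the text: every family in a lineage has exactly the same number of children, nobody is related through more than one couple, and nth cousins through a given ancestral couple are counted as (k - 1) x k^n, where k is the family size in that lineage. The function names are hypothetical.

```python
def cousins_via_couple(children_per_family: int, degree: int) -> int:
    """nth cousins contributed by one ancestral couple (toy model).

    The couple has k children, one of whom is your ancestor; each of the
    other k - 1 founds a line that multiplies by k for `degree` generations.
    """
    k = children_per_family
    return (k - 1) * k ** degree


def prolific_share(degree: int, prolific_k: int = 5, other_k: int = 2) -> float:
    """Fraction of nth cousins who descend from the single prolific couple."""
    couples = 2 ** degree  # ancestral couples shared with nth cousins
    prolific = cousins_via_couple(prolific_k, degree)
    others = (couples - 1) * cousins_via_couple(other_k, degree)
    return prolific / (prolific + others)


if __name__ == "__main__":
    for degree in (4, 5):
        print(degree, prolific_share(degree))
```

Under these assumptions the single prolific couple accounts for well over three quarters of the cousins at either the fourth or the fifth degree.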

In summary, it is impossible to know the relationship one has to relatives who are discovered by virtue of their sharing a single autosomal segment of DNA. The "predicted relationship" is uncertain, and even the range is hard to be sure of. The extensive information provided by 23andMe is a very useful tool for genealogy, but it cannot tell you about relatives with whom you do not share any genetic material by descent. On the other hand, relatives with whom you do share genetic material by descent can be quite distant.

Sunday, August 01, 2010

Defending science blogs

Although I am not on ScienceBlogs, I am a science blogger, so Virginia Heffernan's article on science bloggers in today's New York Times Magazine ("Unnatural Science: The uses and abuses of science blogging") got my attention. Her position that science blogs are given to "trivia, name-calling, saber rattling" and "gratuitous contempt" compelled me to reply.

The frequency with which I update my blogs is probably best described by a professional journalist as "never," but I do take blogging somewhat seriously, and I try to be professional about it. My affiliation is on the side bar, and I have read (and re-read parts of) such books as "Am I Making Myself Clear?: A Scientist's Guide to Talking to the Public," by Ms. Heffernan's more temperate colleague, Cornelia Dean.

The article starts out with an appeal to deconstructionism:

Deconstructing science is a fool’s game. In the ’90s, literary critics used to try. They’d argue that science is a system of metaphors, complete with a style and an ideology, rather than the royal road to the truth. They were laughed at as cultural relativists, posers high on Gauloises and nut jobs who didn’t believe in gravity.

Although amusing and partly true, this is a misrepresentation. Science does have a style and an ideology and some of us acknowledge that. In fact, my own reading of science is informed by an awareness of the differing styles and ideologies that dominate different fields and traditions within science, an awareness that has been made more acute by my own personal exposure (primarily through marriage) to literary criticism, postmodernism and social science. What scientists object to is the notion that science is nothing but a system of metaphors. Scientists uniformly believe that there are truths about nature that exist quite apart from ourselves, and that science provides a tool for learning those truths. I will also admit that some of us think that, within academia, posers and nut jobs have a much easier time succeeding in fields outside of science.

Last month ... 20 or so high-placed science bloggers angrily parted ways with an extremely popular and award-winning online collective called ScienceBlogs because it starting running Food Frontiers, a nutrition blog that PepsiCo paid to have on the site.

I missed this. What can I say? I don't find enough time to blog, or even to read other blogs, although I keep thinking I should start doing it more.

ScienceBlogs has become preoccupied with trivia, name-calling and saber rattling. Maybe that’s why the ScienceBlogs ship started to sink.

...

does everyone take for granted now that science sites are where graduate students, researchers, doctors and the “skeptical community” go not to interpret data or review experiments but to chip off one-liners, promote their books and jeer at smokers, fat people and churchgoers?

Perhaps, but the ones I read this morning (those on genetics, including personal genetics) have "interesting stuff." Some of it is a bit pedantic and perhaps not that interesting to the general public, but most of the posts I looked at stuck to the science or discussed policy, and those that discussed policy were perfectly civil.

By the way, I'd recommend "Genomes Unzipped" to readers interested in a diversity of opinion about the week's events surrounding regulation of personal genetics services. Genomes Unzipped is "a group blog providing expert, independent commentary on the personal genomics industry." It is not part of ScienceBlogs, but some individual bloggers post to both.

Under cover of intellectual rigor, the science bloggers — or many of the most visible ones, anyway — prosecute agendas so charged with bigotry that it doesn’t take a pun-happy French critic or a rapier-witted Cambridge atheist to call this whole ScienceBlogs enterprise what it is, or has become: class-war claptrap.

Is she jeering?

Science blogs (including those on ScienceBlogs) are a mixed bag, just like most of the internet, and the New York Times. Readers have to exercise judgment.

Finally, there is a sidebar with recommendations, which I have to applaud.
[Update: Actually, it was a mistake to applaud this. See comments.]

SEMPER SCI
For science that’s accessible but credible, steer clear of polarizing hatefests like atheist or eco-apocalypse blogs. Instead, check out scientificamerican.com, discovermagazine.com and Anthony Watts’s blog, Watts Up With That?

SCIASPORA
David Dobbs, who quit ScienceBlogs, has written well about the consequences of “unbundling” the ScienceBlogs bloggers. See his blog at its new location at neuronculture.com.

(SCI)ENCE
Stanford’s Presidential Lectures in the Humanities are archived — and helpfully linked — at prelectur.stanford.edu. Don’t miss Jacques Derrida’s from the spring of 1999. You will think. You finally almost know. What deconstruction. Is.

Saturday, May 29, 2010

Can we not speak of fish?

I would like to defend the use of paraphyletic groups in scientific discourse and literature. Paraphyletic groups can be well-defined in terms of monophyletic units (as relative complements), and defining paraphyletic groups in terms of monophyletic groups is preferable to treating them as invalid.

Let me start with a story. Wednesday evening (May 26th) I checked my Twitter feed, and saw a number of tweets from Jonathan Eisen (phylogenomics), who was at the ASM meeting.
phylogenomics on Twitter
Jonathan is in the department of Ecology and Evolution at UC Davis, the author of a popular textbook on Evolution and a frequent blogger ("Tree of Life"). For those of you not used to reading Twitter feeds, note that the most recent tweets are at the top.

Norm Pace bangs on prokaryote 1
I know both Norm Pace and Jonathan Eisen. Thanks to Norm's personal style and Jonathan's excellent selection of quotes, reading this was like being in the room with Norm. I love hearing him talk. However, I do not entirely agree with him. I have spent my life studying gene expression in eukaryotes, and my perspective is that the differences between eukaryotes and other species ("prokaryotes") are fundamental. In prokaryotes, coupled transcription and translation (which is impossible when there is a nucleus) allows the widespread use of polycistronic mRNAs, which allow operons, which in turn contribute to many important features, including the ease with which biologically useful bits of genetic information can be horizontally transferred. The argument, repeated here by Norm Pace, that "no one can say what a prokaryote is, only what it is not" was addressed by Martin and Koonin, who proposed a "positive definition of prokaryotes" based on coupled transcription and translation. This, however, is not the point. The point is that the nucleus is a derived feature and prokaryotes are a paraphyletic group, meaning that the last common ancestor of all prokaryotes has descendants that are not prokaryotes. Nevertheless, the group is well-defined (as all life other than eukaryotes) and useful, so I commented:

Prokaryotes are a well-defined group.
A bit later, I commented again.
Jonathan's not buying it.
Prokaryotes are a paraphyletic group. That means that the last common ancestor of all prokaryotes has eukaryotic descendants. Most taxonomists today prefer not to talk about paraphyletic groups at all, but to speak only of monophyletic groups, or clades (which consist entirely of species with a common ancestor). However, there are many paraphyletic groups that "make sense" and are commonly used. Examples include prokaryotes, fish, reptiles and dicots.

My point is that defining a paraphyletic group as the relative complement of one clade with respect to another makes it well-defined, and such a definition more closely suits what people have in mind.

Defining a paraphyletic group P as the complement of one monophyletic group, C, with respect to another, G
In the hypothetical example shown here, most taxonomists would want to list "natural taxa" (by which they would mean monophyletic groups, or clades), and would say something like "Q, R and S are slithy." To say "G other than C are slithy" is more compact because it makes reference to fewer taxa. To say "P are slithy" is exactly the same, and is the most compact way of making the statement, but requires reference to a paraphyletic group.
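
The same idea can be written as a set difference. The Python sketch below is only an illustration: the tree topology (Q, (R, (S, C))) is an assumption made for the example, since the only thing stated above is that P consists of Q, R and S, and the helper names are made up.

```python
def leaves(tree):
    """Set of taxa at the tips of a tree written as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Every monophyletic group in the tree: the leaf set under each node."""
    found = {leaves(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clades(child)
    return found


# Assumed topology: C is nested inside G, alongside Q, R and S.
tree = ("Q", ("R", ("S", "C")))

G = leaves(tree)           # the larger monophyletic group
C = frozenset(["C"])       # the clade defined by the innovation
P = G - C                  # relative complement: "G other than C"

if __name__ == "__main__":
    print(sorted(P))                 # the taxa Q, R and S
    print(G in clades(tree))         # G is a clade
    print(C in clades(tree))         # C is a clade
    print(P in clades(tree))         # P is not a clade, yet it is fully specified by G and C
```

The point of the sketch is that P needs no list of members and no appeal to appearance: naming two clades is enough to define it.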

To pursue this further, I asked my colleagues what they thought:
My dear friends in systematics,

I have a question about systematics that I would like your opinion on. It seems a sufficiently central question that I suspect you have already formed an opinion. The issue is a practical one, regarding how biologists should use terms. It is also philosophical (but in the rigorous sense, relating to the idea that without a proper philosophical basis one cannot do science at all).

Consider a monophyletic group of organisms, G, and another phylogenetic group within it, C (for clade). Let us suppose that C is characterized by some fundamental innovation, such that organisms within this clade have a long list of features not found in the other species within G. Furthermore, species within G but not C share a long list of features that have been lost by all species in C. As a result, there is a need to talk about another grouping, W (for wrong), of those species within G but not C. There is no doubt about the phylogeny. C and G are monophyletic but W is not. Molecules and morphology agree. However, all species within W share many features lacking in all species within C, and this is true both morphologically and molecularly.

Is it ever right for a scientist to talk about W as a group?
You know the list (reptiles, fish, dicots, prokaryotes).

Back story.
This came up last night as an argument between Jonathan Eisen and myself, on Twitter. You can see most of it by looking at feeds for
phylogenomics, ongenetics and smount, but given the volatile and perspective-based nature of Twitter feeds I've pasted the relevant tweets into the attached word document (it reads from most recent to earliest so you might want to start at the bottom and work up). Jonathan is at the ASM meeting. He is a Twitter addict who has generated over 4500 tweets in the last year or so (a day with only 10 would be unusual for him). I find it useful and interesting to follow him. I am both ongenetics and smount (I didn't mean to switch but I changed computers and forgot to switch).

I find it useful to refer to prokaryotes (and to fish). Jonathan says "grouping together bacteria/archaea is inappropriate; I note in my evolution textbook we use "bacteria & archaea" a lot". Wouldn't it be simpler if he just used "prokaryote."? I'm looking for advice here.

Thanks,

Steve
I received a thoughtful reply from Chuck Delwiche:
Well, I'm basically with Jonathan on this, although I think I'm slightly more moderate. "Fish," "prokaryotes," "reptiles," "dicots," etc. are really form-classes -- they describe the appearance of the organism, but not its evolutionary relationships. Naming paraphyletic groups is somewhat less objectionable than naming grossly polyphyletic ones, so I don't object to naming the North American Drosophila in a way that ignores the Hawaiian species that are derived from within it (this [is] an example of your C/G case). But it really is confusing to refer to prokaryotes. Although they have coupled transcription and translation, the[re] are other aspects of DNA replication, transcription, and translation that show striking similarities between Archaea and Eukarya. If you talk about "prokaryotes" as if the term represented a lineage rather than a morphology then it tends to obscure both diversity within them [and] similarities between Archaea and Eukarya.

The reason this is important is that [it] hides the predictive value that a natural classification can provide. Within your group G there would be some taxa that are more closely related to C than others, and they will share properties with C despite the long branch and loss of characters you describe. If you treat "fish" as a group it is confusing that Teleosts have immune systems that more nearly resemble those of tetrapods than do those of lampreys or hagfish. I don't know anything about lung- or lobe-finned fish immunology, but I'll bet they are even more tetrapod-like than those of Teleosts. Much the same statements could be made for skeletal structure, tooth anatomy, ventilation mechanisms, and I don't know what all else.

This is why we Must Never Speak of Fish Again.

Chuck
This is pretty much what I expected him to say, but there are two things I'd like to note. The first is this "Within your group G there would be some taxa that are more closely related to C than others, and they will share properties with C despite the long branch and loss of characters you describe. If you treat "fish" as a group it is confusing that Teleosts have immune systems that more nearly resemble those of tetrapods than do those of lampreys or hagfish." This is a very good point that anyone who refers to paraphyletic groups must bear in mind.

The second thing that struck me is that he wrote "'Fish,' 'prokaryotes,' 'reptiles,' 'dicots,' etc. are really form-classes -- they describe the appearance of the organism, but not its evolutionary relationships" despite the fact that I had provided a rigorous definition in terms of evolutionary relationships. Defining "fish" as vertebrates other than tetrapods makes it something other than a form class, and also eliminates any confusion about whales. Defining prokaryotes as organisms other than eukaryotes makes saying "prokaryotes" synonymous with saying "bacteria and archaea." It strikes me that this is a natural group in the sense that if a new domain of life were to be discovered (perhaps on Mars, or deep within the earth) with coupled transcription and translation and no nucleus or mitochondrion, people would want to group it with bacteria and archaea, even if phylogenetic analysis showed that it shared a most recent common ancestor with the eukaryotic nucleus.

In summary, I fully support the definition of taxa as monophyletic groups, and I would like to see them used to more rigorously define paraphyletic groups. Scientists will continue to refer to paraphyletic groups, and for good reasons. When they do, it would be useful if those groups were understood to be the relative complements of monophyletic taxa rather than informal categories, form classes or sloppy and unscientific categories.

I will continue to speak of fish. When I do, I will be referring to vertebrates that are not tetrapods. While I respect and understand colleagues who will never speak of fish, they must understand that this group is well-defined in terms of groups they recognize. I hope that all scientists move towards a more precise taxonomic basis for the groups that they will continue to talk about.

----------------------------------------
I thank Jonathan Eisen, Chuck Delwiche and Charlie Mitter for their contributions to this post. It goes without saying that the opinions expressed are, however, mine. The complete email thread (with Delwiche and Mitter) is available here.

----------------------------------------
Postscript (June 13).
Charlie Mitter forwarded Farris 1979 (Systematic Zoology 28:483-519), which describes the state of systematics at that time as a debate between pheneticists, phylogeneticists and evolutionists about the principles that should underlie a general reference system for biology. I believe that this debate has been fully resolved in favor of the phylogeneticists, and I am fully persuaded that the business of systematics is the definition of monophyletic groups. My points here are that 1) biologists sometimes have good reasons to refer to paraphyletic groups and 2) when they do, it is better, where possible, to understand those groups in terms of monophyletic groups. It is precisely because I agree with the arguments of Farris in favor of phylogenetics that I think that the paraphyletic groups to which scientists will inevitably refer should be defined in terms of phylogenetic taxa (clades) and not thought of as elemental taxonomic units.

Monday, January 12, 2009

Does the rs7901695 C variant predispose to diabetes by creating a cryptic exon?

The discovery of disease-SNP associations through genome-wide association studies continues at a remarkable pace, but a recent review of common variants implicated in type 2 diabetes (T2D) suggests that, at least for this disease, current methods are unlikely to find many additional susceptibility loci (Prokopenko et al. 2008). We are now at the stage where "additional investigation is needed to define the causal variants, ... to understand disease mechanisms and to effect clinical translation." I continue to be interested in the (often underappreciated) contribution of pre-mRNA splicing to variation in gene activity, and I was especially intrigued by the statement that the variant with greatest effect size (the rs7901695 C variant in TCF7L2) lies in an intron but its mechanism of action is not understood. In order to investigate this I submitted the sequence surrounding this variant to SplicePort, our splice site predictor and analysis tool (see Dogan et al. 2007). Sure enough, the rs7901695 C variant alters the predicted strength of nearby splice sites. Because this sequence is deep in an intron, these would be potential cryptic splice sites, sites at which splicing occurs only in the case of a mutation.

SplicePort analysis of rs7901695
Most striking is the activation of a splice acceptor site 68 nucleotides upstream of the variant SNP (position 688 in the submitted sequence or 114,744,012 on chromosome 10). The SplicePort score, which is -0.41 for the T allele, but -0.02 for the C allele, can be understood by noting that while 95.66% of splice acceptors have a score greater than -0.41, 89.01% of splice acceptors score above -0.02. Thus, the C allele acceptor site, although still relatively weak, is clearly better and well within the range of variation for real splice sites (99% of acceptors score above -0.86 and the median score is 0.923).
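The percentages above can be turned into percentile ranks with simple arithmetic. The following is a minimal sketch, not SplicePort's own code: the only inputs are the two "percent of acceptors scoring above" figures quoted in the text, and the function name `percentile_rank` is a hypothetical helper introduced for illustration.

```python
# Percent of real splice acceptors that score ABOVE each allele's
# SplicePort score, as quoted in the post.
pct_above = {"T": 95.66, "C": 89.01}


def percentile_rank(pct_scoring_above):
    """Percent of real acceptors scoring at or below the given site.

    Hypothetical helper: simply the complement of the quoted figure.
    """
    return round(100.0 - pct_scoring_above, 2)


ranks = {allele: percentile_rank(p) for allele, p in pct_above.items()}

# How many times larger the C allele's percentile rank is than the T allele's.
fold_change = ranks["C"] / ranks["T"]

for allele, rank in ranks.items():
    print(f"{allele} allele: outscores about {rank}% of real acceptors")
print(f"C/T ratio of percentile ranks: {fold_change:.2f}")
```

In other words, the T allele site outscores roughly 4% of real acceptors, while the C allele site outscores roughly 11%: still weak, but a clear move into the range of sites that are actually used.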

How might a C to T change affect an acceptor splice site upstream? SplicePort provides a feature browser that lists the features used for scoring any site. In this case, the following "downstream features" contribute to the score of the C variant but not the T variant: cgg (0.112), ctac (0.083), cg (0.072), ctacg (0.06), tacg (0.059), acg (0.043) and acggg (0.035). An independent approach, ESEfinder, similarly identifies this sequence context (CTACGGG but not CTATGGG) as an exonic splicing enhancer potentially recognized by ASF/SF2, SRp40 or SRp55. Thus, the rs7901695 C variant might activate the upstream acceptor site by functioning as part of an exonic splicing enhancer that is activated by one or more SR proteins.
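The feature list above can be checked by hand. Here is a short sketch that confirms each listed k-mer occurs in the C-allele context (CTACGGG) but not in the T-allele context (CTATGGG), and totals the listed weights. It is a simplification under stated assumptions: real SplicePort features are position-specific, whereas this sketch only tests for presence of each k-mer within the seven-nucleotide context, and the variable names are hypothetical.

```python
# Feature weights quoted in the post for the C variant.
features = {
    "cgg": 0.112,
    "ctac": 0.083,
    "cg": 0.072,
    "ctacg": 0.06,
    "tacg": 0.059,
    "acg": 0.043,
    "acggg": 0.035,
}

# Sequence context around the SNP for each allele, as given in the post.
c_context = "ctacggg"
t_context = "ctatggg"

# Assumption: simple substring presence stands in for SplicePort's
# position-specific feature matching.
c_only = [k for k in features if k in c_context and k not in t_context]

# Total weight of the listed C-specific features.
total_weight = round(sum(features[k] for k in c_only), 3)

# Observed difference between the two allele scores (-0.02 versus -0.41).
score_gain = round(-0.02 - (-0.41), 2)

print("C-specific features:", ", ".join(c_only))
print("Sum of listed weights:", total_weight)
print("Observed score gain:", score_gain)
```

All seven k-mers contain the variant C, which is why none of them survives the change to T. The summed weights and the observed score difference are of similar size but need not match exactly, since the list covers only the features gained by the C variant and not any that the T variant might contribute on its own.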

The next question is how the activation of an acceptor splice site deep within an intron would affect gene expression. A splice site that is activated by mutation is known as a cryptic splice site, and an exon that is used only in mutant alleles is referred to as a cryptic exon. Intron mutations that affect gene expression by creating cryptic exons have been known for some time. In fact, I wrote a commentary on several such mutations in the human beta-globin gene over 25 years ago while still a graduate student ("Lessons from mutant globins," Mount and Steitz 1983). Mutations that activate cryptic exons are often overlooked because they lie away from splice sites, and because the resulting RNA is often unstable due to nonsense-mediated decay. Nevertheless, there are now hundreds of papers describing such mutations. The case here is especially tricky because the SNP does not directly create a cryptic splice site, but may activate one at a distance.

cryptic exon
Thus, activation of a cryptic exon is a reasonable hypothesis for the effect of the rs7901695 variant on TCF7L2. In this model transcripts from the C allele are more likely than transcripts from the T allele to be aberrantly spliced and ultimately degraded. The lack of EST data supporting a cryptic exon in this region can be explained by nonsense-mediated decay. This proposed mechanism is similar to regulated unproductive splicing and translation ("Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans." Lewis et al. 2003), an important difference being that cryptic exons are generated by mutation rather than being regulated alternative exons.

How likely is this hypothesis? I could not find any papers that have investigated the effect of this variant on splicing. Clearly, the next step is to look for evidence of the cryptic exon and verify that the C variant does indeed introduce an exonic splicing enhancer. There is also the possibility that other SNPs associated with the risk variant haplotype ("HapBT2D"), particularly rs7903146, are more likely to be causative (Helgason et al. 2007). I could not find a direct comparison of the relative risk for these two variants, but it's possible that association data alone will rule out rs7901695, or even that they already have. Colleagues have suggested that I pursue this in my own lab, but I work on other things, and there are people in the diabetes field who can do this quickly. I only ask that they cite this post (here's how).

Although additional investigation is needed, the rs7901695 variant is certainly capable of explaining an effect on the expression of TCF7L2 through activation of a cryptic exon. This case is an example of how SplicePort can be used to evaluate the potential of variants to alter splicing. We plan to systematically evaluate the possible effect of all human SNPs on splicing. In the meantime, I strongly encourage investigators to use SplicePort to evaluate variants of interest on their own.

Saturday, November 08, 2008

Remembering C.C. Tan

I read this morning (link) that Tan Jiazhen, better known in the U.S. as C. C. Tan, passed away Nov. 1, at age 99. I suspect that his influence on genetics is much greater than most Americans appreciate. He worked with the first generation of Drosophila geneticists, and he was Dobzhansky's first Ph.D. student at Caltech, yet his career extended into the modern era, and many of the young Chinese scientists coming to the United States now have met him. It's impossible for me to evaluate how much he is responsible for the intellectual "silk road" that contributes so much vitality to twenty-first-century genetics, but I suspect that without C. C. Tan it would be much less traveled. Interested readers should consult Jim Crow's commentary in Genetics (Vol. 164, pg. 1 *) to see how he managed to bring Chinese genetics into the modern era, past the Lysenko years and the Cultural Revolution.

* This page, like most at genetics.org, does not load properly in Firefox on Windows. I'm sure that the GSA will fix that. For now, I just use another browser when I visit the GSA.