Thursday, June 30, 2005

Things that are not exons

I have thought for many years that the genomics community needs a term other than 'exon' for coding segments. This post points out how lacking such a name has led to misuse of the word 'exon'. I also suggest that the word 'croe' be used instead, but my primary purpose is to call attention to the need for new names. I would be happy to have other names used properly.


This was presented at the Alternative Splicing SIG at ISMB. My presentation in PowerPoint form is available here and is posted on my web site as Posting 3. My hope is that the term be introduced into the Sequence Ontology, but I'll leave it up to my friends there to get it right.

An exon is defined as a segment of a gene that is present in the mature mRNA product of that gene. Genes for noncoding RNAs that are spliced are divided into exons and introns (examples include tRNAs and rRNAs, as well as a variety of noncoding RNA polymerase II transcripts). In addition, every spliced mRNA has at least two exons that are partly noncoding, containing the 5' UTR and the 3' UTR. However, the need to refer to isolated coding segments, which are often complete exons but are sometimes only part of an exon, has led many people to use the term 'exon' inappropriately, and this has created confusion. In one extreme case, a published paper presents an "exon size distribution" that includes many coding segments that are only part of an exon. There are many other examples.

Some people are careful to get it right, and many of them use the term CDS to refer to these coding segments. For example, Michael Zhang, in his excellent 2002 review of computational genefinding (PubMed), writes "To discriminate CDS from intervening sequence, the best content measures are the so-called frame-specific hexamer frequencies" and "... hexamer frequencies alone can detect most [long] CDS regions." However, CDS has shortcomings as a word. Foremost among them is its ambiguity: the very same term is also used to refer to the entire coding region of a gene. This is analogous to using the same word for exon and mRNA.

I am grateful to Myles Axton (Nature Genetics 37:15 (01 Jan 2005), "Touching Base") for introducing the readers of Nature Genetics to his term for coding segments that are less than an entire exon, which is CROE (coding region of an exon, pronounced as in "crow"). Because the term 'exon' never communicates anything about where coding information lies, it is important that the term 'croe' apply as well to coding regions that are coincident with an exon. People should be able to say "the croes of this gene" when they refer to the units that together make up a full CDS.
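The distinction between exons and croes can be made concrete with a small sketch. It assumes a hypothetical transcript described by half-open exon intervals and a single coding region (the full CDS) given by its start and end; the function name `croes` and all coordinates are invented for illustration and do not come from any annotation standard.

```python
from typing import List, Tuple

# Half-open interval [start, end) in genomic coordinates, forward strand.
Interval = Tuple[int, int]


def croes(exons: List[Interval], cds_start: int, cds_end: int) -> List[Interval]:
    """Clip each exon to the coding region [cds_start, cds_end).

    Exons that lie entirely in a UTR contribute nothing; exons that
    straddle the start or stop codon contribute only their coding part.
    """
    result = []
    for start, end in exons:
        lo = max(start, cds_start)
        hi = min(end, cds_end)
        if lo < hi:  # keep only non-empty overlaps
            result.append((lo, hi))
    return result


# Hypothetical three-exon gene: the first exon carries the 5' UTR,
# the last exon carries the 3' UTR, the middle exon is fully coding.
exons = [(0, 100), (200, 350), (500, 800)]
coding = croes(exons, cds_start=60, cds_end=562)

exon_lengths = [end - start for start, end in exons]
croe_lengths = [end - start for start, end in coding]
```

In this made-up example only the middle exon coincides with its croe, so an "exon size distribution" and a "croe size distribution" built from the same gene would differ in two of three entries.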

Alternatively spliced segments. I have a related concern: there should be a term for segments that appear as indels when two alternatively spliced mRNAs (or cDNAs) are compared. Such a segment can be a complete exon, part of an exon (lying between two alternative splice sites) or an intron, and it need not be coding. Kondrashov and Koonin refer to these various mechanisms as generating LDAS (length difference alternative splicing; 2003, Trends in Genetics 19:115-9; PubMed) but do not suggest a name for the segments themselves (other than "alternative segment" or "inserted alternative segment," terms they use repeatedly). One idea is 'asproe,' for alternatively spliced region of an exon, which has the advantage of being paired with croe. Its disadvantage is that a single insertion may consist of two or more croes (an alternatively spliced region of exons) and will often be less than an entire croe. It is a useful concept. If one has in hand cDNA or EST sequences that differ by an insertion, the mode of alternative splicing is unknown, but the alternatively spliced region is clear, even when genomic sequence is not available.

Finally, there could be two terms here: one to refer to the alternative segment at the nucleotide level and another to refer to the alternative segment at the protein level. These need not correspond; an interesting case is where the length of the segment is not a multiple of 3 nucleotides, so that the coding of downstream regions is affected. A classical case, found in the first complete eukaryotic genome sequence (SV40), comes from the small t antigen, in which overlapping reading frames are created by alternative splicing.
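As a rough illustration of the point that the alternative segment is identifiable from cDNA sequences alone, here is a minimal sketch. It assumes two sequences that differ by exactly one contiguous insertion; the function name `alternative_segment` and the toy sequences are hypothetical. When the bases flanking the insertion repeat, the exact boundaries are ambiguous, and this sketch simply reports the leftmost-prefix placement.

```python
from typing import Optional, Tuple


def alternative_segment(a: str, b: str) -> Optional[Tuple[int, str]]:
    """Return (offset, segment) for the single insertion separating a and b.

    Returns None if the two sequences do not differ by exactly one
    contiguous insertion (for example, if they also contain mismatches).
    """
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)

    # Length of the shared prefix.
    p = 0
    while p < len(shorter) and shorter[p] == longer[p]:
        p += 1

    # Length of the shared suffix, not allowed to overlap the prefix.
    s = 0
    while s < len(shorter) - p and shorter[-1 - s] == longer[-1 - s]:
        s += 1

    if p + s != len(shorter) or len(longer) == len(shorter):
        return None
    return p, longer[p:len(longer) - s]


def preserves_frame(segment: str) -> bool:
    """True if inserting the segment leaves the downstream frame unchanged."""
    return len(segment) % 3 == 0


# Toy sequences, invented for illustration only.
short_form = "ATGGCCTTTAAA"
in_frame_form = "ATGGCC" + "GGTACG" + "TTTAAA"
shifted_form = "ATGGCC" + "GGTA" + "TTTAAA"

hit_in_frame = alternative_segment(short_form, in_frame_form)
hit_shifted = alternative_segment(short_form, shifted_form)
```

The second toy pair is the case described above: the nucleotide-level segment is four bases long, so the protein-level difference extends downstream of the segment itself.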
