Friday, July 22, 2005

Parameters for using blastn with noncoding queries

If one wants to look for a conserved noncoding RNA in a new genome using the best possible tools, then one should use sophisticated structure-based methods such as Klein and Eddy's RSEARCH ( BMC Bioinformatics 4:44 , PubMed), and should consult the RNA database Rfam (Griffths-Jones et al., 2003: Rfam: an RNA family database and 2005: Rfam: annotating non-coding RNAs in complete genomes . PubMed). However, alignment tools such as blast or fasta are more readily available, so it is often expedient to use alignment when other tools would do better. If you do that, you must adjust the parameters – you will never find noncoding RNAs using the default parameters for blast. I confronted this problem at the Drosophila genome jamboree in 1999 and published the parameters I used there in the paper I wrote with Helen Salz (J. Biol. Chem.; PubMed).

Now, I've posted a discussion and "how-to" guide (Posting 4 on SteveMount.com) based on work that Chau Nguyen (a University of Maryland Computer Science and Biology double major) did with me a few years ago. These are written for use on the NCBI blastn server, but are easily adapted to running blast locally. Briefly, my advice is to use the parameters -r 5 -q -4 -G 10 -E 4 -W 7. These values not only find mammalian U6atac using plant U6atac but provide an alignment across the entire snRNA. If you don't find what you want, you may want to make adjustments based on the more thorough discussion in the posting, where I describe several parameter sets there that will correctly idenfity plant snRNA genes using animal snRNA queries.

Bear in mind two caveats: limit the size of your query and be prepared to use independent criteria for identifying correct hits. These searches require more computing resources than standard blast searches and it will generally take longer than the estimated time for your results to come back. For related reasons, you should not attempt to use these parameters for queries longer than about 500 bp. (if you are using a noncoding RNA as the query do not include nontranscribed flanking sequence in your query; you may even want to remove poorly conserved parts of the RNA itself from your query). Also, because the assumptions that go into calculating E values are violated by these parameters, the E values reported in your output will be meaningless (except as relative numbers; better matches will still have lower E values). Do not pay attention to the E values (except when comparing results obtained with the same parameter set) and do not report them. However, the lack of reliable E values is not license to believe nonsense; your results should be validated by external criteria such as secondary structure and conservation of known functional regions.

Good luck! If you have experience that bears on this, or can cite relevant literature, please let me know and I'll update the posting.

No comments: