Evidence-based gene predictions in plant genomes

18 June 2009

journal article
research article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 19 (10), 1912-1923
https://doi.org/10.1101/gr.088997.108

Abstract

Automated evidence-based gene building is a rapid and cost-effective way to provide reliable gene annotations on newly sequenced genomes. One limitation of evidence-based gene builders, however, is their requirement for transcriptional evidence in the species of interest, namely known proteins, full-length cDNAs, or expressed sequence tags (ESTs). This limitation is of particular concern for plant genomes, where the rate of genome sequencing is greatly outpacing the rate of EST- and cDNA-sequencing projects. To overcome this limitation, we have developed an evidence-based gene build system (the Gramene pipeline) that can use transcriptional evidence across related species. The Gramene pipeline uses the Ensembl computing infrastructure with a novel data processing scheme. Using two previously annotated plant genomes, the dicot Arabidopsis thaliana and the monocot Oryza sativa, we show that cross-species ESTs from within the monocot or dicot class are a valuable source of evidence for gene predictions. We also find that, using only EST and cross-species evidence, the Gramene pipeline can generate a plant gene set that is comparable in quality to the human genes based on known proteins and full-length cDNAs. We compare the Gramene pipeline to several widely used ab initio gene prediction programs in rice; this comparison shows that the pipeline performs favorably at both the gene and exon levels with cross-species gene products only. We present the results of testing the pipeline on a 22-Mb region of the newly sequenced maize genome and discuss potential application of the pipeline to other genomes.

