Discovery and revision of Arabidopsis genes by proteogenomics

Open Access

30 December 2008

journal article
research article
Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences of the United States of America

Vol. 105 (52), 21034-21038
https://doi.org/10.1073/pnas.0811066106

Abstract

Gene annotation underpins genome science. Most often, protein-coding sequence is inferred from the genome based on transcript evidence and computational predictions. Although generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene-finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high-quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
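
To illustrate what a 6-frame translation involves, the following minimal Python sketch translates a DNA sequence in all three forward and three reverse-complement reading frames and reports which frames contain a given peptide. It is not the authors' pipeline: the function names (`six_frame_translation`, `frames_containing`) and the toy sequence are hypothetical, the standard genetic code is assumed, and a plain substring match stands in for the database search of tandem mass spectra that a real proteogenomic workflow would use.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to their translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_containing(peptide, dna):
    """List the frames whose translation contains the peptide (hypothetical helper)."""
    return [
        name
        for name, protein in six_frame_translation(dna).items()
        if peptide in protein
    ]


# Toy example sequence (an assumption, not data from the study).
toy_dna = "ATGGCCAAGTAA"
print(six_frame_translation(toy_dna))
print(frames_containing("MAK", toy_dna))
```

A peptide that appears in one of these frames but in no annotated protein is the kind of evidence the abstract calls a novel peptide. The sketch ignores splicing, which is why the study also searched an exon splice-graph.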

Cited by 241 articles