AntiFam: a tool to help identify spurious ORFs in protein annotation
Open Access
- 1 January 2012
- journal article
- research article
- Published by Oxford University Press (OUP) in Database: The Journal of Biological Databases and Curation
- Vol. 2012, bas003
- https://doi.org/10.1093/database/bas003
Abstract
As the deluge of genomic DNA sequence grows, the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lacking. These issues have led to fears about transitive annotation errors making sequence databases less reliable. During the lifetime of the Pfam protein families database, a number of protein families were built that were later identified as being composed solely of spurious open reading frames (ORFs), either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that these families may serve a useful function in identifying new spurious ORFs. We have collected these families together in AntiFam along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as a part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
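To make concrete how a single stretch of DNA can yield ORFs on the opposite strand or in an overlapping reading frame, here is a minimal Python sketch. It is an illustration of the concept only, not AntiFam's method (the abstract describes AntiFam as a collection of families, not an ORF scanner). The function names, the `min_len` parameter, the toy sequence and the simplification that an ORF is any stop-to-stop stretch are all assumptions made for this example.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str, min_len: int = 3):
    """List (strand, frame, peptide) for every stop-free stretch.

    Assumption for illustration: an ORF is any stretch between stop
    codons with at least min_len residues, with no start codon required.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


# Hypothetical toy sequence: the forward strand encodes M-A-K-stop.
for strand, frame, peptide in six_frame_orfs("ATGGCCAAATAA"):
    print(strand, frame, peptide)
```

Even this twelve-base toy sequence produces a stop-free stretch on the reverse strand alongside the intended forward-strand peptide, which is the kind of alternative reading that a gene caller can mistakenly report as a protein.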