The use of gene ontology evidence codes in preventing classifier assessment bias

Abstract
Motivation: The biological community's reliance on computational annotations of protein function makes correct assessment of function prediction methods an issue of great importance. The fact that a large fraction of the annotations in current biological databases are based on computational methods can lead to bias in estimating the accuracy of function prediction methods: predicting an annotation that was itself derived computationally is likely easier than predicting an annotation that was derived experimentally, leading to over-optimistic estimates of classifier performance.

Results: We illustrate this phenomenon in a set of controlled experiments using a nearest-neighbor classifier based on PSI-BLAST similarity scores. Our results demonstrate that the source of the Gene Ontology (GO) annotations used to assess a protein function predictor can have a highly significant influence on measured accuracy: the average accuracy over four species and over GO terms in the biological process namespace increased from 0.72 to 0.87 when the classifier was given access to annotations assigned evidence codes indicating a possible computational source, rather than experimentally determined annotations. Slightly smaller increases were observed in the other namespaces. In these comparisons, the total number of annotations and their distribution across GO terms were kept the same.

Conclusion: Taking GO evidence codes into account is required for reporting accuracy statistics that do not overestimate a model's performance, and is of particular importance for a fair comparison of classifiers that rely on different information sources.

Contact: rogersma@cs.colostate.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
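To make the evaluation setup concrete, the following is a minimal sketch (not the authors' code) of the two steps described in the abstract: partitioning GO annotations by evidence code, and transferring annotations from the most PSI-BLAST-similar protein. The evidence-code set below is the standard GO experimental set; the exact code sets, similarity scoring, and function names are assumptions for illustration only.

    from collections import defaultdict

    # Standard GO experimental evidence codes (assumed here to define the
    # "experimentally determined" partition; the paper's exact sets may differ).
    EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

    def split_by_evidence(annotations):
        """Partition (protein, go_term, evidence_code) triples into annotations
        with experimental support and annotations with a possible computational
        source (everything else, e.g. IEA, ISS)."""
        experimental, computational = defaultdict(set), defaultdict(set)
        for protein, go_term, code in annotations:
            target = experimental if code in EXPERIMENTAL_CODES else computational
            target[protein].add(go_term)
        return experimental, computational

    def nearest_neighbor_predict(query, similarity, train_annotations):
        """Transfer the GO terms of the most similar annotated training protein.

        similarity: dict mapping (query, train_protein) -> similarity score,
        e.g. derived from PSI-BLAST E-values (hypothetical scoring scheme).
        """
        candidates = [p for p in train_annotations if (query, p) in similarity]
        if not candidates:
            return set()
        best = max(candidates, key=lambda p: similarity[(query, p)])
        return train_annotations[best]

Under this sketch, the comparison reported in the abstract amounts to running nearest_neighbor_predict once with train_annotations drawn from the experimental partition and once from the computational partition, while holding the number of annotations and their distribution across GO terms fixed.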