UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase

Open Access

12 May 2020

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 36 (17), 4643-4648
https://doi.org/10.1093/bioinformatics/btaa485

Abstract

The number of protein records in the UniProt Knowledgebase (UniProtKB: https://www.uniprot.org) continues to grow rapidly as a result of genome sequencing and the prediction of protein-coding genes. Providing functional annotation for these proteins is a significant and continuing challenge. In response, UniProt has developed an annotation method known as UniRule, which is based on expertly curated rules and integrates related systems (RuleBase, HAMAP, PIRSR, PIRNR) developed by members of the UniProt consortium. UniRule uses protein family signatures from InterPro, combined with taxonomic and other constraints, to select sets of reviewed proteins that share functional properties supported by experimental evidence. This annotation is then propagated to unreviewed UniProtKB records that meet the same selection criteria, most of which do not have (and are never likely to have) experimentally verified functional annotation. Release 2020_01 of UniProtKB contains 6496 UniRule rules, which provide annotation for 53 million proteins, or 30% of the 178 million records in UniProtKB. UniRule thus provides scalable enrichment of annotation in UniProtKB. UniRule rules are integrated into UniProtKB and can be viewed at https://www.uniprot.org/unirule/. The rules, and the code required to run them, are publicly available for researchers who wish to annotate their own sequences. The implementation used to run the rules is known as UniFIRE and is available at https://gitlab.ebi.ac.uk/uniprot-public/unifire.
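
The selection-and-propagation idea described in the abstract can be illustrated with a minimal Python sketch. This is not the UniFIRE implementation and does not use the UniRule rule format: the `Protein` and `Rule` classes, their fields, and all identifiers in the example data are hypothetical, and the "other constraints" mentioned in the abstract are reduced here to a single taxonomic check.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Protein:
    """Hypothetical, simplified view of a UniProtKB record."""
    accession: str
    signatures: FrozenSet[str]   # family signature IDs matched by the sequence
    lineage: Tuple[str, ...]     # taxonomic lineage, from root to leaf
    reviewed: bool = False


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: conditions plus the annotation to propagate."""
    rule_id: str
    required_signatures: FrozenSet[str]
    required_taxon: str
    annotations: Tuple[str, ...]

    def matches(self, protein: Protein) -> bool:
        # A protein satisfies the rule when it carries every required
        # signature and its lineage contains the required taxon.
        return (self.required_signatures <= protein.signatures
                and self.required_taxon in protein.lineage)


def propagate(rules: Iterable[Rule],
              proteins: Iterable[Protein]) -> Dict[str, List[str]]:
    """Map each unreviewed accession to the annotations of the rules it satisfies."""
    rules = list(rules)
    predicted: Dict[str, List[str]] = {}
    for protein in proteins:
        if protein.reviewed:
            continue  # reviewed records already carry curated annotation
        for rule in rules:
            if rule.matches(protein):
                predicted.setdefault(protein.accession, []).extend(rule.annotations)
    return predicted


# Example data: every identifier below is made up for illustration.
example_rule = Rule(
    rule_id="RULE_EXAMPLE_1",
    required_signatures=frozenset({"SIG_EXAMPLE_A"}),
    required_taxon="Bacteria",
    annotations=("function: example annotation",),
)
example_proteins = [
    Protein("ACC_1", frozenset({"SIG_EXAMPLE_A", "SIG_EXAMPLE_B"}),
            ("cellular organisms", "Bacteria")),
    Protein("ACC_2", frozenset({"SIG_EXAMPLE_A"}),
            ("cellular organisms", "Eukaryota")),
    Protein("ACC_3", frozenset({"SIG_EXAMPLE_A"}),
            ("cellular organisms", "Bacteria"), reviewed=True),
]
example_result = propagate([example_rule], example_proteins)
```

In this sketch only `ACC_1` receives the annotation: `ACC_2` fails the taxonomic constraint and `ACC_3` is skipped because it is already reviewed.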

Funding Information

National Cancer Institute (NCI) of the National Institutes of Health (U24HG007822)
British Heart Foundation (RG/13/5/30112)
Parkinson’s Disease United Kingdom (G-1307)
Alzheimer’s Research UK (ARUK-NAS2017A-1)
National Science Foundation (DBI-1062520, NIH, U41HG02273)
National Institute of General Medical Sciences (R01GM080646, P20GM103446, G08LM010720)

