Text mining for the biocuration workflow

Open Access

1 January 2012

journal article
research article
Published by Oxford University Press (OUP) in Database: The Journal of Biological Databases and Curation

Vol. 2012, bas020
https://doi.org/10.1093/database/bas020

Abstract

Molecular biology has become heavily dependent on biological knowledge encoded in expert-curated biological databases. As the volume of biological literature increases, biocurators need help keeping up with it; (semi-)automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes in improving biocuration throughput using text mining. Our initial investigations were conducted for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also varied considerably. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provides a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.
