BioCreative V CDR task corpus: a resource for chemical disease relation extraction

Open Access

8 May 2016

journal article
research article
Published by Oxford University Press (OUP) in Database: The Journal of Biological Databases and Curation

Vol. 2016
https://doi.org/10.1093/database/baw068

Abstract

Community-run formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. BioCreative V recently organized a new challenge comprising two tasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Both tasks require a test collection that contains disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, no existing corpus was found to be sufficient for the task. We therefore developed our own corpus, BC5CDR, during the challenge, inviting a team of Medical Subject Headings (MeSH) indexers to annotate disease/chemical entities and Comparative Toxicogenomics Database (CTD) curators to annotate CID relations. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators, followed by a consensus annotation. The average inter-annotator agreement (IAA) scores in the test set were 87.49% for diseases and 96.05% for chemicals, according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.

Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
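As an illustration of the agreement measure named in the abstract, the following is a minimal sketch of a Jaccard similarity coefficient computed over two annotators' entity annotations. It rests on assumptions not stated in the abstract: that two annotations count as the same only when article, span, entity type and concept identifier all match exactly, and that agreement is computed over the pooled annotation sets. The `Annotation` type, the `jaccard` function, the article id and the concept identifiers are hypothetical placeholders, not values or formats from the corpus.

```python
from typing import NamedTuple, Set


class Annotation(NamedTuple):
    """One entity annotation: a text span plus a normalized concept id.

    The field layout is an assumption made for this sketch, not the
    corpus's actual file format.
    """
    article_id: str
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    entity_type: str  # "Disease" or "Chemical"
    concept_id: str   # placeholder for a MeSH identifier


def jaccard(a: Set[Annotation], b: Set[Annotation]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two annotation sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets agree fully
    return len(a & b) / len(a | b)


# Hypothetical annotations from two annotators on one article.
annotator_1 = {
    Annotation("article-1", 0, 9, "Chemical", "CHEM-A"),
    Annotation("article-1", 25, 37, "Disease", "DIS-A"),
    Annotation("article-1", 50, 58, "Disease", "DIS-B"),
}
annotator_2 = {
    Annotation("article-1", 0, 9, "Chemical", "CHEM-A"),
    Annotation("article-1", 25, 37, "Disease", "DIS-A"),
    Annotation("article-1", 60, 70, "Disease", "DIS-C"),
}

agreement = jaccard(annotator_1, annotator_2)
print(f"IAA (Jaccard): {agreement:.2%}")  # 2 shared / 4 distinct = 50.00%
```

Filtering each set by `entity_type` before calling `jaccard` would give separate disease and chemical scores, in the spirit of the per-type figures the abstract reports.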
