BioCreative V CDR task corpus: a resource for chemical disease relation extraction
Open Access
- 8 May 2016
- journal article
- research article
- Published by Oxford University Press (OUP) in Database: The Journal of Biological Databases and Curation
Abstract
Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection must contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus, called BC5CDR, during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators, followed by a consensus annotation. The average inter-annotator agreement (IAA) scores in the test set were 87.49% for diseases and 96.05% for chemicals, according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.
Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
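The inter-annotator agreement above is reported as a Jaccard similarity coefficient, that is, the size of the intersection of two annotators' annotation sets divided by the size of their union. The following is a minimal sketch of that calculation, not the corpus authors' scoring script. It assumes each annotation is an exact (start offset, end offset, concept identifier) triple and that two annotations agree only when all three match; the abstract does not state the matching criterion. The offsets and identifiers are made-up placeholders.

```python
from typing import Set, Tuple

# Assumed representation: (start offset, end offset, concept identifier).
Annotation = Tuple[int, int, str]


def jaccard(a: Set[Annotation], b: Set[Annotation]) -> float:
    """Return |A intersect B| / |A union B|; defined here as 1.0 if both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Toy annotations for one abstract (placeholder offsets and IDs, not corpus data).
annotator_1 = {(0, 8, "CHEM_A"), (25, 37, "DIS_B"), (50, 61, "CHEM_C")}
annotator_2 = {(0, 8, "CHEM_A"), (25, 37, "DIS_B"), (70, 79, "DIS_D")}

# Two shared annotations out of four distinct ones.
score = jaccard(annotator_1, annotator_2)
print(f"Jaccard IAA: {score:.2%}")
```

A looser criterion, such as matching on concept identifier alone or allowing overlapping spans, would give different scores, so figures computed this way are comparable only under the same matching rule.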