PGxCorpus, a manually annotated corpus for pharmacogenomics

Open Access

2 January 2020

journal article
research article
Published by Springer Science and Business Media LLC in Scientific Data

Vol. 7 (1), 1-13
https://doi.org/10.1038/s41597-019-0342-9

Abstract

Pharmacogenomics (PGx) studies how individual gene variations impact drug response phenotypes, which makes PGx-related knowledge a key component of precision medicine. A significant part of the state-of-the-art knowledge in PGx is accumulated in scientific publications, where it is difficult for humans or software to reuse. Natural language processing techniques have been developed to guide the experts who curate this body of knowledge, but existing work is limited by the absence of a high-quality annotated corpus focusing on the PGx domain. In particular, this absence restricts the use of supervised machine learning. This article introduces PGxCorpus, a manually annotated corpus designed to fill this gap and to enable the automatic extraction of PGx relationships from text. It comprises 945 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs and phenotypes) and the relationships between them. In this article, we present the corpus itself, its construction, and a baseline experiment that illustrates how it may be leveraged to synthesize and summarize PGx knowledge.

Funding Information

Agence Nationale de la Recherche (ANR-15-CE23-0028)
Université de Lorraine (15-IDEX-0004)
Snowball Inria Associate Team

Cited by 13 articles