MAGPEL: an autoMated pipeline for inferring vAriant-driven Gene PanEls from the full-length biomedical literature
Open Access
- 23 July 2020
- journal article
- research article
- Published by Springer Science and Business Media LLC in Scientific Reports
- Vol. 10 (1), 1-11
- https://doi.org/10.1038/s41598-020-68649-0
Abstract
In spite of the efforts in developing and maintaining accurate variant databases, a large number of disease-associated variants are still hidden in the biomedical literature. Curation of the biomedical literature in an effort to extract this information is a challenging task due to: (i) the complexity of natural language processing, (ii) inconsistent use of standard recommendations for variant description, and (iii) the lack of clarity and consistency in describing the variant-genotype-phenotype associations in the biomedical literature. In this article, we employ text mining and word cloud analysis techniques to address these challenges. The proposed framework extracts the variant-gene-disease associations from the full-length biomedical literature and designs an evidence-based variant-driven gene panel for a given condition. We validate the identified genes by showing their diagnostic abilities to predict the patients' clinical outcome on several independent validation cohorts. As representative examples, we present our results for acute myeloid leukemia (AML), breast cancer and prostate cancer. We compare these panels with other variant-driven gene panels obtained from ClinVar, Mastermind and others from literature, as well as with a panel identified with a classical differentially expressed genes (DEGs) approach. The results show that the panels obtained by the proposed framework yield better results than the other gene panels currently available in the literature.
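The extraction step described in the abstract can be illustrated with a toy sketch. It is not MAGPEL's implementation, which the abstract does not detail. It assumes a hypothetical hand-written gene list (`GENES`), a deliberately narrow pattern for single-letter protein substitutions such as R882H or p.R882C, and sentence-level co-occurrence as the only evidence linking a variant to a gene. The function name `rank_genes` is likewise an illustrative choice.

```python
import re
from collections import Counter

# Hypothetical gene list for illustration only; a real pipeline would rely
# on a curated gene nomenclature resource.
GENES = {"DNMT3A", "BRCA1", "BRCA2", "PALB2", "TP53"}

# One-letter amino acid codes.
AA = "ACDEFGHIKLMNPQRSTVWY"

# Matches simple protein substitutions such as "R882H" or "p.R882C".
# Many other variant notations exist and are NOT covered by this pattern.
VARIANT_RE = re.compile(rf"\b(?:p\.)?([{AA}])(\d+)([{AA}])\b")


def rank_genes(sentences):
    """Count variant mentions that co-occur with a known gene symbol in the
    same sentence, and return genes ordered by that count (highest first)."""
    counts = Counter()
    for sentence in sentences:
        variants = VARIANT_RE.findall(sentence)
        if not variants:
            continue
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        for gene in GENES & tokens:
            counts[gene] += len(variants)
    return counts.most_common()


example_sentences = [
    "DNMT3A R882H was recurrent in AML.",
    "The p.R882C change in DNMT3A was also seen.",
    "BRCA1 expression was reduced.",
    "TP53 R175H is a hotspot.",
]
ranking = rank_genes(example_sentences)
print(ranking)
```

Sentence-level co-occurrence is the simplest possible association rule and will produce false positives when a sentence names several genes. It is used here only to make the idea of a variant-driven gene ranking concrete.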
Funding Information
- National Science Foundation (SBIR 1853207)
- U.S. Department of Health & Human Services | NIH | National Institute of Diabetes and Digestive and Kidney Diseases (1R01DK107666-01)
- United States Department of Defense | United States Army | Army Medical Command | Congressionally Directed Medical Research Programs (W81XWH-16-1-0516)