Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature
Open Access
- 7 December 2010
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 27 (3), 408-415
- https://doi.org/10.1093/bioinformatics/btq667
Abstract
Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most disease-related mutational data are currently buried in the biomedical literature in textual form and lack the structure needed for easy retrieval and visualization. We introduce a high-throughput computational method for identifying relevant disease mutations in PubMed abstracts and apply it to prostate cancer (PCa) and breast cancer (BCa) mutations.
Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder, a tool that extracts point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations, manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Availability: Freely available at: http://bioinf.umbc.edu/EMU/ftp.
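The abstract does not spell out EMU's extraction patterns or how its sequence-based filter works, so the following Python sketch is only an illustration of the general idea, not the authors' implementation. It assumes that point mutations are written as `A123T` or `Ala123Thr`, and that a sequence filter keeps a mention only when the stated wild-type residue is found at the stated position of a candidate protein sequence. The names `extract_mutations`, `passes_sequence_filter`, and the toy sequence are hypothetical.

```python
import re

# Three-letter to one-letter amino acid codes (the 20 standard residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = "".join(sorted(THREE_TO_ONE.values()))
THREE_LETTER = "|".join(THREE_TO_ONE)

# Assumed mention formats (not EMU's actual patterns): "A123T" or "Ala123Thr".
MUTATION_RE = re.compile(
    rf"\b(?:([{ONE_LETTER}])(\d+)([{ONE_LETTER}])"
    rf"|({THREE_LETTER})(\d+)({THREE_LETTER}))\b"
)


def extract_mutations(text):
    """Return (wild_type, position, mutant) tuples in one-letter code."""
    found = []
    for match in MUTATION_RE.finditer(text):
        if match.group(1):  # one-letter form, e.g. "T877A"
            wild_type, position, mutant = match.group(1, 2, 3)
        else:  # three-letter form, e.g. "Ala123Thr"
            wild_type = THREE_TO_ONE[match.group(4)]
            position = match.group(5)
            mutant = THREE_TO_ONE[match.group(6)]
        found.append((wild_type, int(position), mutant))
    return found


def passes_sequence_filter(mutation, protein_sequence):
    """Assumed filter: the wild-type residue must occur at the 1-based position."""
    wild_type, position, _mutant = mutation
    return (
        1 <= position <= len(protein_sequence)
        and protein_sequence[position - 1] == wild_type
    )


if __name__ == "__main__":
    toy_sequence = "MKLVT"  # hypothetical protein sequence, for illustration only
    text = "We observed Leu3Pro and K3P in the toy protein."
    for mutation in extract_mutations(text):
        print(mutation, passes_sequence_filter(mutation, toy_sequence))
```

A purely pattern-based extractor of this kind will also match strings that are not mutations, which is one plausible reason a consistency check against the protein sequence could raise precision, as the reported figures suggest.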
Contact: mkann@umbc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.