Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature

Open Access

7 December 2010

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 27 (3), 408-415
https://doi.org/10.1093/bioinformatics/btq667

Abstract

Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most disease-related mutational data are currently buried in the biomedical literature in textual form and lack the structure needed for easy retrieval and visualization. We introduce a high-throughput computational method for identifying relevant disease mutations in PubMed abstracts, applied to prostate cancer (PCa) and breast cancer (BCa) mutations.

Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder, a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.

Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.

Availability: Freely available at: http://bioinf.umbc.edu/EMU/ftp.

Contact: mkann@umbc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
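As an illustration of the two ideas named in the Results (pattern-based extraction of point mutations, followed by a filter based on sequence analysis), the following is a minimal sketch and not EMU's actual implementation. The regular expression, the function names, and the toy protein sequence are all assumptions made for this example, and the filter shown (keeping a mutation only when its wild-type residue matches a reference sequence at the stated position) is one plausible reading of the sequence-based step.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern (not EMU's rules): matches mentions such as "A123T",
# i.e. wild-type residue, 1-based position, mutant residue.
POINT_MUTATION = re.compile(
    rf"\b([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples for each mention in text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]


def passes_sequence_filter(mutation, protein_sequence):
    """Assumed filter: the wild-type residue must match the reference
    sequence at the reported 1-based position."""
    wild_type, position, _mutant = mutation
    if position > len(protein_sequence):
        return False
    return protein_sequence[position - 1] == wild_type


# Toy inputs, invented for illustration only.
abstract_text = "The T5A substitution was detected, but H2N could not be confirmed."
reference_sequence = "MKLITAG"

candidates = find_point_mutations(abstract_text)
kept = [m for m in candidates if passes_sequence_filter(m, reference_sequence)]
print(kept)  # [('T', 5, 'A')]
```

In this toy case the candidate H2N is discarded because position 2 of the reference sequence is K rather than H, which is the kind of false positive a sequence check can remove and so raise precision.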
