Gene mention normalization and interaction extraction with context models and sentence motifs

Open Access

1 January 2008

journal article
research article
Published by Springer Science and Business Media LLC in Genome Biology

Vol. 9 (Suppl 2), S14
https://doi.org/10.1186/gb-2008-9-s2-s14

Abstract

Background: The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization (identifying biomedical objects in text) and the extraction of qualified relationships between those objects.

Results: We present solutions to gene mention normalization and extraction of protein-protein interactions. For the first task, we identify genes by using background knowledge on each gene, namely annotations related to function, location, disease, and so on. Our approach currently achieves an f-measure of 86.4% on the BioCreative II gene normalization data. For the extraction of protein-protein interactions, we pursue an approach that builds on classical sequence analysis: motifs derived from multiple sequence alignments. The method achieves an f-measure of 24.4% (micro-average) in the BioCreative II interaction pair subtask.

Conclusion: For gene mention normalization, our approach outperforms strategies that rely only on matching gene names against dictionaries, without invoking further knowledge on each gene. Motifs derived from alignments of sentences are successful at identifying protein interactions in text; the approach we present in this report is fully automated and performs similarly to systems that require human intervention at one or more stages.

Availability: Our methods for gene, protein, and species identification, and for extraction of protein-protein interactions, are available as part of the BioCreative Meta Services (BCMS); see http://bcms.bioinfo.cnio.es/.
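For readers unfamiliar with the evaluation terms in the abstract, the following is a minimal sketch of how a micro-averaged f-measure differs from a macro-averaged one. It is not the BioCreative II evaluation script. The function names and the per-document counts are hypothetical and chosen only for illustration; the f-measure is assumed to be the balanced harmonic mean of precision and recall.

```python
from typing import List, Tuple

# Each tuple holds (true positives, false positives, false negatives)
# for one document. These numbers are invented for illustration only.
Counts = Tuple[int, int, int]


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced f-measure: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_f(counts: List[Counts]) -> float:
    """Pool the counts over all documents, then compute one f-measure."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_measure(tp, fp, fn)


def macro_f(counts: List[Counts]) -> float:
    """Compute one f-measure per document, then take the plain mean."""
    return sum(f_measure(*c) for c in counts) / len(counts)


# Hypothetical counts for two documents.
example = [(8, 2, 2), (1, 3, 5)]
micro = micro_f(example)
macro = macro_f(example)
```

With these invented counts, the micro-average is dominated by the document that contributes more predictions, whereas the macro-average weights both documents equally, so the two values differ.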


