Learning string similarity measures for gene/protein name dictionary look-up using logistic regression
Open Access
- 12 August 2007
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 23 (20), 2768-2774
- https://doi.org/10.1093/bioinformatics/btm393
Abstract
Motivation: One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed.
Results: We used logistic regression for learning a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks.
Availability: A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/
Contact: yoshimasa.tsuruoka@manchester.ac.uk
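The idea stated in the abstract, learning a string similarity measure with logistic regression, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the features (character bigram overlap, relative length difference, matching first character), the toy name pairs, the training settings, and the function names `char_ngrams`, `features`, `train` and `similarity`. The learned similarity is the model's estimated probability that two names refer to the same concept.

```python
import math

def char_ngrams(s, n=2):
    """Set of character n-grams of a lowercased string padded with ^ and $."""
    s = f"^{s.lower()}$"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def features(a, b):
    """Hypothetical feature vector for a name pair (not the article's feature set)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    jaccard = len(ga & gb) / len(union) if union else 0.0
    len_diff = abs(len(a) - len(b)) / max(len(a), len(b), 1)
    same_first = 1.0 if a[:1].lower() == b[:1].lower() else 0.0
    return [1.0, jaccard, len_diff, same_first]  # leading 1.0 is the bias term

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def similarity(w, a, b):
    """Learned similarity: estimated probability that a and b name the same concept."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(a, b))))

def train(pairs, epochs=500, lr=0.5):
    """Fit logistic regression weights by batch gradient ascent on the log-likelihood."""
    data = [(features(a, b), y) for a, b, y in pairs]
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj
        w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]
    return w

# Toy training pairs (assumed for illustration): label 1 means both names
# belong to the same dictionary entry, label 0 means they do not.
PAIRS = [
    ("IL-2", "IL2", 1),
    ("interleukin 2", "interleukin-2", 1),
    ("NF-kappa B", "NF kappaB", 1),
    ("p53", "P53", 1),
    ("IL-2", "p53", 0),
    ("interleukin 2", "NF-kappa B", 0),
    ("TNF alpha", "IL2", 0),
    ("p53", "interleukin-2", 0),
]

weights = train(PAIRS)

if __name__ == "__main__":
    for query, entry in [("IL-2", "IL 2"), ("IL-2", "TNF alpha")]:
        print(query, "|", entry, "|", round(similarity(weights, query, entry), 3))
```

In a dictionary look-up setting, a query name would be scored against candidate entries with `similarity` and the highest-scoring entry's ID returned. A realistic system would draw positive and negative pairs from an actual gene/protein name dictionary rather than from a hand-written list.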