A novel method for mining highly imbalanced high-throughput screening data in PubChem

Open Access

13 October 2009

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 25 (24), 3310-3316
https://doi.org/10.1093/bioinformatics/btp589

Abstract

Motivation: The comprehensive information on small molecules and their biological activities in PubChem brings great opportunities for academic researchers. However, mining high-throughput screening (HTS) assay data remains a great challenge, given the very large data volume and the highly imbalanced nature of the data, with only a small number of active compounds compared to inactive compounds. Therefore, there is currently a need for better strategies to work with HTS assay data. Moreover, as luciferase-based HTS technology is frequently exploited in the assays deposited in PubChem, constructing a computational model to distinguish and filter out potential interference compounds for these assays is another motivation.

Results: We used the granular support vector machines (SVMs) repetitive undersampling method (GSVM-RU) to construct an SVM from luciferase inhibition bioassay data in which the active/inactive imbalance ratio is high (1/377). The best model recognized the active and inactive compounds at accuracies of 86.60% and 88.89%, respectively, with a total accuracy of 87.74%, by cross-validation test and blind test. These results demonstrate the robustness of the model in handling the intrinsic imbalance problem in HTS data, and the model can be used as a virtual screening tool to identify potential interference compounds in luciferase-based HTS experiments. Additionally, this method proved computationally efficient, greatly reducing the computational cost, and can be easily adopted in the analysis of HTS data for other biological systems.

Availability: Data are publicly available in PubChem with AIDs of 773, 1006 and 1379.

Contact: ywang@ncbi.nlm.nih.gov; bryant@ncbi.nlm.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.
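The Results above report per-class accuracies alongside a total accuracy, which matters at a 1/377 active/inactive ratio. The short sketch below illustrates why, using plain arithmetic. The compound counts are hypothetical, chosen only to match the 1/377 ratio, and the function name `per_class_accuracy` is an illustrative helper, not something taken from the paper. It does not reproduce the authors' evaluation protocol.

```python
# Illustrative arithmetic only. The counts are hypothetical and were chosen
# to match the 1/377 active/inactive ratio quoted in the abstract.

def per_class_accuracy(tp, fn, tn, fp):
    """Return (active accuracy, inactive accuracy, overall accuracy) as fractions."""
    active_acc = tp / (tp + fn)      # fraction of actives recognized
    inactive_acc = tn / (tn + fp)    # fraction of inactives recognized
    overall = (tp + tn) / (tp + fn + tn + fp)
    return active_acc, inactive_acc, overall

n_active, n_inactive = 1_000, 377_000  # hypothetical counts, ratio 1/377

# A trivial model that labels every compound inactive.
trivial = per_class_accuracy(tp=0, fn=n_active, tn=n_inactive, fp=0)

# A model with the per-class accuracies reported in the abstract,
# applied to the same hypothetical counts.
tp = round(0.8660 * n_active)
tn = round(0.8889 * n_inactive)
model = per_class_accuracy(tp=tp, fn=n_active - tp, tn=tn, fp=n_inactive - tn)

# Unweighted mean of the two per-class accuracies (balanced accuracy).
balanced_trivial = (trivial[0] + trivial[1]) / 2
balanced_model = (model[0] + model[1]) / 2

print(f"trivial: overall={trivial[2]:.4f} balanced={balanced_trivial:.4f}")
print(f"model:   overall={model[2]:.4f} balanced={balanced_model:.4f}")
```

On such an imbalanced set, the trivial model scores a higher overall accuracy than the useful one while recognizing no active compounds at all, so a class-wise breakdown like the one in the abstract is the more informative summary.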

