Comparing Fingerprints for Ligand-Based Virtual Screening: A Fast and Scalable Approach for Unbiased Evaluation

26 October 2020

journal article
research article
Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling

Vol. 60 (10), 4536-4545
https://doi.org/10.1021/acs.jcim.0c00469

Abstract

Ligand-based virtual screening is a useful tool for drug and probe discovery due to its high accessibility and scalability. The recent identification of bias in many data sets that were used in performance evaluation, quantified by the asymmetric validation embedding (AVE) score, has prompted the reanalysis of models to determine which performs best. Based on the understanding that ligand data are made up of blocks of highly correlated instances, we introduce a technique that quickly generates splits with AVE distributed close to zero using a combination of clustering and removal of the most biased data. We used our technique to compare the performance of the Morgan and CATS fingerprints and show that, after debiasing, the implementation of the CATS fingerprint performs significantly better. The code to replicate these results and perform low-bias splits is available at https://github.com/ljmartin/fp_low_ave.
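
The AVE score mentioned in the abstract can be illustrated with a short sketch. It is not taken from the authors' repository; it assumes the commonly cited form of the score, AVE = (AA - AI) + (II - IA), where each term is the fraction of validation molecules whose nearest training-set neighbour lies within a distance threshold, averaged over a range of thresholds. The function names, the Jaccard distance on binary fingerprints, and the threshold grid are illustrative assumptions.

```python
import numpy as np


def jaccard_distances(a, b):
    """Pairwise Jaccard distance between the rows of two binary matrices."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    inter = (a[:, None, :] & b[None, :, :]).sum(axis=2)
    union = (a[:, None, :] | b[None, :, :]).sum(axis=2)
    # Assumption: two all-zero fingerprints are treated as identical.
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - sim


def nn_fraction(valid, train, thresholds):
    """Mean over thresholds of the fraction of validation rows whose
    nearest training row lies within each distance threshold."""
    nearest = jaccard_distances(valid, train).min(axis=1)
    return float(np.mean([(nearest <= d).mean() for d in thresholds]))


def ave_bias(valid_act, valid_inact, train_act, train_inact, thresholds=None):
    """AVE-style bias: (AA - AI) + (II - IA). Values near zero suggest a
    split in which nearest-neighbour similarity alone is uninformative."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # illustrative grid
    aa = nn_fraction(valid_act, train_act, thresholds)
    ai = nn_fraction(valid_act, train_inact, thresholds)
    ii = nn_fraction(valid_inact, train_inact, thresholds)
    ia = nn_fraction(valid_inact, train_act, thresholds)
    return (aa - ai) + (ii - ia)


# Toy fingerprints: actives and inactives occupy disjoint bits.
actives = np.array([[1, 1, 0, 0]])
inactives = np.array([[0, 0, 1, 1]])

# Validation actives sit next to training actives (likewise for inactives),
# so this split is strongly biased.
biased = ave_bias(actives, inactives, actives, inactives)

# All four sets identical: every term is equal and the bias cancels.
unbiased = ave_bias(actives, actives, actives, actives)
```

In this toy case `biased` is close to the maximum of 2 and `unbiased` is zero. The abstract's approach of clustering and removing the most biased data aims at splits whose score is distributed close to zero; this sketch only covers the scoring step.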

Funding Information

University of Sydney
National Health and Medical Research Council (1092046)

Cited by 5 articles