Open-source QSAR models for pKa prediction using multiple machine learning approaches

Open Access

18 September 2019

journal article
research article
Published by Springer Science and Business Media LLC in Journal of Cheminformatics

Vol. 11 (1), 1-20
https://doi.org/10.1186/s13321-019-0384-1

Abstract

Background: The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction.

Methods: The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure–activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN).

Results: The three methods delivered comparable performance on the training and test sets, with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and the performance of our models compared favorably to the commercial products.

Conclusions: This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.
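
The abstract summarizes model quality with two statistics, RMSE and the coefficient of determination. As a reading aid, the minimal sketch below shows how these are commonly computed from observed and predicted pKa values. It is not taken from the authors' code: the function names and the five example values are hypothetical and chosen only for illustration, and the R² shown is the standard 1 − SS_res/SS_tot definition, which may differ from the exact variant used in the paper.

```python
import math


def rmse(observed, predicted):
    """Root-mean-squared error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical pKa values for illustration only (not data from the paper).
    observed_pka = [4.8, 9.2, 3.1, 10.5, 7.4]
    predicted_pka = [5.5, 8.0, 4.0, 9.9, 7.0]
    print(f"RMSE = {rmse(observed_pka, predicted_pka):.2f}")
    print(f"R2   = {r_squared(observed_pka, predicted_pka):.2f}")
```

Because pKa is a logarithmic quantity, an RMSE of about 1.5 corresponds to a typical error of roughly 1.5 log units in the dissociation constant.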
