Classification of Highly Unbalanced CYP450 Data of Drugs Using Cost Sensitive Machine Learning Techniques

8 December 2006

journal article
research article
Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling

Vol. 47 (1), 92-103
https://doi.org/10.1021/ci6002619

Abstract

In this paper, we study the classification of unbalanced data sets of drugs. As an example, we chose a data set of inhibitors of the cytochrome P450 2D6 isoform. Human cytochrome P450 2D6 plays a key role in the metabolism of many drugs, which makes it relevant in preclinical drug discovery. We collected a data set from annotated public data and calculated physicochemical properties with chemoinformatics methods. On this data, we built classifiers based on machine learning methods. When the class distribution is skewed, conventional machine learning methods are biased toward the larger class. To overcome this problem and obtain classifiers that are both sensitive and accurate, we combine machine learning and feature selection methods with techniques that address unbalanced classification, such as oversampling and threshold moving. We used our own implementation of a support vector machine algorithm as well as the maximum entropy method. Our feature selection is based on the unsupervised McCabe method. We compare the compounds classified in our test set structurally with compounds from the training set. We show that the applied algorithms enable effective high-throughput in silico classification of potential drug candidates.
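The two rebalancing techniques named in the abstract, oversampling and threshold moving, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `oversample_minority`, `predict_with_threshold`, and `sensitivity` are hypothetical, the labels are assumed to be 0/1 with 1 marking the minority (inhibitor) class, and the classifier scores are assumed to lie in [0, 1].

```python
import random


def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size.

    Assumption: y holds 0/1 labels. Plain random duplication is used here;
    the paper's exact oversampling scheme is not given in the abstract.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate: return unchanged copies.
        return list(X), list(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def predict_with_threshold(scores, threshold=0.5):
    """Threshold moving: label a sample positive when its score reaches the threshold.

    Lowering the threshold below 0.5 trades false positives for higher sensitivity.
    """
    return [1 if s >= threshold else 0 for s in scores]


def sensitivity(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0
```

Oversampling acts on the training data before the classifier is fitted, whereas threshold moving acts on the scores of an already fitted classifier, so the two can be combined or used separately.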
