How to do an evaluation: pitfalls and traps

Abstract
The recent literature is replete with papers evaluating computational tools (often those operating on 3D structures) for their performance in a certain set of tasks. Most commonly these papers compare a number of docking tools for their performance in cognate re-docking (pose prediction) and/or virtual screening. Related papers have been published on ligand-based tools: pose prediction by conformer generators and virtual screening using a variety of ligand-based approaches. The reliability of these comparisons is critically affected by a number of factors usually ignored by the authors, including bias in the datasets used in virtual screening, the metrics used to assess performance in virtual screening and pose prediction, and errors in the crystal structures used.
