Predictive statistics and artificial intelligence in the U.S. National Cancer Institute's drug discovery program for cancer and AIDS

Abstract
The National Cancer Institute's drug discovery program screens more than 20,000 chemical compounds and natural products a year for activity against a panel of 60 tumor cell lines in vitro. The result is an information‐rich database of patterns that form the basis for what we term an “information‐intensive” approach to the process of drug discovery. The first step was a demonstration, both by statistical methods (including the program COMPARE) and by neural networks, that patterns of activity in the screen can be used to predict a compound's mechanism of action. Given this finding, the overall plan has been to develop three large matrices of information: the first (designated A) gives the pattern of activity for each compound tested against each cell line in the screen; the second (S) encodes any of a number of types of 2‐D or 3‐D structural motifs for each compound; the third (T) indicates each cell's expression of molecular targets (e.g., from 2‐dimensional protein gel electrophoresis). Construction and updating of these matrices is an ongoing process. The matrices can be concatenated in various ways to test a variety of specific hypotheses about the compounds screened, as well as to “prioritize” candidate compounds for testing. To aid in these efforts, we have developed the DISCOVERY program package, which integrates the matrix data for visual pattern recognition. The “information‐intensive” approach summarized here in some senses serves to bridge the perceived gap between screening and structure‐based drug design.
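
As an illustration only, the three-matrix layout can be sketched in a few lines of Python with NumPy. Everything below is an assumption made for the example: the toy values are invented, the panel is cut to 5 cell lines instead of 60, the orientations (A as compounds by cell lines, S as compounds by structural features, T as targets by cell lines) are one possible reading of the abstract, and the Pearson correlation stands in for the pattern comparison that COMPARE performs, whose actual algorithm is not described here. The helper names row_standardize and pattern_correlations are hypothetical and are not part of COMPARE or DISCOVERY.

```python
import numpy as np


def row_standardize(M):
    """Center each row and scale it to unit length (rows must not be constant)."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


def pattern_correlations(X, Y):
    """Pearson correlation of every row of X with every row of Y.

    Both matrices must share the same columns (here: the same cell lines).
    """
    return row_standardize(X) @ row_standardize(Y).T


# A: activity of each compound (rows) against each cell line (columns). Toy data.
A = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],  # same pattern as compound 0, different scale
    [5.0, 4.0, 3.0, 2.0, 1.0],   # reversed pattern
    [1.0, 3.0, 2.0, 5.0, 4.0],
])

# S: presence (1) or absence (0) of structural motifs for each compound. Toy data.
S = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
])

# T: expression of each molecular target (rows) in each cell line (columns). Toy data.
T = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [3.0, 1.0, 2.0, 1.0, 3.0],
])

# Compound-versus-compound similarity of activity patterns, then a ranking
# of the other compounds against a chosen "seed" compound.
AA = pattern_correlations(A, A)
seed = 0
ranked = [int(i) for i in np.argsort(-AA[seed]) if i != seed]

# Compound-versus-target correlations across the shared cell lines.
AT = pattern_correlations(A, T)

# One way to concatenate matrices: join activity and structure per compound.
AS = np.hstack([A, S])

print("compounds ranked by similarity to seed:", ranked)
print("activity/target correlations:\n", np.round(AT, 3))
print("concatenated A|S shape:", AS.shape)
```

In this sketch, a high entry in AA suggests two compounds with similar activity patterns across the panel, and a high entry in AT flags a target whose expression tracks a compound's activity. Either is only a starting point for a hypothesis, not evidence of a shared mechanism.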