[7] Finding protein similarities with nucleotide sequence databases

Abstract
In this chapter we describe strategies for searching translated nucleotide sequence databases. By applying standard searching techniques developed for protein databases,6 we have found that previously unrecognized homologies can be detected. In addition, we have shown that extremely high sensitivity can be obtained for short regions of similarity by using the scoring matrix strategy.11 The latter approach is particularly effective for detecting homologs found at the ends of sequences and within data of poor quality. Each method is demonstrated individually for the LysR family of bacterial activator proteins. Successive applications of these methods allow sensitive detection of complex relationships, as demonstrated for the AraC family and for the complex LuxR-OmpR-NtrC families of bacterial activator proteins. Although our examples are drawn from bacterial sequences, these methods are likewise effective for higher eukaryotic genomic sequences, where protein-coding sequences are usually interrupted by introns. This should become particularly important in the future, since much of the expected growth in nucleotide sequence databases is likely to come from eukaryotic genomic sequencing projects.
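To make the idea of a "translated" nucleotide database concrete, the following is a minimal Python sketch of six-frame conceptual translation, the step that turns a nucleotide entry into protein sequences that protein search methods can then scan. It is an illustration only, not the software used in this chapter. The function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical, the standard genetic code is assumed, and any codon containing an ambiguous base is written as `X`.

```python
from itertools import product

# Standard genetic code (assumption), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Frame N starts at the N-th base of the given strand.
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Stop codons are kept as `*` rather than used to split the translation, so a similarity that is interrupted by a frameshift or a sequencing error still appears, in pieces, across the different frames.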