Abstract
The mean (G+C) composition (51.0%) and standard deviation (±3.8%) of published DNA sequences accounting for 10% of the E. coli genome are in excellent agreement with the principal overall distribution determined by high-resolution melting. While differences in base and neighbor characteristics are small and uniform throughout all regions of the genome, the (G+C) content of sequences varies in segmented fashion within boundaries corresponding to coding (53% G+C) and noncoding (46% G+C) regions, with variances in the latter being six-fold greater than in coding regions. The variance in different regions shows a strong negative dependence on the (G+C) content of the region, reflecting the condition that A-T and G-C base pairs are preferred neighbors of A-T and C-G pairs, respectively, with the bias increasing with decreasing (G+C) content. Neighbor analysis indicates that the most extreme positive biases occur in AA, TT, GC and CG throughout all regions, but particularly in noncoding regions. Extraordinary numbers of oligomeric strings of (A)n, etc., are a further consequence of this bias. These and other characteristics point to the existence of inherent biases in neighbor frequencies imposed during replication or repair, which in turn reflect neighbor influences during mutation. The bias in codon usage noted by Grantham and others is seen here as due, in part, to the adaptation of coding sequences to this microenvironment through selection among synonymous codons so as to preserve inherent neighbor biases.
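
As an illustration of the two quantities discussed above, the following is a minimal sketch of how (G+C) content and a nearest-neighbor bias might be computed for a single strand. The abstract does not state the exact bias measure used, so the observed/expected ratio under independent base frequencies is an assumption, and the function names `gc_content` and `neighbor_bias` are hypothetical.

```python
from collections import Counter
from typing import Dict


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def neighbor_bias(seq: str) -> Dict[str, float]:
    """Observed/expected ratio for each dinucleotide present in seq.

    Assumption: the expected frequency of a neighbor pair XY is the
    product of the single-base frequencies f(X) * f(Y). A ratio above
    1 indicates a positive bias for that neighbor pair.
    """
    seq = seq.upper()
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    # Count overlapping neighbor pairs along the strand.
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    return {
        pair: (count / total_pairs) / (base_freq[pair[0]] * base_freq[pair[1]])
        for pair, count in pairs.items()
    }
```

Applied separately to coding and noncoding segments, a measure of this kind would let the ratios for AA, TT, GC and CG be compared between the two classes of region.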