Intron positions correlate with module boundaries in ancient proteins

Abstract
We analyze the three-dimensional structure of proteins by a computer program that finds regions of sequence that contain module boundaries, defining a module as a segment of polypeptide chain bounded in space by a specific given distance. The program defines a set of "linker regions" that have the property that if an intron were to be placed into each linker region, the protein would be dissected into a set of modules all less than the specified diameter. We test a set of 32 proteins, all of ancient origin, and a corresponding set of 570 intron positions, to ask if there is a statistically significant excess of intron positions within the linker regions. For 28-A modules, a standard size used historically, we find such an excess, with P < 0.003. This correlation is neither due to a compositional or sequence bias in the linker regions nor to a surface bias in intron positions. Furthermore, a subset of 20 introns, which can be putatively identified as old, lies even more explicitly within the linker regions, with P < 0.0003. Thus, there is a strong correlation between intron positions and three-dimensional structural elements of ancient proteins as expected by the introns-early approach. We then study a range of module diameters and show that, as the diameter varies, significant peaks of correlation appear for module diameters centered at 21.7, 27.6, and 32.9 A. These preferred module diameters roughly correspond to predicted exon sizes of 15, 22, and 30 residues. Thus, there are significant correlations between introns, modules, and a quantized pattern of the lengths of polypeptide chains, which is the prediction of the "Exon Theory of Genes."