A Genomic Basis for the Evolution of Vertebrate Transcription Factors Containing Amino Acid Runs

Abstract
We have previously shown that polyAla (A) tract-containing proteins frequently also contain runs of glycine (G), proline (P), and histidine (H) and that, in their ORFs, GC content at all codon positions is higher than in the rest of the genome. In this study, we present new analyses of these human proteins/ORFs. We detected striking differences in codon usage for A, G, and P between codons inside and outside runs. After dividing the ORFs into halves, we found that 5′ halves were richer in runs than 3′ halves. After removing the runs, we observed that the run-rich halves (grouped irrespective of their 5′ or 3′ position) showed a marked statistical tendency to contain more homo- and hetero-dicodons for A, G, P, and H than the run-poor halves. This suggests that, in addition to the necessary GC-rich genomic background, a specific codon organization is probably required to generate these coding repeats. Homo-dicodons may indeed provide primers for run formation through polymerase slippage. The compositional analysis of human HOX genes, the most polyAla-rich family, and their comparison with their zebrafish homologs support these hypotheses and suggest possible effects of genomic environment on ORF evolution and organismal diversification.
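As an illustration of the kind of counting described above, the following minimal Python sketch splits an ORF into 5′ and 3′ halves, detects runs of A, G, P, and H, and counts dicodons outside the runs. It is not the pipeline used in the study. Several choices are assumptions made only for this example: the minimum run length (`MIN_RUN = 4`), the definitions of a homo-dicodon (two identical adjacent codons) and a hetero-dicodon (two different adjacent codons for the same residue), the skipping of codon pairs that touch a run instead of physically excising the run, and the independent treatment of each half (so a run spanning the midpoint would be split). All function names are hypothetical.

```python
from collections import Counter

# Standard-code codons for the four residues discussed in the abstract.
CODONS = {
    "A": {"GCT", "GCC", "GCA", "GCG"},
    "G": {"GGT", "GGC", "GGA", "GGG"},
    "P": {"CCT", "CCC", "CCA", "CCG"},
    "H": {"CAT", "CAC"},
}
CODON_TO_AA = {codon: aa for aa, codons in CODONS.items() for codon in codons}

MIN_RUN = 4  # ASSUMPTION: illustrative run-length threshold, not taken from the paper


def split_codons(orf):
    """Cut an in-frame DNA string into codons, dropping any trailing partial codon."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def find_runs(codons, min_run=MIN_RUN):
    """Return (start, end, residue) for each stretch of at least `min_run`
    consecutive codons encoding the same tracked residue (end is exclusive)."""
    runs = []
    i = 0
    while i < len(codons):
        aa = CODON_TO_AA.get(codons[i])
        j = i + 1
        while aa is not None and j < len(codons) and CODON_TO_AA.get(codons[j]) == aa:
            j += 1
        if aa is not None and j - i >= min_run:
            runs.append((i, j, aa))
        i = j
    return runs


def count_dicodons(codons, runs):
    """Count homo- and hetero-dicodons for A, G, P, H among adjacent codon
    pairs in which neither codon belongs to a detected run."""
    in_run = set()
    for start, end, _ in runs:
        in_run.update(range(start, end))
    counts = Counter()
    for k in range(len(codons) - 1):
        if k in in_run or k + 1 in in_run:
            continue  # simplification: ignore pairs touching a run
        first = CODON_TO_AA.get(codons[k])
        second = CODON_TO_AA.get(codons[k + 1])
        if first is None or first != second:
            continue
        kind = "homo" if codons[k] == codons[k + 1] else "hetero"
        counts[(first, kind)] += 1
    return counts


def half_summary(orf):
    """Summarize run and dicodon counts separately for the 5' and 3' halves."""
    codons = split_codons(orf)
    mid = len(codons) // 2
    summary = {}
    for name, half in (("5'", codons[:mid]), ("3'", codons[mid:])):
        runs = find_runs(half)
        summary[name] = {"runs": len(runs), "dicodons": count_dicodons(half, runs)}
    return summary


if __name__ == "__main__":
    # Toy sequence constructed for the example; not a real gene.
    toy_orf = "ATG" + "GCC" * 5 + "GGA" * 2 + "CCT" + "CCC" + "AAA" * 10 + "TAA"
    for half, stats in half_summary(toy_orf).items():
        print(half, stats["runs"], dict(stats["dicodons"]))
```

Applied to a set of ORFs, per-half summaries of this kind could be pooled into run-rich and run-poor groups before comparing dicodon counts; the statistical comparison itself is not sketched here.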