Linguistics of Nucleotide Sequences: Morphology and Comparison of Vocabularies

Abstract
The concept of “words” in continuous languages devoid of blanks is introduced, and an operational definition of words is given. With this novel concept, nucleotide sequences become objects of linguistic analysis. The typical word size of the nucleotide language is found to be 3 to 5 bases (tri- to pentamers). Different genomes have distinct vocabularies. Comparison of these vocabularies can serve as a basis for revealing the functional and evolutionary relatedness of sequences.
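
As a rough illustration of what comparing vocabularies might look like in practice, the Python sketch below counts overlapping tri- to pentamers in two sequences and measures how much their most frequent oligomers overlap. The abstract does not state the operational definition of a word, so everything here is an assumption made for illustration: the function names (kmer_counts, vocabulary, jaccard), the choice of "most frequent k-mers" as a stand-in for words, the cutoff top, and the Jaccard index as the similarity measure.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def vocabulary(seq, k_values=(3, 4, 5), top=10):
    """Hypothetical stand-in for a vocabulary: the `top` most frequent
    k-mers for each word size k. This is NOT the paper's definition of
    a word, only a simple placeholder."""
    vocab = set()
    for k in k_values:
        vocab.update(word for word, _ in kmer_counts(seq, k).most_common(top))
    return vocab


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    seq_a = "ATGCGATACGCTTGAGATGCGATAC"
    seq_b = "TTTTAAAATTTTAAAATTTTAAAAT"
    print(jaccard(vocabulary(seq_a), vocabulary(seq_b)))
```

Under these assumptions, a similarity near 1 would mean two sequences share most of their frequent oligomers, and a value near 0 would mean their vocabularies barely overlap. A faithful implementation would replace vocabulary with the paper's own word criterion.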