The origins of apicomplexan sequence innovation

Abstract
The Apicomplexa are a group of phylogenetically related parasitic protists that include Plasmodium, Cryptosporidium, and Toxoplasma. Together they impose a major global burden on human health and economies. To meet this challenge, several international consortia have generated vast amounts of sequence data for many of these parasites. Here, we exploit these data to perform a systematic analysis of protein family and domain incidence across the phylum. A total of 87,736 protein sequences were collected from 15 apicomplexan species. These were compared with three protein databases, including the partial genome database, PartiGeneDB, which increases the breadth of taxonomic coverage. From these searches we constructed taxonomic profiles that reveal the extent of apicomplexan sequence diversity. Sequences without a significant match outside the phylum were designated apicomplexan specialized. These were collated into 9134 discrete protein families and placed in the context of the apicomplexan phylogeny, identifying the putative origin of each family. Most apicomplexan families were associated with an individual genus or species. Interestingly, many genus-specific innovations were associated with specialized host cell invasion and/or parasite survival processes. In contrast, families reflecting more ancestral relationships were enriched in generalized housekeeping functions such as translation and transcription, which have diverged within the apicomplexan lineage. Protein domain searches revealed 192 domains not previously reported in apicomplexans, together with a number of novel domain combinations. We highlight domains that may be important to parasite survival.