Abstract
Fibrillar adhesins are bacterial cell surface proteins that mediate interactions with the environment, including host cells during colonization or other bacteria during biofilm formation. These proteins are characterized by a stalk that projects the adhesive domain closer to the binding target. Fibrillar adhesins evolve quickly and thus can be difficult to computationally identify, yet they represent an important component for understanding bacterium-host interactions. To detect novel fibrillar adhesins, we developed a random forest prediction approach based on common characteristics we identified for this protein class. We applied this approach to Firmicutes and Actinobacteria proteomes, yielding over 6,500 confidently predicted fibrillar adhesins. To verify the approach, we investigated predicted fibrillar adhesins that lacked a known adhesive domain. Based on these proteins, we identified 24 sequence clusters representing potential novel members of adhesive domain families. We used AlphaFold to verify that 15 clusters showed structural similarity to known adhesive domains, such as the TED domain. Overall, our study has made a significant contribution to the number of known fibrillar adhesins and has enabled us to identify novel members of adhesive domain families involved in bacterial pathogenesis. IMPORTANCE Fibrillar adhesins are a class of bacterial cell surface proteins that enable bacteria to interact with their environment. We developed a machine learning approach to identify fibrillar adhesins and applied this classification approach to the Firmicutes and Actinobacteria Reference Proteomes database. This method allowed us to detect a high number of novel fibrillar adhesins and also novel members of adhesive domain families. To confirm our predictions of these potential adhesin protein domains, we predicted their structure using the AlphaFold tool.