ProteoClade: A taxonomic toolkit for multi-species and metaproteomic analysis

Abstract
We present ProteoClade, a Python toolkit that performs taxa-specific peptide assignment, protein inference, and quantitation for multi-species proteomics experiments. ProteoClade scales to hundreds of millions of protein sequences, requires minimal computational resources, and is open source, multi-platform, and accessible to non-programmers. We demonstrate its utility for processing quantitative proteomic data from patient-derived xenografts, and show that its speed and scalability enable a novel de novo proteomic workflow for complex microbiota samples.

Author summary
The exponential growth in the number of available reference protein sequences has created an opportunity to taxonomically annotate and quantify complex mixtures of organisms using bottom-up proteomics. However, annotating proteomics data with the relevant taxa is computationally challenging when data sets generate millions of candidate sequences and the reference database contains billions of peptide sequences. Here, we provide a software tool that enables users to perform taxon-specific quantitation on large proteomic data sets without requiring high-performance computing. The tool lets users match the reference database settings to their experimental conditions, and it scales from two-organism studies to the entire UniProt repository. In addition, we provide a de novo analysis workflow that identifies the organisms in a sample without prior specification, analogous to 16S rRNA sequencing.
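The idea of taxon-specific peptide assignment and quantitation described above can be illustrated with a minimal sketch. This is not ProteoClade's API or implementation: the function names (`tryptic_digest`, `build_index`, `quantify`), the toy sequences, and the simplified digestion rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions made for illustration only.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence, min_length=6):
    """Hypothetical helper: split a protein after K or R unless followed by P.

    Simplified rule with no missed cleavages; short fragments are dropped.
    """
    fragments = re.split(r"(?<=[KR])(?!P)", sequence)
    return [f for f in fragments if len(f) >= min_length]


def build_index(reference):
    """Map each reference peptide to the set of taxa whose proteins contain it.

    `reference` is a dict of taxon name -> list of protein sequences.
    """
    index = defaultdict(set)
    for taxon, proteins in reference.items():
        for protein in proteins:
            for peptide in tryptic_digest(protein):
                index[peptide].add(taxon)
    return dict(index)


def quantify(observed, index):
    """Sum observed peptide intensities per taxon.

    Peptides that map to exactly one taxon count toward that taxon;
    peptides found in several taxa are pooled under "shared";
    peptides absent from the index are ignored.
    """
    totals = defaultdict(float)
    for peptide, intensity in observed.items():
        taxa = index.get(peptide)
        if not taxa:
            continue
        key = next(iter(taxa)) if len(taxa) == 1 else "shared"
        totals[key] += intensity
    return dict(totals)


# Toy two-organism reference (made-up sequences, not real proteins).
reference = {
    "human": ["AAAAAAKCCCCCCR"],
    "mouse": ["AAAAAAKDDDDDDR"],
}
index = build_index(reference)

# Made-up peptide intensities; "EEEEEEK" is not in the reference.
observed = {"AAAAAAK": 10.0, "CCCCCCR": 5.0, "DDDDDDR": 2.0, "EEEEEEK": 1.0}
totals = quantify(observed, index)
```

In this toy case the peptide common to both organisms is kept apart from the species-specific signal, which is the distinction a xenograft experiment relies on. A reference of the scale described here would need an on-disk index and a taxonomy-aware treatment of shared peptides, not the in-memory dictionary used in this sketch.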
Funding Information
  • National Institutes of Health (T32 GM007067-41)
  • National Institutes of Health (R01 CA200893)
  • National Institutes of Health (R21 CA138308)
  • National Institutes of Health (R21 CA179452)