A method for achieving complete microbial genomes and improving bins from metagenomics data

Abstract
Metagenomics facilitates the study of genetic information from uncultured microbes and complex microbial communities. Assembling complete genomes from metagenomics data is difficult because most samples have high organismal complexity and strain diversity. Some studies have attempted to extract complete bacterial, archaeal, and viral genomes, often focusing on species with circular genomes so that circularity can help confirm completeness. However, fewer than 100 circularized bacterial and archaeal genomes have been assembled and published from metagenomics data despite the thousands of datasets that are available. Circularized genomes are important for (1) building a reference collection as scaffolds for future assemblies, (2) providing the complete gene content of a genome, (3) confirming little or no contamination of a genome, (4) studying the genomic context and synteny of genes, and (5) linking protein coding genes to ribosomal RNA genes to aid metabolic inference in 16S rRNA gene sequencing studies. We developed a semi-automated method called Jorg to help circularize small bacterial, archaeal, and viral genomes using iterative assembly, binning, and read mapping. In addition, this method exposes potential misassemblies from k-mer based assemblies. We chose species of the Candidate Phyla Radiation (CPR) to focus our initial efforts because they have small genomes and are only known to have one ribosomal RNA operon. In addition to 34 circular CPR genomes, we present one circular Margulisbacteria genome, one circular Chloroflexi genome, and two circular megaphage genomes from 19 public and published datasets. We demonstrate findings that would likely be difficult without circularizing genomes, including that ribosomal genes are likely not operonic in the majority of CPR, and that some CPR harbor diverged forms of RNase P RNA.
Code and a tutorial for this method are available at https://github.com/lmlui/Jorg, and the method is also available on the DOE Systems Biology KnowledgeBase as a beta app.
Because we cannot culture many of the microorganisms found in the environment, animals, and the human body, scientists rely on shotgun metagenomics to reveal their genomes and to infer their traits and capabilities. However, shotgun metagenomics often provides only fragmented genomes due to limitations of available sequencing technology and bioinformatics tools. We present a semi-automated method called Jorg that can be used to improve and eventually complete (i.e., circular with no misassemblies) prokaryotic and viral genomes from short-read metagenomics data, and that includes quality checks for misassemblies and completeness. As a proof of concept, we circularized 36 bacterial genomes and two megaphage genomes. For comparison, there are only ~100 known circularized bacterial genomes from metagenomes from ~30 other studies. We also demonstrate the utility of circularizing genomes by discovering new biological patterns in Candidate Phyla Radiation species. High-quality circularized genomes produced using this tool can also be used as scaffolds to improve future genome assemblies and as data to improve identification of species in microbiomes.
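To make the idea of using circularity as evidence of completeness concrete, the sketch below shows one common heuristic: a contig assembled from a circular genome often ends with a copy of its own beginning, so an exact overlap between its two ends suggests that the sequence closes on itself. This is an illustrative assumption, not the Jorg implementation; the function names `terminal_overlap` and `trim_circular` and the thresholds `min_len` and `max_len` are hypothetical. A terminal overlap can also arise from a genomic repeat, so it should be treated as supporting evidence to be checked by read mapping rather than as proof.

```python
def terminal_overlap(contig: str, min_len: int = 20, max_len: int = 500) -> int:
    """Return the length of the longest exact overlap between the start and
    the end of `contig`, searching lengths from max_len down to min_len.
    Returns 0 when no overlap of at least min_len bases is found.
    (Hypothetical helper; thresholds are illustrative.)"""
    # An overlap cannot sensibly exceed half of the contig length.
    upper = min(max_len, len(contig) // 2)
    for k in range(upper, min_len - 1, -1):
        if contig[:k] == contig[-k:]:
            return k
    return 0


def trim_circular(contig: str, min_len: int = 20, max_len: int = 500):
    """If the contig ends overlap, return the contig with the redundant
    end copy removed (a candidate circular sequence); otherwise None."""
    k = terminal_overlap(contig, min_len, max_len)
    if k == 0:
        return None
    return contig[:-k]
```

Calling `trim_circular(sequence)` on a contig string returns a candidate circular sequence when the ends overlap and `None` otherwise. The sketch checks exact matches only; tolerating mismatches at the ends would require an alignment step.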
Funding Information
  • ENIGMA
  • Joint Genome Institute (DE-AC02-05CH11231)
  • National Energy Research Scientific Computing Center (DE-AC02-05CH11231)
  • U.S. Department of Energy Office of Science User Facilities (DE-AC02-05CH11231)
