Regional sequence expansion or collapse in heterozygous genome assemblies

Open Access

31 July 2020

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 16 (7), e1008104
https://doi.org/10.1371/journal.pcbi.1008104

Abstract

High levels of heterozygosity present a unique genome assembly challenge and can adversely impact downstream analyses, yet are common in sequencing datasets obtained from non-model organisms. Here we show that by re-assembling a heterozygous dataset with varied parameters and different assembly algorithms, we are able to generate assemblies whose protein annotations are statistically enriched for specific gene ontology categories. While total assembly length was not significantly affected by the assembly methodologies tested, the assemblies generated varied widely in fragmentation level, and we show that local assembly collapse or expansion underlies the enrichment or depletion of specific protein functional groups. We show that these statistically significant deviations in gene ontology groups can occur in seemingly high-quality assemblies, and result from difficult-to-detect local sequence expansions or contractions. Given the unpredictable interplay between assembly algorithm, parameters, and biological sequence data heterozygosity, we highlight the need for better measures of assembly quality than the N50 value, including methods for assessing local expansion and collapse.

In the genomic era, genomes must be reconstructed from fragments using computational methods, or assemblers. How do we know that a new genome assembly is correct? This is important because errors in assembly can lead to downstream problems in gene predictions, and these inaccurate results can contaminate databases, affecting later comparative studies. A particular challenge occurs when a diploid organism inherits two highly divergent genome copies from its parents. While it is widely appreciated that this type of data is difficult for assemblers to handle properly, here we show that the process is prone to more errors than previously appreciated. Specifically, we document examples of regional expansion and collapse that affect downstream gene prediction accuracy without changing the overall genome assembly size or other metrics of accuracy. Our results suggest that assembly evaluation methods should be altered to identify whether regional expansions and collapses are present in the genome assembly.
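
To illustrate why N50 and total length alone can miss the problem described above, the following is a minimal sketch in Python. It is not the authors' method. The function name `n50` and the contig lengths are hypothetical and chosen only for illustration; they are not data from the study. The sketch uses the standard definition of N50: the contig length at which the cumulative sum of contigs, sorted from longest to shortest, first reaches half of the total assembly length.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths in base pairs (illustrative, not from the paper).
baseline = [500, 300, 200]

# Same assembly, but one region has collapsed by 50 bp (300 -> 250)
# and another has expanded by 50 bp (200 -> 250).
altered = [500, 250, 250]

# Both summary metrics are identical despite the regional changes.
print(sum(baseline), n50(baseline))
print(sum(altered), n50(altered))
```

In this toy case the total length and the N50 are the same for both sets of contigs, even though two regions differ in size. This is the kind of change that a summary statistic cannot reveal and that a regional evaluation method would need to detect.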
