Regional sequence expansion or collapse in heterozygous genome assemblies
Open Access
- 31 July 2020
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 16 (7), e1008104
- https://doi.org/10.1371/journal.pcbi.1008104
Abstract
High levels of heterozygosity present a unique genome assembly challenge and can adversely impact downstream analyses, yet they are common in sequencing datasets obtained from non-model organisms. Here we show that by re-assembling a heterozygous dataset with variant parameters and different assembly algorithms, we are able to generate assemblies whose protein annotations are statistically enriched for specific gene ontology categories. While total assembly length was not significantly affected by the assembly methodologies tested, the assemblies generated varied widely in fragmentation level, and we show that local assembly collapse or expansion underlies the enrichment or depletion of specific protein functional groups. We show that these statistically significant deviations in gene ontology groups can occur in seemingly high-quality assemblies, and result from difficult-to-detect local sequence expansion or contraction. Given the unpredictable interplay between assembly algorithm, parameters, and the heterozygosity of the biological sequence data, we highlight the need for better measures of assembly quality than the N50 value, including methods for assessing local expansion and collapse.
In the genomic era, genomes must be reconstructed from fragments using computational methods, or assemblers. How do we know that a new genome assembly is correct? This matters because errors in assembly can lead to downstream problems in gene prediction, and these inaccurate results can contaminate databases, affecting later comparative studies. A particular challenge occurs when a diploid organism inherits two highly divergent genome copies from its parents. While it is widely appreciated that this type of data is difficult for assemblers to handle properly, here we show that the process is prone to more errors than previously appreciated.
Specifically, we document examples of regional expansion and collapse that affect downstream gene prediction accuracy without changing the overall genome assembly size or other metrics of accuracy. Our results suggest that assembly evaluation methods should be altered to identify whether regional expansions and collapses are present in the genome assembly.