Repetitive DNA and next-generation sequencing: computational challenges and solutions

Abstract
New high-throughput sequencing technologies have spurred explosive growth in the use of sequencing to discover mutations and structural variants in the human genome and in the number of projects to sequence and assemble new genomes. Highly efficient algorithms have been developed to align next-generation sequences to genomes, and these algorithms use a variety of strategies to place repetitive reads. Ambiguous mapping of sequences that are derived from repetitive regions makes it difficult to identify true polymorphisms and to reconstruct transcripts. Short read lengths combined with mapping ambiguities lead to false reports of single-nucleotide polymorphisms, insertions, deletions and other sequence variants. When assembling a genome de novo, repetitive sequences can lead to erroneous rearrangements, deletions, collapsed repeats and other assembly errors. Long-range linking information from paired-end reads can overcome some of the difficulties in short-read assembly.
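
The mapping ambiguity described above can be illustrated with a minimal sketch. It assumes exact string matching on a toy genome, which is a simplification and not how production aligners work; the names `find_occurrences`, `REPEAT` and `GENOME` are hypothetical and chosen only for illustration. A read that lies entirely within a repeat matches every copy, whereas a read that extends into unique flanking sequence has a single placement.

```python
def find_occurrences(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping matches are also found.
        start = genome.find(read, start + 1)
    return positions


# Toy genome (an assumption for illustration): two identical copies of a
# 10-bp repeat separated and flanked by unique sequence.
REPEAT = "ACGTACGTTT"
GENOME = "GGCTA" + REPEAT + "TTGCA" + REPEAT + "CCATG"

read_in_repeat = "GTACGT"   # lies entirely inside the repeat
read_spanning = "CTAACG"    # starts in the unique left flank, ends in the repeat

hits_repeat = find_occurrences(GENOME, read_in_repeat)
hits_spanning = find_occurrences(GENOME, read_spanning)

print("read inside repeat maps to:", hits_repeat)
print("read spanning the boundary maps to:", hits_spanning)
```

In this toy case the first read has two equally good placements and the second has one, which is the sense in which longer reads, or reads anchored by a uniquely mapped mate, reduce ambiguity.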