Genome Research
Journal Information
ISSN / EISSN: 10889051 / 10889051
Published by:
Cold Spring Harbor Laboratory
Total articles ≅ 5,815
Latest articles in this journal
Genome Research; https://doi.org/10.1101/gr.277645.123
Abstract:
Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. While k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudo-random seeding construct, strobemers, which were empirically demonstrated to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness-sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs, mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to demonstrate that our new seed constructs improve sequence-matching sensitivity to other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI-estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.
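To make the idea of a pseudo-random seed concrete, the following is a minimal sketch of an order-2 randstrobe, the basic strobemer variant. It is not the authors' implementation and does not cover mixedstrobes, altstrobes, or multistrobes. The function names, the BLAKE2 hash, and the XOR linking rule are assumptions chosen for illustration.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (a stand-in for any good hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def randstrobes(seq: str, k: int, w_min: int, w_max: int):
    """Yield (start1, start2, seed) for order-2 randstrobes.

    The first strobe is the k-mer at position i. The second strobe is the
    k-mer starting in the window [i + w_min, i + w_max] that minimizes a
    value linking it to the first strobe (XOR of hashes: an assumption).
    """
    n_kmers = len(seq) - k + 1
    hashes = [kmer_hash(seq[i:i + k]) for i in range(n_kmers)]
    for i in range(n_kmers):
        lo = i + w_min
        hi = min(i + w_max, n_kmers - 1)
        if lo > hi:
            break  # no room left for a second strobe
        j = min(range(lo, hi + 1), key=lambda p: hashes[i] ^ hashes[p])
        yield i, j, seq[i:i + k] + seq[j:j + k]


# Hypothetical example sequence and parameters.
example_seq = "ACGTTGCATGCCGATAGCTAGGCTA"
example_seeds = list(randstrobes(example_seq, k=3, w_min=4, w_max=8))
```

Because the position of the second strobe depends on hash values rather than on a fixed offset, an indel between the two strobes does not necessarily destroy every seed that spans it, which is the intuition behind the sensitivity at high indel rates described above.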
Genome Research; https://doi.org/10.1101/gr.277674.123
Abstract:
The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web-services called Beacons. However, even such limited releases are susceptible to likelihood-ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses and privacy with respect to membership-inference attacks based on likelihood-ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood-ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the dataset and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is either in the form of summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public datasets.
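As an illustration of the first attack model, here is a minimal sketch of a likelihood-ratio membership test against released allele frequencies. It assumes a simplified haploid setting with binary alleles, and the function names, the clipping constant, and the fixed threshold are hypothetical; the paper's exact formulation may differ.

```python
import math


def log_likelihood_ratio(genotype, pool_freqs, ref_freqs, eps=1e-6):
    """Log-likelihood ratio of 'in the pool' versus 'in the reference population'.

    genotype   : 0/1 allele indicators for the target individual (haploid assumption)
    pool_freqs : released alternate-allele frequencies of the dataset
    ref_freqs  : alternate-allele frequencies of a reference population
    """
    score = 0.0
    for d, x, p in zip(genotype, pool_freqs, ref_freqs):
        # Clip frequencies away from 0 and 1 so the logarithms stay finite.
        x = min(max(x, eps), 1 - eps)
        p = min(max(p, eps), 1 - eps)
        score += d * math.log(x / p) + (1 - d) * math.log((1 - x) / (1 - p))
    return score


def claims_membership(genotype, pool_freqs, ref_freqs, threshold=0.0):
    """The attacker claims membership when the score exceeds a chosen threshold."""
    return log_likelihood_ratio(genotype, pool_freqs, ref_freqs) > threshold


# Hypothetical frequencies: the pool deviates from the reference at two variants.
ref = [0.1, 0.5, 0.9]
pool = [0.3, 0.5, 0.7]
member_like = [1, 0, 0]
reference_like = [0, 0, 1]
```

Suppressing a variant removes its term from the sum, and adding noise to a released frequency shrinks the gap between `pool_freqs` and `ref_freqs`, which is how the defenses mentioned above reduce the attacker's score.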
Genome Research; https://doi.org/10.1101/gr.277467.122
Abstract:
The diversity outbred (DO) mice and their inbred founders are widely used models of human disease. However, although the genetic diversity of these mice has been well documented, their epigenetic diversity has not. Epigenetic modifications, such as histone modifications and DNA methylation, are important regulators of gene expression, and as such are a critical mechanistic link between genotype and phenotype. Therefore, creating a map of epigenetic modifications in the DO mice and their founders is an important step toward understanding mechanisms of gene regulation and the link to disease in this widely used resource. To this end, we performed a strain survey of epigenetic modifications in hepatocytes of the DO founders. We surveyed four histone modifications (H3K4me1, H3K4me3, H3K27me3, and H3K27ac), and DNA methylation. We used ChromHMM to identify 14 chromatin states, each of which represented a distinct combination of the four histone modifications. We found that the epigenetic landscape was highly variable across the DO founders and was associated with variation in gene expression across strains. We found that epigenetic state imputed into a population of DO mice recapitulated the association with gene expression seen in the founders suggesting that both histone modifications and DNA methylation are highly heritable mechanisms of gene expression regulation. We illustrate how DO gene expression can be aligned with inbred epigenetic states to identify putative cis-regulatory regions. Finally, we provide a data resource that documents strain-specific variation in chromatin state and DNA methylation in hepatocytes across nine widely used strains of laboratory mice.
Genome Research; https://doi.org/10.1101/gr.277669.123
Abstract:
The reconstruction of phylogenetic networks is an important but challenging problem in phylogenetics and genome evolution, as the space of phylogenetic networks is vast and cannot be sampled well. One approach to the problem is to solve the minimum phylogenetic network problem, in which phylogenetic trees are first inferred, then the smallest phylogenetic network that displays all the trees is computed. The approach takes advantage of the fact that the theory of phylogenetic trees is mature and there are excellent tools available for inferring phylogenetic trees from a large number of bio-molecular sequences. A tree-child network is a phylogenetic network satisfying the condition that every non-leaf node has at least one child that is of indegree one. Here, we develop a new method that infers the minimum tree-child network by aligning lineage taxon strings in the phylogenetic trees. This algorithmic innovation enables us to get around the limitations of the existing programs for phylogenetic network inference. Our new program, named ALTS, is fast enough to infer a tree-child network with a large number of reticulations for a set of up to 50 phylogenetic trees with 50 taxa that have only trivial common clusters in about a quarter of an hour on average.
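The tree-child condition stated in the abstract can be checked directly on a network's edge list. The sketch below only tests that condition; it is unrelated to how ALTS infers networks, it does not verify that the input is a valid rooted acyclic network, and the function name and example networks are hypothetical.

```python
from collections import defaultdict


def is_tree_child(edges):
    """Return True if every non-leaf node has a child of indegree one.

    edges: iterable of (parent, child) pairs describing a directed network.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # Only nodes with at least one child (non-leaf nodes) appear as keys.
    return all(
        any(indegree[c] == 1 for c in kids)
        for kids in children.values()
    )


# One reticulation (h has two parents); every internal node keeps a tree child.
network_ok = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
              ("b", "h"), ("b", "y"), ("h", "z")]

# Both children of a (and of b) are reticulations, so the condition fails.
network_bad = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
               ("b", "h1"), ("b", "h2"), ("h1", "x"), ("h2", "y")]
```

In `network_bad`, nodes `a` and `b` each have only reticulation nodes (indegree two) as children, which is exactly the situation the tree-child condition excludes.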
Genome Research; https://doi.org/10.1101/gr.277677.123
Abstract:
The assay for transposase-accessible chromatin with sequencing (ATAC-seq) is a common assay to identify chromatin accessible regions by using a Tn5 transposase that can access, cut, and ligate adapters to DNA fragments for subsequent amplification and sequencing. These sequenced regions are quantified and tested for enrichment in a process referred to as "peak calling". Most unsupervised peak calling methods are based on simple statistical models and suffer from elevated false positive rates. Newly developed supervised deep learning methods can be successful, but they rely on high quality labeled data for training, which can be difficult to obtain. Moreover, though biological replicates are recognized to be important, there are no established approaches for using replicates in the deep learning tools, and the approaches available for traditional methods either cannot be applied to ATAC-seq, where control samples may be unavailable, or are post-hoc and do not capitalize on potentially complex, but reproducible signal in the read enrichment data. Here, we propose a novel peak caller that uses unsupervised contrastive learning to extract shared signals from multiple replicates. Raw coverage data are encoded to obtain low-dimensional embeddings and optimized to minimize a contrastive loss over biological replicates. These embeddings are passed to another contrastive loss for learning and predicting peaks and decoded to denoised data under an autoencoder loss. We compared our Replicative Contrastive Learner (RCL) method with other existing methods on ATAC-seq data, using annotations from ChromHMM genome and transcription factor ChIP-seq as noisy truth. RCL consistently achieved the best performance.
Genome Research; https://doi.org/10.1101/gr.277629.122
Abstract:
Summary methods are widely employed to estimate species trees from genome-scale data. However, they can fail to produce accurate species trees when the input gene trees are highly discordant due to estimation error and biological processes, like incomplete lineage sorting. Here, we introduce TREE-QMC, a new summary method that offers accuracy and scalability under these challenging scenarios. TREE-QMC builds upon weighted Quartet Max Cut, which takes weighted quartets as input and then constructs a species tree in a divide-and-conquer fashion, at each step forming a graph and seeking its max cut. The wQMC method has been successfully leveraged in the context of species tree estimation by weighting quartets by their frequencies in the gene trees; we improve upon this approach in two ways. First, we address accuracy by normalizing the quartet weights to account for "artificial taxa" introduced during the divide phase so subproblem solutions can be combined during the conquer phase. Second, we address scalability by introducing an algorithm to construct the graph directly from the gene trees; this gives TREE-QMC a time complexity of O(n^3 k), where n is the number of species and k is the number of gene trees, assuming the subproblem decomposition is perfectly balanced. These contributions enable TREE-QMC to be highly competitive in terms of species tree accuracy and empirical runtime with the leading quartet-based methods, even outperforming them on some model conditions explored in our simulation study. We also present the application of these methods to an avian phylogenomics data set.
Genome Research; https://doi.org/10.1101/gr.277664.123
Abstract:
Mendelian Randomization (MR) has emerged as a powerful approach to leverage genetic instruments to infer causality between pairs of traits in observational studies. However, the results of such studies are susceptible to biases due to weak instruments as well as the confounding effects of population stratification and horizontal pleiotropy. Here, we show that family data can be leveraged to design MR tests that are provably robust to confounding from population stratification, assortative mating, and dynastic effects. We demonstrate in simulations that our approach, MR-Twin, is robust to confounding from population stratification and is not affected by weak instrument bias, while standard MR methods yield inflated false positive rates. We then conducted an exploratory analysis of MR-Twin and other MR methods applied to 121 trait pairs in the UK Biobank dataset. Our results suggest that confounding from population stratification can lead to false positives for existing MR methods, while MR-Twin is immune to this type of confounding, and that MR-Twin can help assess whether traditional approaches may be inflated due to confounding from population stratification.
Genome Research; https://doi.org/10.1101/gr.277585.122
Abstract:
Killer immunoglobulin-like receptor (KIR) genes and human leukocyte antigen (HLA) genes play important roles in innate and adaptive immunity. They are highly polymorphic and cannot be genotyped with standard variant calling pipelines. Compared with HLA genes, many KIR genes are similar to each other in sequences and may be absent in the chromosomes. Therefore, while many tools have been developed to genotype HLA genes using common sequencing data, none of them works for KIR genes. Even the specialized KIR genotypers could not resolve all the KIR genes. Here we describe T1K, a novel computational method for the efficient and accurate inference of KIR or HLA alleles from RNA-seq, whole genome sequencing or whole exome sequencing data. T1K jointly considers alleles across all genotyped genes, so it can reliably identify present genes and distinguish homologous genes, including the challenging KIR2DL5A/KIR2DL5B genes. This model also benefits HLA genotyping, where T1K achieves the highest accuracy in benchmarks. Moreover, T1K can call novel single nucleotide variants and process single-cell data. Applying T1K to tumor single-cell RNA-seq data, we found that KIR2DL4 expression was enriched in tumor-specific CD8+ T cells. T1K may open the opportunity for HLA and KIR genotyping across various sequencing applications.
Genome Research; https://doi.org/10.1101/gr.277581.122
Abstract:
The mammalian suprachiasmatic nucleus (SCN), located in the ventral hypothalamus, synchronises and maintains daily cellular and physiological rhythms across the body, in accordance with environmental and visceral cues. Consequently, the systematic regulation of spatiotemporal gene transcription in the SCN is vital for daily timekeeping. So far, the regulatory elements assisting circadian gene transcription have only been studied in peripheral tissues, lacking the critical neuronal dimension intrinsic to the role of the SCN as central brain pacemaker. By using histone-ChIP-seq, we identified SCN-enriched gene regulatory elements that associated with temporal gene expression. Based on tissue-specific H3K27ac and H3K4me3 marks we successfully produced the first-ever SCN gene-regulatory map. We found that a large majority of SCN enhancers not only exhibit robust 24-hour rhythmic modulation in H3K27ac occupancy, peaking at distinct times-of-day, but also possess canonical E-box (CACGTG) motifs potentially influencing downstream cycling gene expression. To establish enhancer-gene relationships in the SCN, we conducted directional RNA-seq at six distinct times across day and night and studied the association between dynamically changing histone acetylation and gene transcript levels. About 35% of the cycling H3K27ac sites were found adjacent to rhythmic gene transcripts, often preceding the rise in mRNA levels. We also noted that enhancers encompass noncoding actively transcribing enhancer RNAs (eRNAs) in the SCN, which in turn oscillate, along with cyclic histone acetylation, and correlate with rhythmic gene transcription. Taken together, these findings shed light on genome-wide pretranscriptional regulation operative in the central clock that confers its precise and robust oscillation necessary to orchestrate daily timekeeping in mammals.
Genome Research; https://doi.org/10.1101/gr.276779.122
Abstract:
Hummingbirds are very well adapted to sustain efficient and rapid metabolic shifts. They oxidize ingested nectar to directly fuel flight when foraging but have to switch to oxidizing stored lipids derived from ingested sugars during the night or long-distance migratory flights. Understanding how this organism moderates energy turnover is hampered by a lack of information regarding how relevant enzymes differ in sequence, expression, and regulation. To explore these questions, we generated a chromosome scale genome assembly of the ruby-throated hummingbird (A. colubris) using a combination of long- and short-read sequencing, scaffolding it using existing assemblies. We then used hybrid long- and short-read RNA sequencing of liver and muscle tissue in fasted and fed metabolic states for a comprehensive transcriptome assembly and annotation. Our genomic and transcriptomic data found positive selection of key metabolic genes in nectivorous avian species and deletion of critical genes (SLC2A4, GCK) involved in glucostasis in other vertebrates. We found expression of a fructose-specific version of SLC2A5 putatively in place of insulin-sensitive SLC2A5, with predicted protein models suggesting affinity for both fructose and glucose. Alternative isoforms may even act to sequester fructose to preclude limitations from transport in metabolism. Finally, we identified differentially expressed genes from fasted and fed hummingbirds suggesting key pathways for the rapid metabolic switch hummingbirds undergo.