Latest articles in this journal
Genome Research; doi:10.1101/gr.264465.120
Neisseria meningitidis (the meningococcus) is a major human pathogen with a history of high invasive disease burden, particularly in sub-Saharan Africa. Our current understanding of the evolution of meningococcal genomes is limited by the rarity of large-scale genomic population studies and lack of in-depth investigation of the genomic events associated with routine pathogen transmission. Here, we fill this knowledge gap by a detailed analysis of 2839 meningococcal genomes obtained through a carriage study of over 50,000 samples collected systematically in Burkina Faso, West Africa, before, during, and after the serogroup A vaccine rollout, 2009–2012. Our findings indicate that the meningococcal genome is highly dynamic, with highly recombinant loci and frequent gene sharing across deeply separated lineages in a structured population. Furthermore, our findings illustrate how population structure can correlate with genome flexibility, as some lineages in Burkina Faso are orders of magnitude more recombinant than others. We also examine the effect of selection on the population, in particular how it is correlated with recombination. We find that recombination principally acts to prevent the accumulation of deleterious mutations, although we do also find an example of recombination acting to speed the adaptation of a gene. In general, we show the importance of recombination in the evolution of a geographically expansive population with deep population structure in a short timescale. This has important consequences for our ability to both foresee the outcomes of vaccination programs and, using surveillance data, predict when lineages of the meningococcus are likely to become a public health concern.
Genome Research; doi:10.1101/gr.275193.120
Sequencing technologies utilizing nucleotide conversion techniques such as cytosine-to-thymine in bisulfite-seq and thymine-to-cytosine in SLAM-seq are powerful tools to explore the chemical intricacies of cellular processes. To date, no one has developed a unified methodology for aligning converted sequences and consolidating alignment of these technologies in one package. In this paper, we describe HISAT-3N (hierarchical indexing for spliced alignment of transcripts - 3 nucleotides), which can rapidly and accurately align sequences consisting of any nucleotide conversion by leveraging the powerful hierarchical index and repeat index algorithms originally developed for the HISAT software. Tests on real and simulated data sets demonstrate that HISAT-3N is faster than other modern systems, with greater alignment accuracy, higher scalability, and smaller memory requirements. HISAT-3N therefore becomes an ideal aligner when used with converted sequence technologies.
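The reduced-alphabet idea behind aligning converted reads can be sketched in a few lines. This is only a minimal illustration, assuming exact forward-strand matching on plain strings; it is not HISAT-3N's implementation, which relies on hierarchical and repeat indexes. The function names `convert` and `find_converted_hits` are hypothetical.

```python
def convert(seq: str, src: str = "C", dst: str = "T") -> str:
    """Collapse the alphabet to three letters by rewriting src as dst."""
    return seq.upper().replace(src, dst)


def find_converted_hits(read: str, reference: str, src: str = "C", dst: str = "T"):
    """Return (position, n_converted) for each exact match in 3-letter space.

    Both read and reference are converted before matching, so a read matches
    whether or not its src bases were chemically converted. The original
    sequences are then compared to count the bases that were converted.
    """
    conv_read = convert(read, src, dst)
    conv_ref = convert(reference, src, dst)
    read_upper = read.upper()
    hits = []
    start = conv_ref.find(conv_read)
    while start != -1:
        window = reference[start:start + len(read)].upper()
        # A conversion event: the reference has src where the read shows dst.
        n_converted = sum(
            1 for r, g in zip(read_upper, window) if g == src and r == dst
        )
        hits.append((start, n_converted))
        start = conv_ref.find(conv_read, start + 1)
    return hits


# Toy example: the read is "ACGT" with its C converted to T.
hits = find_converted_hits("ATGT", "AACCGGTTACGT")
print(hits)
```

Passing `src="T", dst="C"` gives the SLAM-seq direction with the same code. A practical aligner would also search the reverse strand and tolerate mismatches.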
Genome Research; doi:10.1101/gr.268490.120
Regulatory interactions mediated by transcription factors (TFs) make up complex networks that control cellular behavior. Fully understanding these gene regulatory networks (GRNs) offers greater insight into the consequences of disease-causing perturbations than can be achieved by studying single TF binding events in isolation. Chromosomal translocations of the lysine methyltransferase 2A (KMT2A) produce KMT2A fusion proteins such as KMT2A-AFF1, causing poor prognosis acute lymphoblastic leukemias (ALLs) that sometimes relapse as acute myeloid leukemias (AMLs). KMT2A-AFF1 is thought to drive leukemogenesis through direct binding and inducing aberrant overexpression of key gene targets, such as the anti-apoptotic factor BCL2 and the proto-oncogene MYC. However, studying direct binding alone does not allow for network generated regulatory outputs, including the indirect induction of gene repression. To better understand the KMT2A-AFF1 driven regulatory landscape, we integrated ChIP-seq, patient RNA-seq and CRISPR essentiality screens to generate a model GRN. This GRN identified several key transcription factors, including RUNX1, that regulate target genes downstream of KMT2A-AFF1 using feed-forward loop (FFL) and cascade motifs. A core set of nodes is present in both ALL and AML, and CRISPR screening revealed several factors that help mediate response to the drug venetoclax. Using our GRN, we then identified a KMT2A-AFF1:RUNX1 cascade that represses CASP9, as well as KMT2A-AFF1 driven FFLs that regulate BCL2 and MYC through combinatorial TF activity. This illustrates how our GRN can be used to better connect KMT2A-AFF1 behavior to downstream pathways that contribute to leukemogenesis, and potentially predict shifts in gene expression that mediate drug response.
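For readers unfamiliar with the motif, a feed-forward loop in a directed regulatory network is a triple in which A regulates B, B regulates C, and A also regulates C directly; a cascade is the chain A to B to C without the direct A to C edge. The sketch below finds FFLs in an edge list. It is a generic illustration, not the authors' pipeline, and the node labels are placeholders rather than genes or results from the study.

```python
def find_feed_forward_loops(edges):
    """Return sorted (a, b, c) triples where a->b, b->c and a->c all exist."""
    targets = {}
    for source, target in edges:
        targets.setdefault(source, set()).add(target)

    loops = []
    for a, a_targets in targets.items():
        for b in a_targets:
            if b == a:
                continue  # ignore self-regulation
            for c in targets.get(b, ()):
                # c must be a third, distinct node that a also targets directly
                if c != a and c != b and c in a_targets:
                    loops.append((a, b, c))
    return sorted(loops)


# Placeholder network: X->Y->Z with a direct X->Z edge forms one FFL;
# Z->W extends the network as a cascade without closing another loop.
example_edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
print(find_feed_forward_loops(example_edges))
```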
Genome Research; doi:10.1101/gr.275569.121
Single-cell genomics is rapidly advancing our knowledge of the diversity of cell phenotypes, both cell types and cell states. Driven by single-cell/nucleus RNA sequencing (scRNA-seq), comprehensive atlas projects covering a wide range of organisms and tissues are currently underway. As a result, it is critical that the transcriptional phenotypes discovered are defined and disseminated in a consistent and concise manner. Molecular biomarkers have historically played an important role in biological research, from defining immune cell types by surface protein expression to defining diseases by their molecular drivers. Here we describe a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, which leverages the nonlinear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that optimally capture the cell type identity represented in complete scRNA-seq transcriptional profiles. The marker genes selected provide an expression barcode that serves as both a useful tool for downstream biological investigation and the necessary and sufficient characteristics for semantic cell type definition. The use of NS-Forest to identify marker genes for human brain middle temporal gyrus cell types reveals the importance of cell signaling and noncoding RNAs in neuronal cell type identity.
Genome Research; doi:10.1101/gr.273771.120
In animals, distant H3K27me3-marked Polycomb targets can establish physical interactions forming repressive chromatin hubs. In plants, growing evidence suggests that H3K27me3 acts directly or indirectly to regulate chromatin interactions, although how this histone modification modulates 3D chromatin architecture remains elusive. To decipher the impact of the dynamic deposition of H3K27me3 on the Arabidopsis thaliana nuclear interactome, we combined genetics, transcriptomics and alternative 3D epigenomic approaches. By analyzing mutants defective for histone H3K27 methylation or demethylation, we uncovered the crucial role of this chromatin mark in short- and previously unnoticed long-range chromatin loop formation. We found that a reduction in H3K27me3 led to a decrease in the interactions within Polycomb-associated repressive domains. Regions with lower H3K27me3 levels in the H3K27 methyltransferase clf mutant established new interactions with regions marked with H3K9ac, a histone modification associated with active transcription, thus indicating that a reduction in H3K27me3 levels induces a global reconfiguration of chromatin architecture. Altogether, our results reveal that the 3D genome organization is tightly linked to reversible histone modifications that govern chromatin interactions. Consequently, nuclear organization dynamics shapes the transcriptional reprogramming during plant development and places H3K27me3 as a key feature in the coregulation of distant genes.
Genome Research; doi:10.1101/gr.266528.120
Thousands of species will be sequenced in the next few years; however, understanding how their genomes work without an unlimited budget requires both molecular and novel evolutionary approaches. We developed a sensitive sequence alignment pipeline to identify conserved noncoding sequences (CNSs) in the Andropogoneae tribe (multiple crop species descended from a common ancestor ~18 million years ago). The Andropogoneae share similar physiology while being tremendously genomically diverse, harboring a broad range of ploidy levels, structural variation, and transposons. These contribute to the potential of Andropogoneae as a powerful system for studying CNSs and are factors we leverage to understand the function of maize CNSs. We found that 86% of CNSs were comprised of annotated features, including introns, UTRs, putative cis-regulatory elements, chromatin loop anchors, noncoding RNA genes, and several transposable element superfamilies. CNSs were enriched in active regions of DNA replication in the early S phase of the mitotic cell cycle and showed different DNA methylation ratios compared to the genome-wide background. More than half of putative cis-regulatory sequences (identified via other methods) overlapped with CNSs detected in this study. Variants in CNSs were associated with gene expression levels, and CNS absence contributed to loss of gene expression. Furthermore, the evolution of CNSs was associated with the functional diversification of duplicated genes in the context of maize subgenomes. Our results provide a quantitative understanding of the molecular processes governing the evolution of CNSs in maize.
Genome Research; doi:10.1101/gr.271346.120
Alternative polyadenylation (APA) is a major mechanism of post-transcriptional regulation in various cellular processes including cell proliferation and differentiation, but the APA heterogeneity among single cells remains largely unknown. Single-cell RNA sequencing (scRNA-seq) has been extensively used to define cell subpopulations at the transcription level. Yet, most scRNA-seq data have not been analyzed in an "APA-aware" manner. Here, we introduce scDaPars (Dynamic Analysis of Alternative PolyAdenylation from Single-cell RNA-seq), a bioinformatics algorithm to accurately quantify APA events at both single-cell and single-gene resolution using either 3’ end (10x Chromium) or full-length (Smart-seq2) scRNA-seq data. Validations in both real and simulated data indicate that scDaPars can robustly recover missing APA events caused by the low amounts of mRNA sequenced in single cells. When applied to cancer and human endoderm differentiation data, scDaPars not only revealed cell type-specific APA regulation but also identified cell subpopulations that are otherwise invisible to conventional gene expression analysis. Thus, scDaPars will enable us to understand cellular heterogeneity at the post-transcriptional APA level.
Genome Research; doi:10.1101/gr.271288.120
Recent technological advances have enabled spatially resolved measurements of expression profiles for hundreds to thousands of genes in fixed tissues at single-cell resolution. However, scalable computational analysis methods able to take into consideration the inherent 3D spatial organization of cell types and nonuniform cellular densities within tissues are still lacking. To address this, we developed MERINGUE, a computational framework based on spatial auto-correlation and cross-correlation analysis to identify genes with spatially heterogeneous expression patterns, infer putative cell-cell communication, and perform spatially informed cell clustering in 2D and 3D in a density-agnostic manner using spatially resolved transcriptomics data. We applied MERINGUE to a variety of spatially resolved transcriptomics datasets including multiplexed error-robust fluorescence in situ hybridization (MERFISH), spatial transcriptomics, Slide-Seq, and aligned in situ hybridization (ISH) data. We anticipate that such statistical analysis of spatially resolved transcriptomics data will facilitate our understanding of the interplay between cell state and spatial organization in tissue development and disease.
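Spatial autocorrelation of a gene's expression is commonly summarized with Moran's I: I = (N / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where w_ij are spatial weights between cells and W is their sum. The sketch below computes this statistic from an explicit weight matrix. It is a generic illustration of the statistic, not MERINGUE's code; how neighbors and weights are defined is an assumption left to the caller, and the function name `morans_i` is hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for one gene.

    values  : expression per cell (length N)
    weights : N x N nested list, weights[i][j] > 0 if cells i and j are
              spatial neighbors (assumed here: binary adjacency, zero diagonal)
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]

    variance_sum = sum(d * d for d in deviations)
    if variance_sum == 0:
        raise ValueError("expression is constant; Moran's I is undefined")

    total_weight = sum(sum(row) for row in weights)
    if total_weight == 0:
        raise ValueError("no neighbor relationships in the weight matrix")

    # Weighted cross-product of deviations over all neighbor pairs.
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / total_weight) * (cross / variance_sum)


# Four cells on a line, each adjacent to its immediate neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i([1, 1, 0, 0], chain))  # positive: similar values cluster
print(morans_i([1, 0, 1, 0], chain))  # negative: neighbors alternate
```

Values near zero indicate no spatial structure. Significance would normally be assessed against a permutation or analytic null, which is omitted here.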
Genome Research; doi:10.1101/gr.268037.120
Whereas the neurological effects of cocaine have been well documented, effects of acute cocaine consumption on genome-wide gene expression across the brain remain largely unexplored. This question cannot be readily addressed in humans but can be approached using the Drosophila melanogaster model, where gene expression in the entire brain can be surveyed at once. Flies exposed to cocaine show impaired locomotor activity, including climbing behavior and startle response (a measure of sensorimotor integration), and increased incidence of seizures and compulsive grooming. To identify specific cell populations that respond to acute cocaine exposure, we analyzed single-cell transcriptional responses in duplicate samples of flies that consumed fixed amounts of sucrose or sucrose supplemented with cocaine, in both sexes. Unsupervised clustering of the transcriptional profiles of a total of 86,224 cells yielded 36 distinct clusters. Annotation of clusters based on gene markers revealed that all major cell types (neuronal and glial) as well as neurotransmitter types from most brain regions were represented. The brain transcriptional responses to cocaine showed profound sexual dimorphism and were considerably more pronounced in males than females. Differential expression analysis within individual clusters indicated cluster-specific responses to cocaine. Clusters corresponding to Kenyon cells of the mushroom bodies and glia showed especially large transcriptional responses following cocaine exposure. Cluster specific coexpression networks and global interaction networks revealed a diverse array of cellular processes affected by acute cocaine exposure. These results provide an atlas of sexually dimorphic cocaine-modulated gene expression in a model brain.
Genome Research; doi:10.1101/gr.271874.120
Recent development of single-cell RNA-seq (scRNA-seq) technologies has led to enormous biological discoveries. As the scale of scRNA-seq studies increases, a major challenge in analysis is batch effects, which are inevitable in studies involving human tissues. Most existing methods remove batch effects in a low-dimensional embedding space. Although useful for clustering, batch effects are still present in the gene expression space, leaving downstream gene-level analysis susceptible to batch effects. Recent studies have shown that batch effect correction in the gene expression space is much harder than in the embedding space. Popular methods such as Seurat3.0 rely on the mutual nearest neighbor (MNN) approach to remove batch effects in gene expression, but MNN can only analyze two batches at a time and it becomes computationally infeasible when the number of batches is large. Here we present CarDEC, a joint deep learning model that simultaneously clusters and denoises scRNA-seq data while correcting batch effects both in the embedding and the gene expression space. Comprehensive evaluations spanning different species and tissues showed that CarDEC outperforms Scanorama, DCA + Combat, scVI, and MNN. With CarDEC denoising, non-highly variable genes offer as much signal for clustering as the highly variable genes (HVGs), suggesting that CarDEC substantially boosted information content in scRNA-seq. We also showed that trajectory analysis using CarDEC's denoised and batch corrected expression as input revealed marker genes and transcription factors that are otherwise obscured in the presence of batch effects. CarDEC is computationally fast, making it a desirable tool for large-scale scRNA-seq studies.