ISSN / EISSN : 1471-2105 / 1471-2105
Current Publisher: Springer Science and Business Media LLC (10.1186)Former Publisher:
Total articles ≅ 10,260
Latest articles in this journal
BMC Bioinformatics, Volume 22, pp 1-14; doi:10.1186/s12859-021-04212-6
Background Except for bacteria, the taxonomic diversity of the human fecal metagenome has not been widely studied, despite the potential importance of viruses and eukaryotes. Widely used bioinformatic tools contain limited numbers of non-bacterial species in their databases compared to available genomic sequences and their methodologies do not favour classification of rare sequences which may represent only a small fraction of their parent genome. In seeking to optimise identification of non-bacterial species, we evaluated five widely-used metagenome classifier programs (BURST, Kraken2, Centrifuge, MetaPhlAn2 and CCMetagen) for their ability to correctly assign and count simulations of bacterial, viral and eukaryotic DNA sequence reads, including the effect of taxonomic order of analysis of bacteria, viruses and eukaryotes and the effect of sequencing depth. Results We found that the precision of metagenome classifiers varied significantly between programs and between taxonomic groups. When classifying viruses and eukaryotes, ordering the analysis such that bacteria were classified first significantly improved classification precision. Increasing sequencing depth decreased classification precision and did not improve recall of rare species. Conclusions Choice of metagenome classifier program can have a marked effect on results with respect to precision of species assignment in different taxonomic groups. The order of taxonomic classification can markedly improve precision. Increasing sequencing depth can decrease classification precision and yields diminishing returns in probability of species detection.
BMC Bioinformatics, Volume 22, pp 1-27; doi:10.1186/s12859-021-04150-3
Background Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches. Results Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed on purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data by different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions. Conclusions scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate multi single-cell omics.
BMC Bioinformatics, Volume 22, pp 1-15; doi:10.1186/s12859-021-04231-3
Background Circular RNAs (circRNAs) are a class of single-stranded RNA molecules with a closed-loop structure. A growing body of research has shown that circRNAs are closely related to the development of diseases. Because biological experiments to verify circRNA-disease associations are time-consuming and wasteful of resources, it is necessary to propose a reliable computational method to predict the potential candidate circRNA-disease associations for biological experiments to make them more efficient. Results In this paper, we propose a double matrix completion method (DMCCDA) for predicting potential circRNA-disease associations. First, we constructed a similarity matrix of circRNA and disease according to circRNA sequence information and semantic disease information. We also built a Gauss interaction profile similarity matrix for circRNA and disease based on experimentally verified circRNA-disease associations. Then, the corresponding circRNA sequence similarity and semantic similarity of disease are used to update the association matrix from the perspective of circRNA and disease, respectively, by matrix multiplication. Finally, from the perspective of circRNA and disease, matrix completion is used to update the matrix block, which is formed by splicing the association matrix obtained in the previous step with the corresponding Gaussian similarity matrix. Compared with other approaches, the model of DMCCDA has a relatively good result in leave-one-out cross-validation and five-fold cross-validation. Additionally, the results of the case studies illustrate the effectiveness of the DMCCDA model. Conclusion The results show that our method works well for recommending the potential circRNAs for a disease for biological experiments.
BMC Bioinformatics, Volume 22, pp 1-21; doi:10.1186/s12859-021-04216-2
Background Even when microbial communities vary wildly in their taxonomic composition, their functional composition is often surprisingly stable. This suggests that a functional perspective could provide much deeper insight into the principles governing microbiome assembly. Much work to date analyzing the functional composition of microbial communities, however, relies heavily on inference from genomic features. Unfortunately, output from these methods can be hard to interpret and often suffers from relatively high error rates. Results We built and analyzed a domain-specific microbial trait database from known microbe-trait pairs recorded in the literature to better understand the functional composition of the human microbiome. Using a combination of phylogentically conscious machine learning tools and a network science approach, we were able to link particular traits to areas of the human body, discover traits that determine the range of body areas a microbe can inhabit, and uncover drivers of metabolic breadth. Conclusions Domain-specific trait databases are an effective compromise between noisy methods to infer complex traits from genomic data and exhaustive, expensive attempts at database curation from the literature that do not focus on any one subset of taxa. They provide an accurate account of microbial traits and, by limiting the number of taxa considered, are feasible to build within a reasonable time-frame. We present a database specific for the human microbiome, in the hopes that this will prove useful for research into the functional composition of human-associated microbial communities.
BMC Bioinformatics, Volume 22, pp 1-23; doi:10.1186/s12859-021-04118-3
Background Long-read sequencing is revolutionizing genome assembly: as PacBio and Nanopore technologies become more accessible in technicity and in cost, long-read assemblers flourish and are starting to deliver chromosome-level assemblies. However, these long reads are usually error-prone, making the generation of a haploid reference out of a diploid genome a difficult enterprise. Failure to properly collapse haplotypes results in fragmented and structurally incorrect assemblies and wreaks havoc on orthology inference pipelines, yet this serious issue is rarely acknowledged and dealt with in genomic projects, and an independent, comparative benchmark of the capacity of assemblers and post-processing tools to properly collapse or purge haplotypes is still lacking. Results We tested different assembly strategies on the genome of the rotifer Adineta vaga, a non-model organism for which high coverages of both PacBio and Nanopore reads were available. The assemblers we tested (Canu, Flye, NextDenovo, Ra, Raven, Shasta and wtdbg2) exhibited strikingly different behaviors when dealing with highly heterozygous regions, resulting in variable amounts of uncollapsed haplotypes. Filtering reads generally improved haploid assemblies, and we also benchmarked three post-processing tools aimed at detecting and purging uncollapsed haplotypes in long-read assemblies: HaploMerger2, purge_haplotigs and purge_dups. Conclusions We provide a thorough evaluation of popular assemblers on a non-model eukaryote genome with variable levels of heterozygosity. Our study highlights several strategies using pre and post-processing approaches to generate haploid assemblies with high continuity and completeness. This benchmark will help users to improve haploid assemblies of non-model organisms, and evaluate the quality of their own assemblies.
BMC Bioinformatics, Volume 22, pp 1-22; doi:10.1186/s12859-021-04185-6
Background The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. Results We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. Conclusions We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
BMC Bioinformatics, Volume 22, pp 1-16; doi:10.1186/s12859-021-04235-z
Background Early detection of bladder cancer remains challenging because patients with early-stage bladder cancer usually have no incentive to take cytology or cystoscopy tests if they are asymptomatic. Our goal is to find non-invasive marker candidates that may help us gain insight into the metabolism of early-stage bladder cancer and be examined in routine health checks. Results We acquired urine samples from 124 patients diagnosed with early-stage bladder cancer or hernia (63 cancer patients and 61 controls). In which 100 samples were included in our marker discovery cohort, and the remaining 24 samples were included in our independent test cohort. We obtained metabolic profiles of 922 compounds of the samples by gas chromatography-mass spectrometry. Based on the metabolic profiles of the marker discovery cohort, we selected marker candidates using Wilcoxon rank-sum test with Bonferroni correction and leave-one-out cross-validation; we further excluded compounds detected in less than 60% of the bladder cancer samples. We finally selected eight putative markers. The abundance of all the eight markers in bladder cancer samples was high but extremely low in hernia samples. Moreover, the up-regulation of these markers might be in association with sugars and polyols metabolism. Conclusions In the present study, comparative urine metabolomics selected putative metabolite markers for the detection of early-stage bladder cancer. The suggested relations between early-stage bladder cancer and sugars and polyols metabolism may create opportunities for improving the detection of bladder cancer.
BMC Bioinformatics, Volume 22, pp 1-22; doi:10.1186/s12859-021-04042-6
Background Quantitative proteomics studies are often used to detect proteins that are differentially expressed across different experimental conditions. Functional enrichment analyses are then typically used to detect annotations, such as biological processes that are significantly enriched among such differentially expressed proteins to provide insights into the molecular impacts of the studied conditions. While common, this analytical pipeline often heavily relies on arbitrary thresholds of significance. However, a functional annotation may be dysregulated in a given experimental condition, while none, or very few of its proteins may be individually considered to be significantly differentially expressed. Such an annotation would therefore be missed by standard approaches. Results Herein, we propose a novel graph theory-based method, PIGNON, for the detection of differentially expressed functional annotations in different conditions. PIGNON does not assess the statistical significance of the differential expression of individual proteins, but rather maps protein differential expression levels onto a protein–protein interaction network and measures the clustering of proteins from a given functional annotation within the network. This process allows the detection of functional annotations for which the proteins are differentially expressed and grouped in the network. A Monte-Carlo sampling approach is used to assess the clustering significance of proteins in an expression-weighted network. When applied to a quantitative proteomics analysis of different molecular subtypes of breast cancer, PIGNON detects Gene Ontology terms that are both significantly clustered in a protein–protein interaction network and differentially expressed across different breast cancer subtypes. PIGNON identified functional annotations that are dysregulated and clustered within the network between the HER2+, triple negative and hormone receptor positive subtypes. We show that PIGNON’s results are complementary to those of state-of-the-art functional enrichment analyses and that it highlights functional annotations missed by standard approaches. Furthermore, PIGNON detects functional annotations that have been previously associated with specific breast cancer subtypes. Conclusion PIGNON provides an alternative to functional enrichment analyses and a more comprehensive characterization of quantitative datasets. Hence, it contributes to yielding a better understanding of dysregulated functions and processes in biological samples under different experimental conditions.
BMC Bioinformatics, Volume 22, pp 1-19; doi:10.1186/s12859-021-04201-9
Background Network models are well-established as very useful computational-statistical tools in cell biology. However, a genomic network model based only on gene expression data can, by definition, only infer gene co-expression networks. Hence, in order to infer gene regulatory patterns, it is necessary to also include data related to binding of regulatory factors to DNA. Results We propose a new dynamic genomic network model, for inferring patterns of genomic regulatory influence in dynamic processes such as development. Our model fuses experiment-specific gene expression data with publicly available DNA-binding data. The method we propose is computationally efficient, and can be applied to genome-wide data with tens of thousands of transcripts. Thus, our method is well suited for use as an exploratory tool for genome-wide data. We apply our method to data from human fetal cortical development, and our findings confirm genomic regulatory patterns which are recognised as being fundamental to neuronal development. Conclusions Our method provides a mathematical/computational toolbox which, when coupled with targeted experiments, will reveal and confirm important new functional genomic regulatory processes in mammalian development.
BMC Bioinformatics, Volume 22, pp 1-22; doi:10.1186/s12859-021-04215-3
Background Accurate prognosis and identification of cancer subtypes at molecular level are important steps towards effective and personalised treatments of breast cancer. To this end, many computational methods have been developed to use gene (mRNA) expression data for breast cancer subtyping and prognosis. Meanwhile, microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) have been extensively studied in the last 2 decades and their associations with breast cancer subtypes and prognosis have been evidenced. However, it is not clear whether using miRNA and/or lncRNA expression data helps improve the performance of gene expression based subtyping and prognosis methods, and this raises challenges as to how and when to use these data and methods in practice. Results In this paper, we conduct a comparative study of 35 methods, including 12 breast cancer subtyping methods and 23 breast cancer prognosis methods, on a collection of 19 independent breast cancer datasets. We aim to uncover the roles of miRNAs and lncRNAs in breast cancer subtyping and prognosis from the systematic comparison. In addition, we created an R package, CancerSubtypesPrognosis, including all the 35 methods to facilitate the reproducibility of the methods and streamline the evaluation. Conclusions The experimental results show that integrating miRNA expression data helps improve the performance of the mRNA-based cancer subtyping methods. However, miRNA signatures are not as good as mRNA signatures for breast cancer prognosis. In general, lncRNA expression data does not help improve the mRNA-based methods in both cancer subtyping and cancer prognosis. These results suggest that the prognostic roles of miRNA/lncRNA signatures in the improvement of breast cancer prognosis needs to be further verified.