Journal of Computational Biology
ISSN / EISSN: 1066-5277 / 1557-8666
Published by: Mary Ann Liebert Inc
Total articles ≅ 2,403
Latest articles in this journal
Journal of Computational Biology; https://doi.org/10.1089/cmb.2022.0424
The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such tools are tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be used with highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties than existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.
Journal of Computational Biology; https://doi.org/10.1089/cmb.2022.0422
Triple-negative breast cancer (TNBC) is an aggressive type of breast cancer with heterogeneous tumor biology and is known for its poor clinical outcome. The lack of estrogen, progesterone, and human epidermal growth factor receptors in TNBC tumors leaves fewer treatment options in the clinic. The incidence of TNBC is higher in African American (AA) women than in European American (EA) women, with worse clinical outcomes. The significant factors responsible for the racial disparity in TNBC are socioeconomic lifestyle and tumor biology. The current study considered open-source gene expression data from TNBC samples annotated with racial information. We applied a state-of-the-art Support Vector Machine (SVM) classification method with a recursive feature elimination approach to the gene expression data to identify significant biomarkers deregulated in AA women and EA women. We also included Spearman's rho and Ward's linkage method in our feature selection workflow. Our proposed method generates 24 features/genes that can classify the AA and EA samples with 98% accuracy. We also performed the Kaplan–Meier analysis and log-rank test on the 24 features/genes. Of the 24 genes, we discuss only 2, KLK10 and LRRC37A2, focusing on the correlation between their deregulated expression and cancer progression with a poor survival rate. We believe that further improvement of our method with a larger amount of RNA-seq gene expression data will provide more accurate insight into racial disparity in TNBC.
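To make the survival step concrete, the following is a minimal sketch of a Kaplan–Meier estimator in plain Python. It is not the authors' pipeline: the function name `kaplan_meier` and the toy follow-up data are assumptions made up for illustration, and the log-rank test is not included.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each subject
    events:    1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)   # subjects leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit update: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= exits[t]
    return curve

# Toy data, invented for illustration only (not from the study).
toy_durations = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_durations, toy_events)
```

In a gene-level analysis of this kind, one would typically split samples into high- and low-expression groups for a gene and compare the two resulting curves; how the groups were formed in the study is not stated in the abstract.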
Journal of Computational Biology; https://doi.org/10.1089/cmb.2022.0032
Since the analytical solution of the stochastic age-structured human immunodeficiency virus/acquired immune deficiency syndrome model is difficult to obtain, establishing an efficient numerical approximation is an important way to predict the dynamic behavior of the model. In this article, a full-discrete scheme is proposed, where the Galerkin finite element method and the positivity preserving truncated Euler–Maruyama scheme are used to discretize the age variable and the time variable, respectively. The error between the numerical solution and the analytical solution is analyzed. Finally, the theoretical results are illustrated by numerical examples.
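For readers unfamiliar with the time-stepping idea, here is a minimal sketch of the basic Euler–Maruyama update for a scalar stochastic differential equation dX = f(X) dt + g(X) dW. It is a simplification under stated assumptions: it ignores the age variable and the Galerkin discretization entirely, and the `floor` clamp is a crude positivity safeguard chosen for illustration, not the truncation used in the article. The function name and the example coefficients are hypothetical.

```python
import math
import random

def euler_maruyama(x0, drift, diffusion, t_end, n_steps, rng, floor=1e-12):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW on [0, t_end]."""
    dt = t_end / n_steps
    x = x0
    path = [x]
    for _ in range(n_steps):
        # Brownian increment over one step: normal with variance dt.
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x) * dt + diffusion(x) * dw
        # Crude positivity safeguard (an assumption, NOT the paper's scheme).
        x = max(x, floor)
        path.append(x)
    return path

# Illustrative linear SDE with made-up coefficients.
rng = random.Random(0)
path = euler_maruyama(1.0, lambda x: -0.5 * x, lambda x: 0.3 * x, 1.0, 1000, rng)
```

With the diffusion set to zero the update reduces to the explicit Euler method, which gives a quick sanity check against a known deterministic solution.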
Journal of Computational Biology; https://doi.org/10.1089/cmb.2021.0533
Motivation: Phylogenetic trees are often inferred from a multiple sequence alignment (MSA), where the tree accuracy is heavily impacted by the nature of the estimated alignment. Carefully equipping an MSA tool with multiple application-aware objectives positively impacts its capability to yield better trees. Results: We introduce Multiobjective Application-aware Multiple Sequence Alignment and Maximum Likelihood Ensemble (MAMMLE), a framework for inferring better phylogenetic trees from unaligned sequences by hybridizing two MSA tools [i.e., Multiple Sequence Comparison by Log-Expectation (MUSCLE) and Multiple Alignment using Fast Fourier Transform (MAFFT)] with a multiobjective optimization strategy and leveraging multiple maximum likelihood hypotheses. In our experiments, MAMMLE exhibits 5.57% (4.77%) median improvement (deterioration) over MUSCLE on 50.34% (37.41%) of instances.
Journal of Computational Biology; https://doi.org/10.1089/cmb.2021.0476
Given the wide variability in the quality of next-generation sequencing data submitted to public repositories, it is essential to identify methods that can perform quality control on these data sets when additional quality control data, such as mean tile data, are missing from public repositories. In this study, we present evidence that correlating counts of reads corresponding to pairs of motifs separated over specific distances on individual exons can be used as a proxy for mean tile data in the data sets we analyzed and hence could be used when mean tile data are not available. As test data sets we use the Homo sapiens in vitro transcribed (IVT) data set, and a Drosophila melanogaster data set comprising wild and mutant types. We find that a FastQC analysis of the available parts of these data sets demonstrates that the per-tile sequencing quality is good for all the data sets apart from the mutant-type data, where the mutant-r3 data are worse than the mutant-r2 data. Correspondingly, intra-exon motif correlations are reasonably large for all data sets except this latter case, where the mutant-r2 correlations are low and the mutant-r3 correlations are close to zero. We propose that these extremely low correlations are indicative of bias of technical origin, such as flowcell errors. In addition, the intra-exon motif correlations as a function of both guanosine-cytosine (GC) content parameters are somewhat higher and less dependent on the GC content parameters in the IVT-Plasmids messenger RNA (mRNA) selection free RNA-Seq sample (control) than in the other RNA-Seq samples that did undergo mRNA selection: both ribosomal depletion (IVT-Only) and PolyA selection (IVT-PolyA, wild type, and mutant).
Journal of Computational Biology; https://doi.org/10.1089/cmb.2022.0391
With the rapid spread of COVID-19 worldwide, viral genomic data are available in the order of millions of sequences on public databases such as GISAID. This Big Data creates a unique opportunity for analysis toward the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected; the patterns found between viral variants and geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most of the existing methods face scalability issues when dealing with the magnitude of such data. In this article, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using k-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
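As an illustration of the k-mer representation described above, the sketch below turns a protein sequence into a fixed-length count vector over all k-mers of the 20 standard amino acids. It is a generic k-mer spectrum, not necessarily the exact embedding used in the article; the function name `kmer_vector`, the choice k = 3, and the toy sequence are assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Count every length-k substring and return a vector of len(alphabet)**k counts."""
    # Fixed lexicographic index so every sequence maps to the same coordinates.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vector = [0] * len(index)
    for kmer, count in counts.items():
        position = index.get(kmer)
        if position is not None:  # skip k-mers with ambiguous symbols such as X
            vector[position] = count
    return vector

# Toy sequence, invented for illustration (not a real spike sequence).
toy_vector = kmer_vector("MKVLAAMKV", k=3)
```

Because every sequence maps to the same 20^k coordinates, the resulting vectors can be passed directly to standard classifiers, which is what makes this kind of representation scale to large collections.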
Journal of Computational Biology; https://doi.org/10.1089/cmb.2022.0033
Assay for transposase-accessible chromatin sequencing (ATAC-seq) has become one of the most widely used sequencing methods in studies of gene regulation, aiming to identify open chromatin sites and decipher how chromatin accessibility regulates gene expression. However, biologists with little programming experience or bioinformatics training find it difficult to fully explore and interpret ATAC-seq results. Although several programs and websites allow researchers to analyze and visualize ATAC-seq data, they have several limitations. ATAC-seq data differential expression analysis (ATAC-DEA), a web application that facilitates the exploration and visualization of differential peak analysis and annotation from ATAC-seq data, was developed (www.atac-dea.xyz:3838/ATAC-DEA). ATAC-DEA uses DiffBind and ChIPpeakAnno to process differential peak and annotation analysis results. ATAC-DEA has five features: (1) runs on a web server; (2) processes three files into one small file, which is used as the input for ATAC-DEA; (3) provides various downloadable plots; (4) supports multifactor analysis and customized contrast models; and (5) annotates individual, overlapped, and differential peaks. It provides an easy-to-use user interface (UI) design for users to explore the data and modify the parameters interactively based on experimental purposes. ATAC-DEA allows biologists to generate user-friendly visual results from ATAC-seq downstream analysis.
Journal of Computational Biology; https://doi.org/10.1089/cmb.2023.29080.pre
Journal of Computational Biology; https://doi.org/10.1089/cmb.2022.0248
The objective of this article is to automatically segment organs at risk (OARs) for thoracic radiology in computed tomography (CT) scan images. The OARs in the thoracic anatomical region during radiotherapy treatment are mainly the neighbouring organs such as the esophagus, heart, trachea, and aorta. A dataset of 40 patients was used in the proposed work and was split into three parts: training, validation, and test sets. The implementation was performed on the Google Colab Pro+ framework with 52 GB of RAM and 265 GB of storage space. An ensemble model was developed for the automatic segmentation of four OARs in thoracic CT images. U-Net with InceptionV3 as the backbone was used, and different hyperparameters were used during the training of the model. The proposed model achieved accurate OAR segmentation, with an average Dice coefficient of 0.9413, Hausdorff value of 0.1838, sensitivity of 0.9783, and specificity of 0.9895 on the test dataset. An ensembled U-Net InceptionV3 model has been proposed, improving the segmentation results compared with state-of-the-art techniques such as U-Net, ResNet, and VGG16. The results of the experiments revealed that the proposed model effectively improved the performance of the segmentation of the esophagus, heart, trachea, and aorta.
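The overlap metrics reported above follow standard definitions, which can be sketched for a flattened binary mask as below. This is an illustrative calculation only: the function name `segmentation_metrics` and the toy masks are assumptions, the Hausdorff distance is omitted, and the article's exact evaluation protocol (per organ, per slice, or per volume) is not stated in the abstract.

```python
def segmentation_metrics(pred, truth):
    """Dice, sensitivity, and specificity for two equal-length binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)          # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)      # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)      # false negatives
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)  # true negatives
    # Degenerate cases (empty denominators) are reported as perfect agreement.
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    specificity = tn / (tn + fp) if (tn + fp) else 1.0
    return {"dice": dice, "sensitivity": sensitivity, "specificity": specificity}

# Toy flattened masks, invented for illustration (1 = organ voxel, 0 = background).
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0]
toy_truth = [1, 0, 0, 0, 1, 1, 0, 0]
toy_metrics = segmentation_metrics(toy_pred, toy_truth)
```

For multi-organ segmentation, one common convention is to compute these values per organ and then average them; whether the reported averages were obtained that way is not specified here.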
Journal of Computational Biology; https://doi.org/10.1089/cmb.2022.0390
This article continues the analysis of the recently observed phenomenon of local immunodeficiency (LI), which arises as a result of antigenic cooperation among intrahost viruses organized into a network of cross-immunoreactivity (CR). We study here what happens as the result of combining (connecting) the simplest CR networks that have a stable state of LI. It turns out that many possibilities occur, in particular a change of roles of some viruses in the CR network. Our results also give some indications about the boundary of the set of CR networks with a stable state of LI within the collection of all possible CR networks. Such borderline CR networks are characterized by only a marginally stable (neutral rather than stable) state of LI, or by the existence of subnetworks in a CR network that evolve independently of each other (despite being connected).