The SIRAH-CoV-2 Initiative: A Coarse-Grained Simulations' Dataset of the SARS-CoV-2 Proteome

Abstract
Over the last decades, the broad community of computational biophysicists and biochemists has developed computational tools to quickly test molecular hypotheses and to support, complement, and even substitute experimental data in a reliable and reproducible fashion. As a by-product of these advances, enormous amounts of data are being generated (1). Unfortunately, good practices for archiving, documenting, and sharing data have not kept pace with the formidable capacity to produce information. This often results in suboptimal use of effort and resources, leaving authentic “data treasures” undiscovered. It also leads to needless replication of work that is often needed only as input for further investigation rather than being an objective in itself (1). It is therefore increasingly important to make computational biophysics data publicly available, searchable, and downloadable, adhering to the “FAIR” principles (2). Among many others, the European community has advanced a large and coordinated initiative, the European Open Science Cloud (EOSC), which aims at sharing and re-using scientific content while increasing transparency and accountability. OpenAIRE is a socio-technical infrastructure for scholarly communication and Open Science (3). It offers data storage that ensures long-term preservation of relatively large datasets. Among others, the Zenodo database (4) provides a simple and fast upload system, with the possibility to immediately obtain a DOI for each dataset, including the option to update datasets separately. Early in 2020, the COVID-19 pandemic pervaded virtually all personal and scientific activities, with extensive lockdown regimes in most countries across the world. In response to this extraordinary context, the entire scientific community devoted massive efforts to studying SARS-CoV-2 at basic and applied levels.
The biocomputing community was no exception and showed a strong commitment, endorsed by hundreds of groups around the globe (5). Many researchers reoriented their priorities and responded swiftly to the emergency, providing fresh structural and dynamical perspectives on viral variability, drug targets, the effect of mutations, etc. (5). As a result, only a few months after the beginning of the pandemic, many data-sharing initiatives could be found scattered across different portals and repositories. In this context, our group undertook the initiative of running coarse-grained (CG) simulations of the SARS-CoV-2 proteins reported in the PDB, in the apo state, and sharing the raw data. Figure 1 shows the representative structures reported in the PDB up to October 30, 2020. We named this effort “The SIRAH-CoV-2 Initiative”; it was carried out in collaboration with the Uruguayan National Center for Supercomputing, ClusterUY (https://www.cluster.uy) (8). The raw data for the individual CG molecular dynamics (MD) simulations are available from the Zenodo database (9–28).

Figure 1. Schematic representation of the SARS-CoV-2 genome and associated proteins. All simulated proteins are shown as cartoons and colored according to their secondary structure following the standard VMD color code (6). Glycans are shown as sticks and colored according to the SNFG color scheme (7). The D614G mutation was introduced in the soluble domain of the wild type Spike protein (green asterisk).

Simulations were performed using the SIRAH force field 2.0 (29) running with the Amber18 suite (http://ambermd.org) at ClusterUY. Interaction parameters for bound divalent cations and glycans were reported by Klein et al. (30) and Garay et al. (31), respectively. Coordinates were downloaded from the PDB (PDB ids: 6VYO, 6W01, 6LU7, 6W02, 6W4B, 6M3M, 6W9C, 6W4H, 6W41, 6YHU, 6W37, 6WIQ, 7BTF, 6M17, 6VSB, 6M1V, 6XDC, 6ZSL, 6XEY, 6XR8).
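As a rough illustration of how the starting coordinates listed above could be collected, the sketch below builds one download URL per PDB id. The URL pattern is an assumption about the RCSB file service and is not stated in the source; the function name build_download_urls is hypothetical, and no network request is made.

```python
# PDB ids listed in the text for the simulated SARS-CoV-2 proteins.
PDB_IDS = [
    "6VYO", "6W01", "6LU7", "6W02", "6W4B", "6M3M", "6W9C", "6W4H",
    "6W41", "6YHU", "6W37", "6WIQ", "7BTF", "6M17", "6VSB", "6M1V",
    "6XDC", "6ZSL", "6XEY", "6XR8",
]

# Assumed URL pattern for coordinate files; not taken from the source.
RCSB_URL_TEMPLATE = "https://files.rcsb.org/download/{pdb_id}.pdb"


def build_download_urls(pdb_ids):
    """Return a dict mapping each PDB id to its (assumed) download URL."""
    return {
        pdb_id: RCSB_URL_TEMPLATE.format(pdb_id=pdb_id.upper())
        for pdb_id in pdb_ids
    }


if __name__ == "__main__":
    # Print the id/URL pairs; fetching the files is left to the reader.
    for pdb_id, url in build_download_urls(PDB_IDS).items():
        print(pdb_id, url)
```

The list holds 20 entries, one per structure, which matches the 20 Zenodo records cited as (9–28).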
Non-protein, non-glycan molecules and ions not coordinated by proteins were removed (e.g., water and molecules present in crystallization buffers). When deemed necessary, missing residues were reconstructed with ModLoop (32). The D614G mutation in the SARS-CoV-2 Spike protein was introduced on the wild type structure (PDB id: 6XR8) by simply deleting the side chain of Asp614 and renaming the residue to glycine. Only in this particular case, missing loops were completed using SWISS-MODEL at https://swissmodel.expasy.org. All structures were protonated using PDB2PQR (33) at pH 7. The orientation of the protein in PDB id 6XDC was set according to the OPM database (https://opm.phar.umich.edu/proteins/5172), with a pre-equilibrated patch containing POPC, POPE, and POPS phospholipids in a 1:2:1 ratio according to the experimental data (34). Interaction parameters for lipids were taken from Barrera et al. (35). The glycosylation trees were added or completed (in PDB ids 6VSB and 6XEY) using the Glycan Modeler & Reader utility from CHARMM-GUI (36). All parameters are available for download from the SIRAH force field web page (http://www.sirahff.com). Protonated structures were mapped to CG with SIRAH Tools (37). Solutes were centered in an octahedral box filled with pre-equilibrated SIRAH CG water molecules, named WT4 (38). An ionic strength of 0.15 M was set by randomly replacing WT4 molecules with Na+ and Cl− CG ions (39). Since SIRAH uses a Hamiltonian common to any atomistic MD simulation, the 6–12 terms used to treat Lennard-Jones interactions might lead to large repulsions if initial structures suffer from clashes. Because of this, gentle initialization protocols aimed at resolving steric clashes are strongly recommended. The simulation protocol consisted of: 1) Solvent and side chains relaxation by 5,000 steps of energy minimization, imposing positional restraints of 2.4 kcal mol−1 Å−2 on backbone beads corresponding to the nitrogen and carbonyl oxygen (named GN and GO,...
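The remark above about steric clashes can be made concrete with the standard 12-6 Lennard-Jones form, E(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The sketch below evaluates it at three separations. The ε and σ values are hypothetical placeholders chosen only for illustration; they are not SIRAH parameters, and the function name is likewise an assumption.

```python
def lennard_jones_12_6(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for a pair of particles at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Hypothetical placeholder parameters (NOT taken from the SIRAH force field).
EPSILON = 0.5  # well depth, kcal/mol
SIGMA = 4.0    # distance at which the energy crosses zero, in angstrom

if __name__ == "__main__":
    r_min = 2 ** (1 / 6) * SIGMA  # location of the energy minimum
    for label, r in [
        ("minimum", r_min),
        ("zero crossing", SIGMA),
        ("clash at half sigma", 0.5 * SIGMA),
    ]:
        energy = lennard_jones_12_6(r, EPSILON, SIGMA)
        print(f"{label:>20s}: r = {r:5.2f} A, E = {energy:10.2f} kcal/mol")
```

Halving the separation from σ to 0.5σ raises the energy from zero to 16128 ε, because the repulsive term scales with the twelfth power of σ/r. This steep growth is why overlapping beads in a starting structure are relaxed by restrained minimization before any dynamics are run.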
