Results in Journal Scientific Data: 1,786

Published: 14 May 2021
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00913-y

Abstract:
Micro-CT provides critical data for musculoskeletal research, yielding three-dimensional datasets containing distributions of mineral density. Using high-resolution scans, we quantified changes in the fine architecture of bone in the spine of young mice. This data is made available as a reference for physiological cancellous bone growth. The scans (n = 19) depict the extensive structural changes typical for female C57BL/6 mouse pups, aged 1, 3, 7, 10 and 14 days post-partum, as they attain the mature geometry. We reveal the micro-morphology down to individual trabeculae in the spine that follow phases of mineral-tissue rearrangement in the growing lumbar vertebra on a micrometer length scale. Phantom data is provided to facilitate mineral density calibration. Conventional histomorphometry matched with our micro-CT data on selected samples confirms the validity and accuracy of our 3D scans. The data may thus serve as a reference for modeling normal bone growth and can be used to benchmark other experiments assessing the effects of biomaterials, tissue growth, healing, and regeneration.
, , Leyden Fernandez, Gaëtan Martin, Gustavo A. Martinez-Rodriguez, Jatta Saarenheimo, , ,
Published: 14 May 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00910-1

Abstract:
Stratified lakes and ponds featuring steep oxygen gradients are significant net sources of greenhouse gases and hotspots in the carbon cycle. Despite their significant biogeochemical roles, the microbial communities, especially in the oxygen depleted compartments, are poorly known. Here, we present a comprehensive dataset including 267 shotgun metagenomes from 41 stratified lakes and ponds mainly located in the boreal and subarctic regions, but also including one tropical reservoir and one temperate lake. For most lakes and ponds, the data includes a vertical sample set spanning from the oxic surface to the anoxic bottom layer. The majority of the samples were collected during the open water period, but also a total of 29 samples were collected from under the ice. In addition to the metagenomic sequences, the dataset includes environmental variables for the samples, such as oxygen, nutrient and organic carbon concentrations. The dataset is ideal for further exploring the microbial taxonomic and functional diversity in freshwater environments and potential climate change impacts on the functioning of these ecosystems.
Kai Furuya, Tao Wu, Ai Orimoto, Eriko Sugano, , , Takahiro Kurose, Yoshihiro Takai,
Published: 7 May 2021
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00908-9

Abstract:
Cellular immortalization enables indefinite expansion of cultured cells. However, the process of cell immortalization sometimes changes the original nature of primary cells. In this study, we performed expression profiling of poly A-tailed RNA from primary and immortalized corneal epithelial cells expressing Simian virus 40 large T antigen (SV40) or the combination of mutant cyclin-dependent kinase 4 (CDK4), cyclin D1, and telomere reverse transcriptase (TERT). Furthermore, we studied the expression profile of SV40 cells cultured in medium with or without serum. The profiling of whole expression pattern revealed that immortalized corneal epithelial cells with SV40 showed a distinct expression pattern from wild-type cells regardless of the presence or absence of serum, while corneal epithelial cells with combinatorial expression showed an expression pattern relatively closer to that of wild-type cells.
Tashina Petersson, , , Marta Antonelli, Katarzyna Dembska, , Alessandra Varotto,
Published: 7 May 2021
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00909-8

Abstract:
Informing and engaging citizens to adopt sustainable diets is a key strategy for reducing global environmental impacts of the agricultural and food sectors. In this respect, the first requisite to support citizens and actors of the food sector is to provide them with a publicly available, reliable and ready-to-use synthesis of environmental pressures associated with food commodities. Here we introduce the SU-EATABLE LIFE database, a multilevel database of carbon (CF) and water (WF) footprint values of food commodities, based on a standardized methodology to extract information and assign optimal footprint values and uncertainties to food items, starting from peer-reviewed articles and grey literature. The database and its innovative methodological framework for uncertainty treatment and data quality assurance provide a solid basis for evaluating the impact of dietary shifts on global environmental policies, including climate mitigation through greenhouse gas emission reductions. The database ensures repeatability and further expansion, providing a reliable science-based tool for managers and researchers in the food sector.
Correction
, , Guido Lemoine, , , Ian McCallum, Hadi, Florian Kraxner, Frédéric Achard, Steffen Fritz
Published: 5 May 2021
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00917-8

Correction
, , , Marwan Cheikh Albassatneh, Juan Arroyo, Gianluigi Bacchetta, Francesca Bagnoli, Zoltán Barina, Manuel Cartereau, , et al.
Published: 4 May 2021
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00911-0

, Thomas Wahl
Published: 4 May 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00906-x

Abstract:
Storm surges are among the deadliest coastal hazards and understanding how they have been affected by climate change and variability in the past is crucial to prepare for the future. However, tide gauge records are often too short to assess trends and perform robust statistical analyses. Here we use a data-driven modeling framework to simulate daily maximum surge values at 882 tide gauge locations across the globe. We use five different atmospheric reanalysis products for the storm surge reconstruction, the longest one going as far back as 1836. The data that we generate can be used, for example, for long-term trend analyses of the storm surge climate and identification of regions where changes in the intensity and/or frequency of storms surges have occurred in the past. It also provides a better basis for robust extreme value analysis, especially for tide gauges where observational records are short. The data are made available for public use through an interactive web-map as well as a public data repository.
Published: 4 May 2021
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00905-y

Abstract:
Here, we describe a dataset with information about monogenic, rare diseases with a known genetic background, supplemented with manually extracted provenance for the disease itself and the discovery of the underlying genetic cause. We assembled a collection of 4166 rare monogenic diseases and linked them to 3163 causative genes, annotated with OMIM and Ensembl identifiers and HGNC symbols. The PubMed identifiers of the scientific publications that first described the rare diseases, and of the publications that found the genes causing the diseases, were added using information from OMIM, PubMed, Wikipedia, whonamedit.com, and Google Scholar. The data are available under a CC0 license as a spreadsheet and as RDF in a semantic model modified from DisGeNET, and were added to Wikidata. This dataset relies on publicly available data and publications with a PubMed identifier, but by our effort to make the data interoperable and linked, we can now analyse this data. Our analysis revealed the timeline of rare disease and causative gene discovery and links them to developments in methods.
, Milan Kilibarda, Dragutin Protić,
Published: 30 April 2021
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00901-2

Abstract:
We produced the first daily gridded meteorological dataset at a 1-km spatial resolution across Serbia for 2000–2019, named MeteoSerbia1km. The dataset consists of five daily variables: maximum, minimum and mean temperature, mean sea-level pressure, and total precipitation. In addition to daily summaries, we produced monthly and annual summaries, and daily, monthly, and annual long-term means. Daily gridded data were interpolated using the Random Forest Spatial Interpolation methodology, based on using the nearest observations and distances to them as spatial covariates, together with environmental covariates to make a random forest model. The accuracy of the MeteoSerbia1km daily dataset was assessed using nested 5-fold leave-location-out cross-validation. All temperature variables and sea-level pressure showed high accuracy, although accuracy was lower for total precipitation, due to the discontinuity in its spatial distribution. MeteoSerbia1km was also compared with the E-OBS dataset with a coarser resolution: both datasets showed similar coarse-scale patterns for all daily meteorological variables, except for total precipitation. As a result of its high resolution, MeteoSerbia1km is suitable for further environmental analyses.
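The two ideas named in the abstract above, nearest observations with their distances as spatial covariates and leave-location-out cross-validation, can be illustrated with a minimal sketch. This is not the MeteoSerbia1km code: the function names, the station tuple layout, and the choice of three neighbours are all hypothetical, the random forest itself is omitted, and only a single (non-nested) level of folds is shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: (station_id, x_km, y_km, observed_value)
Station = Tuple[str, float, float, float]


def nearest_neighbor_features(
    target_xy: Tuple[float, float],
    stations: Sequence[Station],
    n_nearest: int = 3,
) -> List[float]:
    """Return [value_1, dist_1, ..., value_n, dist_n] for the n nearest stations.

    When building features for a training station, pass a station list
    that excludes the station itself, otherwise its own value leaks in.
    """
    tx, ty = target_xy
    ranked = sorted(
        ((math.hypot(sx - tx, sy - ty), value) for _, sx, sy, value in stations),
        key=lambda pair: pair[0],
    )
    features: List[float] = []
    for dist, value in ranked[:n_nearest]:
        features.extend([value, dist])
    return features


def leave_location_out_folds(
    station_ids: Sequence[str], n_folds: int = 5
) -> Dict[str, int]:
    """Assign each station (not each daily record) to exactly one fold,
    so all records from a location are held out together."""
    unique_ids = sorted(set(station_ids))
    return {sid: i % n_folds for i, sid in enumerate(unique_ids)}
```

The feature vector would be concatenated with environmental covariates before fitting a model; holding out whole locations, rather than individual days, is what makes the accuracy estimate reflect prediction at unobserved sites.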
Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, Moezedin Javad Rafiee, , Faranak Babaki Fard, Kaveh Samimi, Konstantinos N. Plataniotis,
Published: 29 April 2021
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00900-3

Abstract:
Novel Coronavirus (COVID-19) has drastically overwhelmed more than 200 countries, affecting millions and claiming almost 2 million lives since its emergence in late 2019. This highly contagious disease can easily spread and, if not controlled in a timely fashion, can rapidly incapacitate healthcare systems. The current standard diagnosis method, the Reverse Transcription Polymerase Chain Reaction (RT-PCR), is time consuming and subject to low sensitivity. Chest Radiograph (CXR), the first imaging modality to be used, is readily available and gives immediate results. However, it has notoriously lower sensitivity than Computed Tomography (CT), which can be used efficiently to complement other diagnostic methods. This paper introduces a new COVID-19 CT scan dataset, referred to as COVID-CT-MD, consisting of not only COVID-19 cases, but also healthy participants and participants infected by Community Acquired Pneumonia (CAP). The COVID-CT-MD dataset, which is accompanied by lobe-level, slice-level and patient-level labels, has the potential to facilitate COVID-19 research; in particular, COVID-CT-MD can assist in the development of advanced Machine Learning (ML) and Deep Neural Network (DNN) based solutions.
Dheeraj Rathee, , Sujit Roy,
Published: 29 April 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00899-7

Abstract:
Recent advancements in magnetoencephalography (MEG)-based brain-computer interfaces (BCIs) have shown great potential. However, the performance of current MEG-BCI systems is still inadequate, and one of the main reasons for this is the unavailability of open-source MEG-BCI datasets. MEG systems are expensive and hence MEG datasets are not readily available for researchers to develop effective and efficient BCI-related signal processing algorithms. In this work, we release a 306-channel MEG-BCI dataset recorded at a 1 kHz sampling frequency during four mental imagery tasks (i.e. hand imagery, feet imagery, subtraction imagery, and word generation imagery). The dataset contains two sessions of MEG recordings performed on separate days from 17 healthy participants using a typical BCI imagery paradigm. To our knowledge, the current dataset will be the only publicly available MEG imagery BCI dataset. The dataset can be used by the scientific community towards the development of novel pattern recognition machine learning methods to detect brain activities related to motor imagery and cognitive imagery tasks using MEG signals.
, Zijing Dong, , Congyu Liao, Qiuyun Fan, W. Scott Hoge, Boris Keil, , Lawrence L. Wald, , et al.
Published: 29 April 2021
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00904-z

Abstract:
We present a whole-brain in vivo diffusion MRI (dMRI) dataset acquired at 760 μm isotropic resolution and sampled at 1260 q-space points across 9 two-hour sessions on a single healthy participant. The creation of this benchmark dataset is possible through the synergistic use of advanced acquisition hardware and software including the high-gradient-strength Connectom scanner, a custom-built 64-channel phased-array coil, a personalized motion-robust head stabilizer, a recently developed SNR-efficient dMRI acquisition method, and parallel imaging reconstruction with advanced ghost reduction algorithm. With its unprecedented resolution, SNR and image quality, we envision that this dataset will have a broad range of investigational, educational, and clinical applications that will advance the understanding of human brain structures and connectivity. This comprehensive dataset can also be used as a test bed for new modeling, sub-sampling strategies, denoising and processing algorithms, potentially providing a common testing platform for further development of in vivo high resolution dMRI techniques. Whole brain anatomical T1-weighted and T2-weighted images at submillimeter scale along with field maps are also made available.
Uxue Ulanga, Matthew Russell, Stefano Patassini, Julie Brazzatti, Ciaren Graham, Anthony D. Whetton, Robert L. J. Graham
Published: 26 April 2021
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00896-w

Abstract:
Murine models are amongst the most widely used systems to study biology and pathology. Targeted quantitative proteomic analysis is a relatively new tool to interrogate such systems. Recently the need for relative quantification on hundreds to thousands of samples has driven the development of Data Independent Acquisition methods. One such technique is SWATH-MS, which in the main requires prior acquisition of mass spectra to generate an assay reference library. In stem cell research, it has been shown that pluripotency can be induced starting with a fibroblast population. In so doing, major changes in expressed proteins are inevitable. Here we have created a reference library to underpin such studies. This is inclusive of an extensively documented script to enable replication of library generation from the raw data. The documented script facilitates reuse of data and adaptation of the library to novel applications. The resulting library provides deep coverage of the mouse proteome. The library covers 29519 proteins (53% of the proteome) of which 7435 (13%) are supported by a proteotypic peptide.
Published: 23 April 2021
Scientific Data, Volume 8, pp 1-14; doi:10.1038/s41597-021-00890-2

Abstract:
Using 11 proteomics datasets, mostly available through the PRIDE database, we assembled a reference expression map for 191 cancer cell lines and 246 clinical tumour samples, across 13 lineages. We found unique peptides identified only in tumour samples despite a much higher coverage in cell lines. These were mainly mapped to proteins related to regulation of signalling receptor activity. Correlations between baseline expression in cell lines and tumours were calculated. We found these to be highly similar across all samples with most similarity found within a given sample type. Integration of proteomics and transcriptomics data showed median correlation across cell lines to be 0.58 (range between 0.43 and 0.66). Additionally, in agreement with previous studies, variation in mRNA levels was often a poor predictor of changes in protein abundance. To our knowledge, this work constitutes the first meta-analysis focusing on cancer-related public proteomics datasets. We therefore also highlight shortcomings and limitations of such studies. All data is available through PRIDE dataset identifier PXD013455 and in Expression Atlas.
Published: 23 April 2021
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00898-8

Abstract:
A critical shortage of ‘big’ agronomic data is placing an unnecessary constraint on the conduct of public agronomic research, imparting barriers to model development and testing. Here, we address this problem by providing a large non-relational database of agronomic trials, linked to intensive management and observational data, run under a unified experimental framework. The National Variety Trials (NVTs) represent a decade-long experimental trial network, conducted across thousands of Australian field sites using highly standardised randomised controlled designs. The NVTs contain over a million machine-measured phenotypic observations, aggregated from density-controlled populations containing hundreds of millions of plants and thousands of released plant varieties. These data are linked to hundreds of thousands of metadata observations including standardised soil tests, fertiliser and pesticide input data, crop rotation data, prior farm management practices, and in-field sensors. Finally, these data are linked to a suite of ground and remote sensing observations, arranged into interpolated daily- and ten-day aggregated time series, to capture the substantial diversity in vegetation and environmental patterns across the continent-spanning NVT network.
Published: 23 April 2021
Scientific Data, Volume 8, pp 1-11; doi:10.1038/s41597-021-00897-9

Abstract:
Human settlements are usually nucleated around manmade central points or distinctive natural features, forming clusters that vary in shape and size. However, population distribution in geo-sciences is often represented in the form of pixelated rasters. Rasters indicate population density at predefined spatial resolutions, but are unable to capture the actual shape or size of settlements. Here we suggest a methodology that translates high-resolution raster population data into vector-based population clusters. We use open-source data and develop an open-access algorithm tailored for low and middle-income countries with data scarcity issues. Each cluster includes unique characteristics indicating population, electrification rate and urban-rural categorization. Results are validated against national electrification rates provided by the World Bank and data from selected Demographic and Health Surveys (DHS). We find that our modeled national electrification rates are consistent with the rates reported by the World Bank, while the modeled urban/rural classification has 88% accuracy. By delineating settlements, this dataset can complement existing raster population data in studies such as energy planning, urban planning and disease response.
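The core step described in the abstract above, turning a population raster into discrete clusters, can be illustrated with a connected-component pass over populated cells. This is a minimal sketch, not the authors' open-access algorithm: the function name `population_clusters`, the 4-neighbour connectivity rule, and the `min_pop` threshold are assumptions chosen for illustration, and no polygon geometry, electrification rate, or urban-rural label is produced.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def population_clusters(
    raster: Sequence[Sequence[float]], min_pop: float = 0.0
) -> List[Dict[str, object]]:
    """Group 4-connected cells whose population exceeds min_pop.

    Returns one summary per cluster with its member cells (row, col)
    and total population.
    """
    n_rows = len(raster)
    n_cols = len(raster[0]) if n_rows else 0
    seen = [[False] * n_cols for _ in range(n_rows)]
    clusters: List[Dict[str, object]] = []

    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or raster[r][c] <= min_pop:
                continue
            # Breadth-first flood fill from this unvisited populated cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells: List[Tuple[int, int]] = []
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (
                        0 <= nr < n_rows
                        and 0 <= nc < n_cols
                        and not seen[nr][nc]
                        and raster[nr][nc] > min_pop
                    ):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            clusters.append(
                {"cells": cells, "population": sum(raster[i][j] for i, j in cells)}
            )
    return clusters
```

Each cluster's cell list could then be converted to a polygon outline and enriched with per-cluster attributes; the sketch stops at the grouping step because that is the part the abstract describes.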
Correction
, , Camilla Ceccarani, Emily Fontana, Luigi A. Amoretti, Roberta J. Wright, , , Ying Taur, Miguel-Angel Perales, et al.
Published: 23 April 2021
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00903-0

Abstract:
A Correction to this paper has been published: https://doi.org/10.1038/s41597-021-00903-0.
, Undiagnosed Diseases Network, , Erika M. Zink, , Kent J. Bloodsworth, , , , , et al.
Published: 21 April 2021
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00894-y

Abstract:
Every year individuals experience symptoms that remain undiagnosed by healthcare providers. In the United States, a rare disease is defined as a condition that affects fewer than 200,000 individuals. However, there are an estimated 7000 rare diseases, and there are an estimated 25–30 million Americans in total (7.6–9.2% of the population as of 2018) affected by such disorders. The NIH Common Fund Undiagnosed Diseases Network (UDN) seeks to provide diagnoses for individuals with undiagnosed disease. Mass spectrometry-based metabolomics and lipidomics analyses could advance the collective understanding of individual symptoms and advance diagnoses for individuals with heretofore undiagnosed disease. Here, we report the mass spectrometry-based metabolomics and lipidomics analyses of blood plasma, urine, and cerebrospinal fluid from 148 patients within the UDN and their families, as well as from a reference population of over 100 individuals with no known metabolic diseases. The raw and processed data are available to the research community so that they might be useful in the diagnoses of current or future patients suffering from undiagnosed disorders.
Tae Woong Whon, Seung Woo Ahn, Sungjin Yang, Joon Yong Kim, Yeon Bee Kim, Yujin Kim, Ji-Man Hong, Hojin Jung, Yoon-E Choi, , et al.
Published: 20 April 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00895-x

Abstract:
ODFM is a data management system that integrates comprehensive omics information for microorganisms associated with various fermented foods, additive ingredients, and seasonings (e.g. kimchi, Korean fermented vegetables, fermented seafood, solar salt, soybean paste, vinegar, beer, cheese, sake, and yogurt). The ODFM archives genome, metagenome, metataxonome, and (meta)transcriptome sequences of fermented food-associated bacteria, archaea, eukaryotic microorganisms, and viruses; 131 bacterial, 38 archaeal, and 28 eukaryotic genomes are now available to users. The ODFM provides both the Basic Local Alignment Search Tool search-based local alignment function as well as average nucleotide identity-based genetic relatedness measurement, enabling gene diversity and taxonomic analyses of an input query against the database. Genome sequences and annotation results of microorganisms are directly downloadable, and the microbial strains registered in the archive library will be available from our culture collection of fermented food-associated microorganisms. The ODFM is a comprehensive database that covers the genomes of an entire microbiome within a specific food ecosystem, providing basic information to evaluate microbial isolates as candidate fermentation starters for fermented food production.
, Ellen M. Considine, Melissa M. Maestas, Gina Li
Published: 19 April 2021
Scientific Data, Volume 8, pp 1-15; doi:10.1038/s41597-021-00891-1

Abstract:
We created daily concentration estimates for fine particulate matter (PM2.5) at the centroids of each county, ZIP code, and census tract across the western US from 2008 to 2018. These estimates are predictions from ensemble machine learning models trained on 24-hour PM2.5 measurements from monitoring station data across 11 states in the western US. Predictor variables were derived from satellite, land cover, chemical transport model (just for the 2008–2016 model), and meteorological data. Ten-fold spatial and random cross-validation (CV) R² values were 0.66 and 0.73, respectively, for the 2008–2016 model and 0.58 and 0.72, respectively, for the 2008–2018 model. Comparing areal predictions to nearby monitored observations demonstrated an overall R² of 0.70 for the 2008–2016 model and 0.58 for the 2008–2018 model, but we observed higher R² (>0.80) in many urban areas. These data can be used to understand spatiotemporal patterns of, exposures to, and health impacts of PM2.5 in the western US, where PM2.5 levels have been heavily impacted by wildfire smoke over this time period.
, Edit Herczog, , Keith Russell,
Published: 16 April 2021
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00892-0

Abstract:
As big data, open data, and open science advance to increase access to complex and large datasets for innovation, discovery, and decision-making, Indigenous Peoples’ rights to control and access their data within these data environments remain limited. Operationalizing the FAIR Principles for scientific data with the CARE Principles for Indigenous Data Governance enhances machine actionability and brings people and purpose to the fore to resolve Indigenous Peoples’ rights to and interests in their data across the data lifecycle.
, Roberta Bardelli,
Published: 16 April 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00888-w

Abstract:
The Atlantic blue crab Callinectes sapidus is a portunid native to the western Atlantic, from New England to Uruguay. The species was introduced in Europe in 1901 where it has become invasive; additionally, a significant northward expansion has been emphasized in its native range. Here we present a harmonized global compilation of C. sapidus occurrences from native and non-native distribution ranges derived from online databases (GBIF, BISON, OBIS, and iNaturalist) as well as from unpublished and published sources. The dataset consists of 40,388 geo-referenced occurrences, 39,824 from native and 564 from non-native ranges, recorded in 53 countries. The implementation of quality controls imposed a severe reduction, in particular from online databases, of the records selected for inclusion in the dataset. In addition, a technical validation procedure was used to flag entries showing identical coordinates but different year of record, in-land occurrences and those located close to the coast. Similarly, a flagging system identified entries outside the known distribution of the species, or associated with unsuccessful introductions.
Published: 16 April 2021
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00893-z

Abstract:
Deep learning approaches for tomographic image reconstruction have become very effective and have been demonstrated to be competitive in the field. Comparing these approaches is a challenging task as they rely to a great extent on the data and setup used for training. With the Low-Dose Parallel Beam (LoDoPaB)-CT dataset, we provide a comprehensive, open-access database of computed tomography images and simulated low photon count measurements. It is suitable for training and comparing deep learning methods as well as classical reconstruction approaches. The dataset contains over 40000 scan slices from around 800 patients selected from the LIDC/IDRI database. The data selection and simulation setup are described in detail, and the generating script is publicly accessible. In addition, we provide a Python library for simplified access to the dataset and an online reconstruction challenge. Furthermore, the dataset can also be used for transfer learning as well as sparse and limited-angle reconstruction scenarios.
Published: 16 April 2021
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00889-9

Abstract:
Detailed descriptions of microbial communities have lagged far behind physical and chemical measurements in the marine environment. Here, we present 971 globally distributed surface ocean metagenomes collected at high spatio-temporal resolution. Our low-cost metagenomic sequencing protocol produced 3.65 terabases of data, where the median number of base pairs per sample was 3.41 billion. The median distance between sampling stations was 26 km. The metagenomic libraries described here were collected as a part of a biological initiative for the Global Ocean Ship-based Hydrographic Investigations Program, or “Bio-GO-SHIP.” One of the primary aims of GO-SHIP is to produce high spatial and vertical resolution measurements of key state variables to directly quantify climate change impacts on ocean environments. By similarly collecting marine metagenomes at high spatiotemporal resolution, we expect that this dataset will help answer questions about the link between microbial communities and biogeochemical fluxes in a changing ocean.
, Eric C. Fields, Elizabeth A. Kensinger
Published: 16 April 2021
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00886-y

Abstract:
While there was a necessary initial focus on physical health consequences of the COVID-19 pandemic, it is becoming increasingly clear that many have experienced significant social and mental health repercussions as well. It is important to understand the effects of the pandemic on well-being, both as the world continues to recover from the lasting impact of COVID-19 and in the eventual case of future pandemics. On March 20, 2020, we launched an online daily survey study tracking participants’ sleep and mental well-being. Repeated reports of sleep and mental health metrics were collected from participants ages 18–90 during the initial wave of the pandemic (March 20 – June 23, 2020). Given both the comprehensive nature and early start of this assessment, open access to this dataset will allow researchers to answer a range of questions regarding the psychiatric impact of the COVID-19 pandemic and the fallout left in its wake.
, Roberta Bottarin
Published: 13 April 2021
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00887-x

Abstract:
The present dataset contains information about aquatic macroinvertebrates and environmental variables collected before and after the implementation of a small “run-of-river” hydropower plant on the Saldur stream, a glacier-fed stream located in the Italian Central-Eastern Alps. Between 2015 and 2019, with two sampling events per year, we collected and identified 34,836 organisms in 6 sampling sites located within a 6 km stretch of the stream. Given the current boom of the hydropower sector worldwide, and the growing contribution of small hydropower plants to energy production, the data included here may represent an important – and long advocated – baseline to assess the effects that these kinds of power plants have on the riverine ecosystem. Moreover, since the Saldur stream is part of the International Long Term Ecological Research network, this dataset also constitutes part of the data gathered within this research programme. All samples are preserved at Eurac Research facilities.
Kevin F. Garrity,
Published: 13 April 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00885-z

Abstract:
Wannier tight-binding Hamiltonians (WTBH) provide a computationally efficient way to predict electronic properties of materials. In this work, we develop a computational workflow for high-throughput Wannierization of density functional theory (DFT) based electronic band structure calculations. We apply this workflow to 1771 materials (1406 3D and 365 2D), and we create a database with the resulting WTBHs. We evaluate the accuracy of the WTBHs by comparing the Wannier band structures to directly calculated spin-orbit coupling DFT band structures. Our testing includes k-points outside the grid used in the Wannierization, providing an out-of-sample test of accuracy. We illustrate the use of WTBHs with a few example applications. We also develop a web-app that can be used to predict electronic properties on-the-fly using WTBH from our database. The tools to generate the Hamiltonian and the database of the WTB parameters are made publicly available through the websites https://github.com/usnistgov/jarvis and https://jarvis.nist.gov/jarviswtb.
, Martin Wagner, , Annika Reinke, Sebastian Bodenstedt, Peter M. Full, Hellena Hempe, Diana Mindroc-Filimon, Patrick Scholz, Thuy Nuong Tran, et al.
Published: 12 April 2021
Scientific Data, Volume 8, pp 1-11; doi:10.1038/s41597-021-00882-2

Abstract:
Image-based tracking of medical instruments is an integral part of surgical data science applications. Previous research has addressed the tasks of detecting, segmenting and tracking medical instruments based on laparoscopic video data. However, the proposed methods still tend to fail when applied to challenging images and do not generalize well to data they have not been trained on. This paper introduces the Heidelberg Colorectal (HeiCo) data set - the first publicly available data set enabling comprehensive benchmarking of medical instrument detection and segmentation algorithms with a specific emphasis on method robustness and generalization capabilities. Our data set comprises 30 laparoscopic videos and corresponding sensor data from medical devices in the operating room for three different types of laparoscopic surgery. Annotations include surgical phase labels for all video frames as well as information on instrument presence and corresponding instance-wise segmentation masks for surgical instruments (if any) in more than 10,000 individual frames. The data has successfully been used to organize international competitions within the Endoscopic Vision Challenges 2017 and 2019.
Prangwan Pateetin, Gyorgy Hutvagner, Sarah Bajan, Matthew P. Padula, Eileen M. McGowan,
Published: 12 April 2021
Scientific Data, Volume 8, pp 1-11; doi:10.1038/s41597-021-00884-0

Abstract:
Progesterone receptor (PR) isoforms, PRA and PRB, act in a progesterone-independent and -dependent manner to differentially modulate the biology of breast cancer cells. Here we show that the differences in PRA and PRB structure facilitate the binding of common and distinct protein interacting partners affecting the downstream signaling events of each PR-isoform. Tet-inducible HA-tagged PRA or HA-tagged PRB constructs were expressed in T47DC42 (PR/ER negative) breast cancer cells. Affinity purification coupled with stable isotope labeling of amino acids in cell culture (SILAC) mass spectrometry was performed to comprehensively study PRA and PRB interacting partners in both unliganded and liganded conditions. To validate our findings, we applied both forward and reverse SILAC conditions to effectively minimize experimental errors. These datasets will be useful in investigating PRA- and PRB-specific molecular mechanisms and as a database for subsequent experiments to identify novel PRA and PRB interacting proteins that differentially mediate biological functions in breast cancer.
Published: 12 April 2021
Scientific Data, Volume 8, pp 1-11; doi:10.1038/s41597-021-00881-3

Abstract:
Understanding the interrelation of lower limb kinematic, kinetic, and electromyography (EMG) data at controlled speeds is challenging for fully assessing human locomotion conditions. This paper provides a complete dataset with the above-mentioned raw and processed data simultaneously recorded for sixteen healthy participants walking on a 10-meter flat surface at seven controlled speeds (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, and 4.0 km/h). The raw data include 3D joint trajectories of 24 retro-reflective markers, ground reaction forces (GRF), force plate moments, centers of pressure, and EMG signals from Tibialis Anterior, Gastrocnemius Lateralis, Biceps Femoris, and Vastus Lateralis. The processed data present gait cycle-normalized data including filtered EMG signals and their envelope, 3D GRF, joint angles, and torques. This study details the experimental setup and presents a brief validation of the data quality. The presented dataset may (i) help validate and enhance human biomechanical gait models, and (ii) serve as a reference trajectory for personalized control of robotic assistive devices, aiming at an adequate assistance level adjusted to the gait speed and user’s anthropometry.
, Ajay Singh Nagpure,
Published: 12 April 2021
Scientific Data, Volume 8, pp 1-13; doi:10.1038/s41597-021-00853-7

Abstract:
India is the third-largest contributor to global energy-use and anthropogenic carbon emissions. India’s urban energy transitions are critical to meet its climate goals due to the country’s rapid urbanization. However, no baseline urban energy-use dataset covers all Indian urban districts in ways that align with national totals and integrate social-economic-infrastructural attributes to inform such transitions. This paper develops a novel bottom-up plus top-down approach, comprehensively integrating multiple field surveys and utilizing machine learning, to model All Urban areas’ Energy-use (AllUrE) across all 640 districts in India, merged with social-economic-infrastructural data. Energy use estimates in this AllUrE-India dataset are evaluated by comparing with reported energy-use at three scales: nation-wide, state-wide, and city-level. Spatially granular AllUrE data aggregated nationally show good agreement with national totals (<2% difference). The goodness-of-fit ranged from 0.78–0.95 for comparison with state-level totals, and 0.90–0.99 with city-level data for different sectors. The relatively strong alignment at all three spatial scales demonstrates the value of AllUrE-India data for modelling urban energy transitions consistent with national energy and climate goals.
Shahzad Ahmed, Dingyang Wang, Junyoung Park,
Published: 12 April 2021
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00876-0

Abstract:
In the past few decades, deep learning algorithms have become more prevalent for signal detection and classification. To design machine learning algorithms, however, an adequate dataset is required. Motivated by the existence of several open-source camera-based hand gesture datasets, this descriptor presents UWB-Gestures, the first public dataset of twelve dynamic hand gestures acquired with ultra-wideband (UWB) impulse radars. The dataset contains a total of 9,600 samples gathered from eight different human volunteers. UWB-Gestures eliminates the need to employ UWB radar hardware to train and test the algorithm. Additionally, the dataset can provide a competitive environment for the research community to compare the accuracy of different hand gesture recognition (HGR) algorithms, enabling the provision of reproducible research results in the field of HGR through UWB radars. Three radars were placed at three different locations to acquire the data, and the respective data were saved independently for flexibility.
, Cristiana Y. Antonino, Farallon J. Broughton, Lauren N. Dykman, Armand M. Kuris, Kevin D. Lafferty
Published: 8 April 2021
Scientific Data, Volume 8, pp 1-14; doi:10.1038/s41597-021-00880-4

Abstract:
We built a high-resolution topological food web for the kelp forests of the Santa Barbara Channel, California, USA that includes parasites and significantly improves resolution compared to previous webs. The 1,098 nodes and 21,956 links in the web describe an economically, socially, and ecologically vital system. Nodes are broken into life-stages, with 549 free-living life-stages (492 species from 21 Phyla) and 549 parasitic life-stages (450 species from 10 Phyla). Links represent three kinds of trophic interactions, with 9,352 predator-prey links, 2,733 parasite-host links and 9,871 predator-parasite links. All decisions for including nodes and links are documented, and extensive metadata in the node list allows users to filter the node list to suit their research questions. The kelp-forest food web is more species-rich than any other published food web with parasites, and it has the largest proportion of parasites. Our food web may be used to predict how kelp forests respond to change, to advance our understanding of parasites in ecosystems, and to foster development of theory that incorporates large networks.
, , Jake Andrae, Nina Welti, Greg R. Guerin, Emrys Leitch, Tony Hall, Steve Szarvas, Rachel Atkins, Stefan Caddy-Retalic, et al.
Published: 1 April 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00877-z

Abstract:
The photosynthetic pathway of plants is a fundamental trait that influences terrestrial environments from the local to global level. The distribution of different photosynthetic pathways in Australia is expected to undergo a substantial shift due to climate change and rising atmospheric CO2; however, tracking change is hindered by a lack of data on the pathways of species, as well as their distribution and relative cover within plant communities. Here we present the photosynthetic pathways for 2428 species recorded across 541 plots surveyed by Australia’s Terrestrial Ecosystem Research Network (TERN) between 2011 and 2017. This dataset was created to facilitate research exploring trends in vegetation change across Australia. Species were assigned a photosynthetic pathway using published literature and stable carbon isotope analysis of bulk tissue. The photosynthetic pathway of species can be extracted from the dataset individually, or used in conjunction with vegetation surveys to study the occurrence and abundance of pathways across the continent. This dataset will be updated as TERN’s plot network expands and new information becomes available.
James R. Stieger, Stephen A. Engel,
Published: 1 April 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00883-1

Abstract:
Brain computer interfaces (BCIs) are valuable tools that expand the nature of communication through bypassing traditional neuromuscular pathways. The non-invasive, intuitive, and continuous nature of sensorimotor rhythm (SMR) based BCIs enables individuals to control computers, robotic arms, wheelchairs, and even drones by decoding motor imagination from electroencephalography (EEG). Large and uniform datasets are needed to design, evaluate, and improve the BCI algorithms. In this work, we release a large and longitudinal dataset collected during a study that examined how individuals learn to control SMR-BCIs. The dataset contains over 600 hours of EEG recordings collected during online and continuous BCI control from 62 healthy, (mostly) right-hand-dominant adult participants, across (up to) 11 training sessions per participant. The data record consists of 598 recording sessions, and over 250,000 trials of 4 different motor-imagery-based BCI tasks. The current dataset presents one of the largest and most complex SMR-BCI datasets publicly available to date and should be useful for the development of improved algorithms for BCI control.
, , Guido Lemoine, , , Ian McCallum, Hadi, Florian Kraxner, Frédéric Achard, Steffen Fritz
Published: 30 March 2021
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00867-1

Abstract:
In recent decades, global oil palm production has shown an abrupt increase, with almost 90% produced in Southeast Asia alone. To understand trends in oil palm plantation expansion and for landscape-level planning, accurate maps are needed. Although different oil palm maps have been produced using remote sensing in the past, here we use Sentinel 1 imagery to generate an oil palm plantation map for Indonesia, Malaysia and Thailand for the year 2017. In addition to location, the age of the oil palm plantation is critical for calculating yields. Here we have used a Landsat time series approach to determine the year in which the oil palm plantations are first detected, at which point they are 2 to 3 years of age. From this, the approximate age of the oil palm plantation in 2017 can be derived.
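The age derivation described in this abstract amounts to simple arithmetic, sketched below. The function name and parameters are hypothetical; only the 2-to-3-year age at first detection and the 2017 reference year come from the abstract.

```python
def plantation_age_range(first_detected_year, reference_year=2017,
                         age_at_detection=(2, 3)):
    """Return the (min, max) plantation age in years at reference_year.

    Assumes, as the abstract states, that a plantation is 2 to 3 years old
    in the year it is first detected in the Landsat time series.
    """
    if first_detected_year > reference_year:
        raise ValueError("first detection cannot be after the reference year")
    elapsed = reference_year - first_detected_year
    youngest, oldest = age_at_detection
    return (elapsed + youngest, elapsed + oldest)


# A plantation first detected in 2010 would be roughly 9 to 10 years old in 2017.
age_2017 = plantation_age_range(2010)
```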
Published: 26 March 2021
Scientific Data, Volume 8, pp 1-13; doi:10.1038/s41597-021-00862-6

Abstract:
The Gravity Recovery And Climate Experiment (GRACE) satellite mission recorded temporal variations in the Earth’s gravity field, which are then converted to Total Water Storage Change (TWSC) fields representing an anomaly in the water mass stored in all three physical states, on and below the surface of the Earth. GRACE provided a first global observational record of water mass redistribution at spatial scales greater than 63000 km². This coarse resolution limits the usability of GRACE data in regional hydrological applications. In this study, we implement a statistical downscaling approach that assimilates 0.5° × 0.5° water storage fields from the WaterGAP hydrology model (WGHM), precipitation fields from 3 models, evapotranspiration and runoff from 2 models, with GRACE data to obtain TWSC at a 0.5° × 0.5° grid. The downscaled product exploits dominant common statistical modes between all the hydrological datasets to improve the spatial resolution of GRACE. We also provide open access to scripts that researchers can use to produce downscaled TWSC fields with input observations and models of their own choice.
, Eva Šilarová, Jana Škorpilová, Hany Alonso, Marc Anton, Ainars Aunins, Zoltán Benkö, Gilles Biver, Malte Busch, Tomasz Chodkiewicz, et al.
Published: 26 March 2021
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00804-2

Abstract:
Around fifteen thousand fieldworkers annually count breeding birds using standardized protocols in 28 European countries. The observations are collected by using country-specific and standardized protocols, validated, summarized and finally used for the production of continent-wide annual and long-term indices of population size changes of 170 species. Here, we present the database and provide a detailed summary of the methodology used for fieldwork and calculation of the relative population size change estimates. We also provide a brief overview of how the data are used in research, conservation and policy. We believe this unique database, based on decades of bird monitoring alongside the comprehensive summary of its methodology, will facilitate and encourage further use of the Pan-European Common Bird Monitoring Scheme results.
Rafael Fogaça de Almeida, Matheus Fernandes,
Published: 25 March 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00818-w

Abstract:
In humans and other eukaryotes, histone post-translational modifications (hPTMs) play an essential role in the epigenetic control of gene expression. In trypanosomatid parasites, conversely, gene regulation occurs mainly at the post-transcriptional level. However, our group has recently shown that hPTMs are abundant and varied in Trypanosoma cruzi, the etiological agent of Chagas Disease, signaling for possible conserved epigenetic functions. Here, we applied an optimized mass spectrometry-based proteomic workflow to provide a high-confidence comprehensive map of hPTMs, distributed in all canonical, variant and linker histones of T. cruzi. Our work expands the number of known T. cruzi hPTMs by almost 2-fold, representing the largest dataset of hPTMs available for any trypanosomatid to date, and can be used as a basis for functional studies on the dynamic regulation of chromatin by epigenetic mechanisms and the selection of candidates for the development of epigenetic drugs against trypanosomatids.
, Ching-Huei Tsou, Ananya Poddar, , , Piyush Madan, Anshul Agrawal, Charles Wachira, Osebe Mogaka Samuel, Osnat Bar-Shira, et al.
Published: 25 March 2021
Scientific Data, Volume 8, pp 1-14; doi:10.1038/s41597-021-00878-y

Abstract:
The Coronavirus disease 2019 (COVID-19) global pandemic has transformed almost every facet of human society throughout the world. Against an emerging, highly transmissible disease, governments worldwide have implemented non-pharmaceutical interventions (NPIs) to slow the spread of the virus. Examples of such interventions include community actions, such as school closures or restrictions on mass gatherings, individual actions including mask wearing and self-quarantine, and environmental actions such as cleaning public facilities. We present the Worldwide Non-pharmaceutical Interventions Tracker for COVID-19 (WNTRAC), a comprehensive dataset consisting of over 6,000 NPIs implemented worldwide since the start of the pandemic. WNTRAC covers NPIs implemented across 261 countries and territories, and classifies NPIs into a taxonomy of 16 NPI types. NPIs are automatically extracted daily from Wikipedia articles using natural language processing techniques and then manually validated to ensure accuracy and veracity. We hope that the dataset will prove valuable for policymakers, public health leaders, and researchers in modeling and analysis efforts to control the spread of COVID-19.
, , Ismini Lourentzou, Joy T. Wu, Arjun Sharma, Matthew Tong, Shafiq Abedin, David Beymer, Vandana Mukherjee, , et al.
Published: 25 March 2021
Scientific Data, Volume 8, pp 1-18; doi:10.1038/s41597-021-00863-5

Abstract:
We developed a rich dataset of Chest X-Ray (CXR) images to assist investigators in artificial intelligence. The data were collected using an eye-tracking system while a radiologist reviewed and reported on 1,083 CXR images. The dataset contains the following aligned data: CXR image, transcribed radiology report text, radiologist’s dictation audio and eye gaze coordinates data. We hope this dataset can contribute to various areas of research particularly towards explainable and multimodal deep learning/machine learning methods. Furthermore, investigators in disease classification and localization, automated radiology report generation, and human-machine interaction can benefit from these data. We report deep learning experiments that utilize the attention maps produced by the eye gaze dataset to show the potential utility of this dataset.
Linda Milne, , Paulo Rapazote-Flores, Claus-Dieter Mayer, ,
Published: 25 March 2021
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00872-4

Abstract:
A high-quality barley gene reference transcript dataset (BaRTv1.0) was used to quantify gene and transcript abundances from 22 RNA-seq experiments, covering 843 separate samples. Using the abundance data we developed a Barley Expression Database (EORNA*) to underpin a visualisation tool that displays comparative gene and transcript abundance data on demand as transcripts per million (TPM) across all samples and all the genes. EORNA provides gene and transcript models for all of the transcripts contained in BaRTv1.0, and these can be conveniently identified through either BaRT or HORVU gene names, or by direct BLAST of query sequences. Browsing the quantification data reveals cultivar, tissue and condition specific gene expression and shows changes in the proportions of individual transcripts that have arisen via alternative splicing. TPM values can be easily extracted to allow users to determine the statistical significance of observed transcript abundance variation among samples or perform meta analyses on multiple RNA-seq experiments. * Eòrna is the Scottish Gaelic word for Barley.
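For readers unfamiliar with the TPM unit mentioned above, the sketch below applies the standard definition: each transcript's read count is divided by its length, and the resulting rates are scaled to sum to one million. It is not necessarily the quantification pipeline used to build EORNA, and the transcript names and numbers are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Convert read counts to transcripts per million (TPM).

    counts: mapping of transcript id -> mapped read count
    lengths_kb: mapping of transcript id -> transcript length in kilobases
    """
    # Length-normalize first so long transcripts are not over-represented.
    rates = {t: counts[t] / lengths_kb[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Scale so the values in one sample sum to one million.
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical sample with two transcripts of different lengths.
example = tpm({"tx_a": 10, "tx_b": 20}, {"tx_a": 1.0, "tx_b": 2.0})
```

Because TPM values sum to the same total in every sample, they describe relative proportions within a sample rather than absolute abundance.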
Rezarta Islamaj, Robert Leaman, Sun Kim, Dongseop Kwon, Chih-Hsuan Wei, Donald C. Comeau, , David Cissel, Cathleen Coss, Carol Fisher, et al.
Published: 25 March 2021
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00875-1

Abstract:
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.
, , , Marwan Cheikh Albassatneh, Juan Arroyo, Gianluigi Bacchetta, Francesca Bagnoli, Zoltán Barina, Manuel Cartereau, , et al.
Published: 23 March 2021
Scientific Data, Volume 8, pp 1-11; doi:10.1038/s41597-021-00873-3

Abstract:
Trees play a key role in the structure and function of many ecosystems worldwide. In the Mediterranean Basin, forests cover approximately 22% of the total land area hosting a large number of endemics (46 species). Despite its particularities and vulnerability, the biodiversity of Mediterranean trees is not well known at the taxonomic, spatial, functional, and genetic levels required for conservation applications. The WOODIV database fills this gap by providing reliable occurrences, four functional traits (plant height, seed mass, wood density, and specific leaf area), and sequences from three DNA-regions (rbcL, matK, and trnH-psbA), together with modelled occurrences and a phylogeny for all 210 Euro-Mediterranean tree species. We compiled, homogenized, and verified occurrence data from sparse datasets and collated them on an INSPIRE-compliant 10 × 10 km grid. We also gathered functional trait and genetic data, filling existing gaps where possible. The WOODIV database can benefit macroecological studies in the fields of conservation, biogeography, and community ecology.
, Paolo Boncio, Bruno Pace, Gerald Roberts, Lucilla Benedetti, , Francesco Visini,
Published: 22 March 2021
Scientific Data, Volume 8, pp 1-20; doi:10.1038/s41597-021-00868-0

Abstract:
We present a database of field data for active faults in the central Apennines, Italy, including trace, fault and main fault locations with activity and location certainties, and slip-rate, slip-vector and surface geometry data. As advances occur in our capability to create more detailed fault-based hazard models, depending on the availability of primary data and observations, it is desirable that such data can be organized in a way that is easily understood and incorporated into present and future models. The database structure presented herein aims to assist this process. We recommend stating what observations have led to different location and activity certainty and presenting slip-rate data with point location coordinates of where the data were collected with the time periods over which they were calculated. Such data reporting allows more complete uncertainty analyses in hazard and risk modelling. The data and maps are available as kmz, kml, and geopackage files with the data presented in spreadsheet files and the map coordinates as txt files. The files are available at: 10.1594/PANGAEA.922582.
, Maite M. van der Miesen, Tinka Beemsterboer, Andries van der Leij, Annemarie Eigenhuis,
Published: 19 March 2021
Scientific Data, Volume 8, pp 1-23; doi:10.1038/s41597-021-00870-6

Abstract:
We present the Amsterdam Open MRI Collection (AOMIC): three datasets with multimodal (3 T) MRI data including structural (T1-weighted), diffusion-weighted, and (resting-state and task-based) functional BOLD MRI data, as well as detailed demographics and psychometric variables from a large set of healthy participants (N = 928, N = 226, and N = 216). Notably, task-based fMRI was collected during various robust paradigms (targeting naturalistic vision, emotion perception, working memory, face perception, cognitive conflict and control, and response inhibition) for which extensively annotated event-files are available. For each dataset and data modality, we provide the data in both raw and preprocessed form (both compliant with the Brain Imaging Data Structure), which were subjected to extensive (automated and manual) quality control. All data is publicly available from the OpenNeuro data sharing platform.
Correction
, Nicholas Kurtansky, , Liam Caffery, Emmanouil Chousakos, Noel Codella, Marc Combalia, Stephen Dusza, Pascale Guitera, David Gutman, et al.
Published: 16 March 2021
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00879-x

Abstract:
A Correction to this paper has been published: https://doi.org/10.1038/s41597-021-00879-x.
, Eva M. Kovacs, Kathryn Markey, Julie Vercelloni, Alberto Rodriguez-Ramirez, , Manuel Gonzalez-Rivero, ,
Published: 16 March 2021
Scientific Data, Volume 8, pp 1-7; doi:10.1038/s41597-021-00871-5

Abstract:
This paper describes benthic coral reef community composition point-based field data sets derived from georeferenced photoquadrats using machine learning. Annually over a 17-year period (2002–2018), data were collected using downward-looking photoquadrats that capture an approximately 1 m² footprint along 100 m–1500 m transect surveys distributed along the reef slope and across the reef flat of Heron Reef (28 km²), Southern Great Barrier Reef, Australia. Benthic community composition for the photoquadrats was automatically interpreted through deep learning, following initial manual calibration of the algorithm. The resulting data sets support understanding of coral reef biology, ecology, mapping and dynamics. Similar methods to derive the benthic data have been published for seagrass habitats; however, here we have adapted the methods for application to coral reef habitats, with the integration of automatic photoquadrat analysis. The approach presented is globally applicable for various submerged and benthic community ecological applications, and provides the basis for further studies at this site, regional to global comparative studies, and for the design of similar monitoring programs elsewhere.