Results in Journal Scientific Data: 1,756

(searched for: journal_id:(1215694))
Yiming Chen, Chi Chen, Chen Zheng, Shyam Dwaraknath, Matthew K Horton, Jordi Cabana, John Rehr, John Vinson, Alan Dozier, Joshua J Kas, et al.
Published: 11 June 2021
Scientific Data, Volume 8; doi:10.1038/s41597-021-00936-5

The publisher has not yet granted permission to display this abstract.
Carlos Gaete-Morales, Hendrik Kramer, , Alexander Zerrahn
Published: 11 June 2021
Scientific Data, Volume 8; doi:10.1038/s41597-021-00932-9

The publisher has not yet granted permission to display this abstract.
, , Fagner de O. Bernardo, Alessandra H. G. Tobias, Paulo H. C. Oliveira, Tales M. Machado, Caio S. Costa, , , , et al.
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00933-8

Abstract:
Amidst the current health crisis and social distancing, telemedicine has become an important part of mainstream healthcare, and building and deploying computational tools to support screening more efficiently is an increasing medical priority. The early identification of cervical cancer precursor lesions by Pap smear test can identify candidates for subsequent treatment. However, one of the main challenges is the accuracy of the conventional method, which is often subject to high rates of false negatives. While machine learning has been highlighted as a way to reduce the limitations of the test, the absence of high-quality curated datasets has prevented the development of strategies to improve cervical cancer screening. The Center for Recognition and Inspection of Cells (CRIC) platform enables the creation of the CRIC Cervix collection, currently with 400 images (1,376 × 1,020 pixels) curated from conventional Pap smears, with manual classification of 11,534 cells. This collection has the potential to advance current efforts in training and testing machine learning algorithms for the automation of tasks as part of the cytopathological analysis in the routine work of laboratories.
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00931-w

Abstract:
European plethodontid salamanders (genus Speleomantes; formerly Hydromantes) are a group of eight strictly protected amphibian species which are sensitive to human-induced environmental changes. Long-term monitoring is highly recommended to evaluate their status and to assess potential threats. Here we used two low-impact methodologies to build up a large dataset on two mainland Speleomantes species (S. strinatii and S. ambrosii), which represents an update to two previously published datasets, but also includes several new populations. Specifically, we provide a set of 851 high quality images and a table gathering stomach contents recognized from 560 salamanders. This dataset offers the opportunity to analyse phenotypic traits and stomach contents of eight populations belonging to two Speleomantes species. Furthermore, because the data were collected over different periods, the potential analyses can be extended across a wide temporal scale, allowing long-term studies.
, Raymond H. Grossman, Elliot G. Mitchell, Chunhua Weng, Karthik Natarajan, George Hripcsak, David K. Vawdrey
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00929-4

Abstract:
The recognition, disambiguation, and expansion of medical abbreviations and acronyms is of utmost importance to prevent medically dangerous misinterpretation in natural language processing. To support recognition, disambiguation, and expansion, we present the Medical Abbreviation and Acronym Meta-Inventory, a deep database of medical abbreviations. A systematic harmonization of eight source inventories across multiple healthcare specialties and settings identified 104,057 abbreviations with 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning reduced redundancy, which simplifies future application. Additional features include semi-automated quality control to remove errors. The Meta-Inventory demonstrated high completeness or coverage of abbreviations and senses in new clinical text, a substantial improvement over the next largest repository (6–14% increase in abbreviation coverage; 28–52% increase in sense coverage). To our knowledge, the Meta-Inventory is the most complete compilation of medical abbreviations and acronyms in American English to date. The multiple sources and high coverage support application in varied specialties and settings. This allows for cross-institutional natural language processing, which previous inventories did not support. The Meta-Inventory is available at https://bit.ly/github-clinical-abbreviations.
Scientific Data, Volume 8; doi:10.1038/s41597-021-00907-w

Abstract:
The analysis of energy scenarios for future energy systems requires appropriate data. However, while more or less detailed data on energy production is often available, appropriate data on energy consumption is often scarce. In our JERICHO-E-usage dataset, we provide comprehensive data on useful energy consumption patterns for heat, cold, mechanical energy, information and communication, and light in high spatial and temporal resolution. Furthermore, we distinguish between residential, industrial, commerce, and mobility consumers. For our dataset, we aggregate bottom-up data and disaggregate top-down data both to the NUTS2 level. The NUTS2 level serves as an interface to validate our combined method approach and the calculations. We combine a multitude of data sources such as weather time series, standard load profiles, census data, movement data, and employment figures to increase the scope, validity, and reproducibility for energy system modeling. The focus of our JERICHO-E-usage dataset on useful energy consumption might be of particular interest to researchers who analyze energy scenarios where renewable electricity is largely substituted for fossil fuel (sector coupling).
, , Nigel Goddard, Niklas Berliner, Lynda Webb, Myroslava Dzikovska, , Janek Mann, Charles Sutton, Janette Webb, et al.
Scientific Data, Volume 8, pp 1-18; doi:10.1038/s41597-021-00921-y

Abstract:
The IDEAL household energy dataset described here comprises electricity, gas and contextual data from 255 UK homes over a 23-month period ending in June 2018, with a mean participation duration of 286 days. Sensors gathered 1-second electricity data, pulse-level gas data, 12-second temperature, humidity and light data for each room, and 12-second temperature data from boiler pipes for central heating and hot water. 39 homes also included plug-level monitoring of selected electrical appliances, real-power measurement of mains electricity and key sub-circuits, and more detailed temperature monitoring of gas- and heat-using equipment, including radiators and taps. Survey data included occupant demographics, values, attitudes and self-reported energy awareness, household income, energy tariffs, and building, room and appliance characteristics. Linked secondary data comprises weather and level of urbanisation. The data is provided in comma-separated format with a custom-built API to facilitate usage, and has been cleaned and documented. The data has a wide range of applications, including investigating energy demand patterns and drivers, modelling building performance, and undertaking Non-Intrusive Load Monitoring research.
Hendrika M. Duivenvoorden, Natasha K. Brockwell, , ,
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00924-9

Abstract:
Understanding how cancer cells interact with the surrounding microenvironment early in breast cancer development can provide insight into the initiation and progression of invasive breast cancers. The myoepithelial cell layer surrounding breast ducts acts as a physical barrier in early breast cancer, preventing cancer cells from invading the surrounding stroma. Changes to the expression profile and properties of myoepithelial cells have been implicated in progression to invasive carcinoma. Identifying the molecular drivers of myoepithelial cell-mediated tumour suppression may offer new approaches to predict and block the earliest stages of cancer invasion. We employed a high-content approach to knock down 87 different genes using siRNA in an immortalised myoepithelial cell line, prior to co-culture with invasive breast cancer cells in 3D. Combined with high-content imaging and a customised analysis pipeline, this system was used to identify myoepithelial proteins that are necessary to control cancer cell invasion. This dataset has identified prospective myoepithelial suppressors of early breast cancer invasion which may be used by researchers to investigate their clinical validity and utility.
, , Jiancheng Shi, Tianjie Zhao, Kun Yang, Michael H. Cosh, Daniel J. Short Gianotti, Dara Entekhabi
Scientific Data, Volume 8, pp 1-16; doi:10.1038/s41597-021-00925-8

Abstract:
Long-term surface soil moisture (SSM) data with stable and consistent quality are critical for global environment and climate change monitoring. L-band radiometers onboard the recently launched Soil Moisture Active Passive (SMAP) Mission can provide SSM with state-of-the-art accuracy, while the Advanced Microwave Scanning Radiometer for EOS (AMSR-E) and AMSR2 series provide long-term observational records from multi-frequency radiometers (C, X, and K bands). This study transfers the merits of SMAP to AMSR-E/2, and develops a global daily SSM dataset (named NNsm) with stable and consistent quality at a 36 km resolution (2002–2019). The NNsm can reproduce the SMAP SSM accurately, with a global Root Mean Square Error (RMSE) of 0.029 m³/m³. NNsm also compares well with in situ SSM observations, and outperforms AMSR-E/2 standard SSM products from JAXA and LPRM. This global observation-driven dataset spans nearly two decades at present, and is extendable through the ongoing AMSR2 and upcoming AMSR3 missions for long-term studies of climate extremes, trends, and decadal variability.
, Vajira Thambawita, Steven A. Hicks, Henrik Gjestang, Oda Olsen Nedrejord, Espen Næss, Hanna Borgli, , Tor Jan Derek Berstad, Sigrun L. Eskeland, et al.
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00920-z

Abstract:
Artificial intelligence (AI) is predicted to have profound effects on the future of video capsule endoscopy (VCE) technology. The potential lies in improving anomaly detection while reducing manual labour. Existing work demonstrates the promising benefits of AI-based computer-assisted diagnosis systems for VCE. It also shows great potential for improvements to achieve even better results. However, medical data is often sparse and unavailable to the research community, and qualified medical personnel rarely have time for the tedious labelling work. We present Kvasir-Capsule, a large VCE dataset collected from examinations at a Norwegian hospital. Kvasir-Capsule consists of 117 videos which can be used to extract a total of 4,741,504 image frames. We have labelled and medically verified 47,238 frames with a bounding box around findings from 14 different classes. In addition to these labelled images, there are 4,694,266 unlabelled frames included in the dataset. The Kvasir-Capsule dataset can play a valuable role in developing better algorithms in order to reach the true potential of VCE technology.
, Piljae Im, Yeonjin Bae, Jeff Munk, Teja Kuruganti, Brian Fricke
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00927-6

Abstract:
HVAC and refrigeration system fault detection and diagnostics (FDD) has attracted extensive studies for decades; however, FDD of supermarket refrigeration systems has not gained significant attention. Supermarkets consume around 50 kWh/ft² of electricity annually. The biggest consumer of energy in a supermarket is its refrigeration system, which accounts for 40%–60% of its total electricity usage and is equivalent to about 2%–3% of the total energy consumed by commercial buildings in the United States. Also, the supermarket refrigeration system is one of the biggest consumers of refrigerants. Reducing refrigerant usage or using environmentally friendly alternatives can result in significant climate benefits. A challenge is the lack of publicly available data sets to benchmark the system performance and record the faulted performance. This paper identifies common faults of supermarket refrigeration systems and conducts an experimental study to collect the faulted performance data and analyze these faults. This work provides a foundation for future research on the development of FDD methods and field automated FDD implementation.
, , Julian Gorfer, Gaurav Pujar, Sebastian Wesselmecking, Ulrich Krupp, Stefano Bromuri
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00926-7

Abstract:
Studying steel microstructures yields important insights regarding its mechanical characteristics. Within steel, microstructures transform based on a multitude of factors including chemical composition, transformation temperatures, and cooling rates. Martensite-austenite (MA) islands in bainitic steel appear as blocky structures with abstract shapes that are difficult to identify and differentiate from other types of microstructures. In this regard, material science may benefit from machine learning models that are able to automatically and accurately detect these structures. However, the training process of state-of-the-art machine learning models requires a large amount of high-quality data. In this dataset, we provide 1,705 scanning electron microscopy images along with a set of 8,909 expert-annotated polygons to describe the geometry of the MA islands that appear on the images. We envision that this dataset will be useful for material scientists to explore the relationship between the morphology of bainitic steel and mechanical characteristics. Moreover, computer vision researchers and practitioners may use this data for training state-of-the-art object segmentation models for abstract geometries such as MA islands.
, , Trevor J. Popp, Thea Quistgaard, , , Camilla-Marie Jensen, Mika Lanzky, Anine-Maria Lütt, Vasileios Mandrakis, et al.
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00916-9

Abstract:
We report high resolution measurements of the stable isotope ratios of ancient ice (δ¹⁸O, δD) from the North Greenland Eemian deep ice core (NEEM, 77.45° N, 51.06° E). The record covers the period 8–130 ky b2k (y before 2000) with a temporal resolution of ≈0.5 and 7 y at the top and the bottom of the core respectively and contains important climate events such as the 8.2 ky event, the last glacial termination and a series of glacial stadials and interstadials. At its bottom part the record contains ice from the Eemian interglacial. Isotope ratios are calibrated on the SMOW/SLAP scale and reported on the GICC05 (Greenland Ice Core Chronology 2005) and AICC2012 (Antarctic Ice Core Chronology 2012) time scales interpolated accordingly. We also provide estimates for measurement precision and accuracy for both δ¹⁸O and δD.
, , Julie A. Hollis, Nicholas J. Gardiner, Chris Yakymchuk, , , Milo Barham, Bradley J. McDonald, Noreen J. Evans, et al.
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00922-x

Abstract:
Zircon U-Pb geochronology places high-temperature geological events into temporal context. Here, we present a comprehensive zircon U-Pb geochronology dataset for the Meso- to Neoarchean Maniitsoq region in southwest Greenland, which includes the Akia Terrane, Tuno Terrane, and the intervening Alanngua Complex. The magmatic and metamorphic processes recorded in these terranes straddle a key change-point in early Earth geodynamics. This dataset comprises zircon U-Pb ages for 121 samples, including 46 that are newly dated. A principal crystallization peak occurs across all three terranes at ca. 3000 Ma, with subordinate crystallization age peaks at 3200 Ma (Akia Terrane and Alanngua Complex only), 2720 Ma and 2540 Ma. Metamorphic age peaks occur at 2990 Ma, 2820–2700 Ma, 2670–2600 Ma and 2540 Ma. Except for one sample, all dated metamorphic zircon growth after the Neoarchean occurred in the Alanngua Complex or within 20 km of its boundaries. This U-Pb dataset provides an important resource for addressing Earth Science topics as diverse as crustal evolution, fluid–rock interaction and mineral deposit genesis.
Falk Lüsebrink, Hendrik Mattern, Renat Yakupov, , Mohammad Ashtarayeh, Steffen Oeltze-Jafra, Oliver Speck
Scientific Data, Volume 8, pp 1-13; doi:10.1038/s41597-021-00923-w

Abstract:
Here, we present an extension to our previously published structural ultrahigh resolution T1-weighted magnetic resonance imaging (MRI) dataset with an isotropic resolution of 250 µm, consisting of multiple additional ultrahigh resolution contrasts. Included are up to 150 µm Time-of-Flight angiography, an updated 250 µm structural T1-weighted reconstruction, 330 µm quantitative susceptibility mapping, up to 450 µm structural T2-weighted imaging, 700 µm T1-weighted back-to-back scans, 800 µm diffusion tensor imaging, one hour continuous resting-state functional MRI with an isotropic spatial resolution of 1.8 mm as well as more than 120 other structural T1-weighted volumes together with multiple corresponding proton density weighted acquisitions collected over ten years. All data are from the same participant and were acquired on the same 7 T scanner. The repository contains the unprocessed data as well as (pre-)processing results. The data were acquired in multiple studies with individual goals. This is a unique and comprehensive collection comprising a “human phantom” dataset. Therefore, we compiled, processed, and structured the data, making them publicly available for further investigation.
, Kari Lahti, Esko Piirainen, Mikko Heikkinen, Olli Raitio, Aino Juslén
Scientific Data, Volume 8, pp 1-16; doi:10.1038/s41597-021-00919-6

Abstract:
Biodiversity informatics has advanced rapidly with the maturation of major biodiversity data infrastructures (BDDIs), such as the Global Biodiversity Information Facility sharing unprecedented data volumes. Nevertheless, taxonomic, temporal and spatial data coverage remains unsatisfactory. With an increasing data need, the global BDDIs require continuous inflow from local data mobilisation, and national BDDIs are being developed around the world. The global BDDIs are specialised in certain data types or data life cycle stages which, despite possible merits, renders the BDDI landscape fragmented and complex. That this often is repeated at the national level creates counterproductive redundancy, complicates user services, and frustrates funders. Here, we present the Finnish Biodiversity Information Facility (FinBIF) as a model of an all-inclusive BDDI. It integrates relevant data types and phases of the data life cycle, manages them under one IT architecture, and distributes the data through one service portal under one brand. FinBIF has experienced diverse funder engagement and rapid user uptake. Therefore, we suggest the integrated and inclusive approach be adopted in national BDDI development.
Correction
Prangwan Pateetin, Gyorgy Hutvagner, Sarah Bajan, Matthew P. Padula, Eileen M. McGowan,
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00928-5

, Elizabeth M. Bach, Marie L. C. Bartz, , Rémy Beugnon, Maria J. I. Briones, , Olga Ferlian, Konstantin B. Gongalsky, Carlos A. Guerra, et al.
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00912-z

Abstract:
Earthworms are an important soil taxon as ecosystem engineers, providing a variety of crucial ecosystem functions and services. Little is known about their diversity and distribution at large spatial scales, despite the availability of considerable amounts of local-scale data. Earthworm diversity data, obtained from the primary literature or provided directly by authors, were collated with information on site locations, including coordinates, habitat cover, and soil properties. Datasets were required, at a minimum, to include abundance or biomass of earthworms at a site. Where possible, site-level species lists were included, as well as the abundance and biomass of individual species and ecological groups. This global dataset contains 10,840 sites, with 184 species, from 60 countries and all continents except Antarctica. The data were obtained from 182 published articles, published between 1973 and 2017, and 17 unpublished datasets. Amalgamating data into a single global database will assist researchers in investigating and answering a wide variety of pressing questions, for example, jointly assessing aboveground and belowground biodiversity distributions and drivers of biodiversity change.
María S. López, Daniela I. Jordan, Evelyn Blatter, Elisabet Walker, Andrea A. Gómez, Gabriela V. Müller, , Michael A. Robert,
Scientific Data, Volume 8, pp 1-7; doi:10.1038/s41597-021-00914-x

Abstract:
Dengue virus (DENV) transmission occurs primarily in tropical and subtropical climates, but within the last decade it has extended to temperate regions. Santa Fe, a temperate province in Argentina, has experienced an increase in dengue cases and virus circulation since 2009, with the recent 2020 outbreak being the largest in the province to date. The aim of this work is to describe spatio-temporal fluctuations of dengue cases from 2009 to 2020 in Santa Fe Province. The data presented in this work provide a detailed description of DENV transmission for Santa Fe Province by department. These data are useful to assist in investigating drivers of dengue emergence in Santa Fe Province and for developing a better understanding of the drivers and the impacts of ongoing dengue emergence in temperate regions across the world. This work provides data useful for future studies including those investigating socio-ecological, climatic, and environmental factors associated with DENV transmission, as well as those investigating other variables related to the biology and the ecology of vector-borne diseases.
Damir Vrabac, Akshay Smit, Rebecca Rojansky, Yasodha Natkunam, Ranjana H. Advani, Andrew Y. Ng, Sebastian Fernandez-Pol,
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00915-w

Abstract:
Diffuse Large B-Cell Lymphoma (DLBCL) is the most common non-Hodgkin lymphoma. Though histologically DLBCL shows varying morphologies, no morphologic features have been consistently demonstrated to correlate with prognosis. We present a morphologic analysis of histology sections from 209 DLBCL cases with associated clinical and cytogenetic data. Duplicate tissue core sections were arranged in tissue microarrays (TMAs), and replicate sections were stained with H&E and immunohistochemical stains for CD10, BCL6, MUM1, BCL2, and MYC. The TMAs are accompanied by pathologist-annotated regions-of-interest (ROIs) that identify areas of tissue representative of DLBCL. We used a deep learning model to segment all tumor nuclei in the ROIs, and computed several geometric features for each segmented nucleus. We fit a Cox proportional hazards model to demonstrate the utility of these geometric features in predicting survival outcome, and found that it achieved a C-index (95% CI) of 0.635 (0.574,0.691). Our finding suggests that geometric features computed from tumor nuclei are of prognostic importance, and should be validated in prospective studies.
, , Leyden Fernandez, Gaëtan Martin, Gustavo A. Martinez-Rodriguez, Jatta Saarenheimo, , ,
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00910-1

Abstract:
Stratified lakes and ponds featuring steep oxygen gradients are significant net sources of greenhouse gases and hotspots in the carbon cycle. Despite their significant biogeochemical roles, the microbial communities, especially in the oxygen depleted compartments, are poorly known. Here, we present a comprehensive dataset including 267 shotgun metagenomes from 41 stratified lakes and ponds mainly located in the boreal and subarctic regions, but also including one tropical reservoir and one temperate lake. For most lakes and ponds, the data includes a vertical sample set spanning from the oxic surface to the anoxic bottom layer. The majority of the samples were collected during the open water period, but also a total of 29 samples were collected from under the ice. In addition to the metagenomic sequences, the dataset includes environmental variables for the samples, such as oxygen, nutrient and organic carbon concentrations. The dataset is ideal for further exploring the microbial taxonomic and functional diversity in freshwater environments and potential climate change impacts on the functioning of these ecosystems.
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00913-y

Abstract:
Micro-CT provides critical data for musculoskeletal research, yielding three-dimensional datasets containing distributions of mineral density. Using high-resolution scans, we quantified changes in the fine architecture of bone in the spine of young mice. This data is made available as a reference to physiological cancellous bone growth. The scans (n = 19) depict the extensive structural changes typical for female C57BL/6 mice pups, aged 1-, 3-, 7-, 10- and 14-days post-partum, as they attain the mature geometry. We reveal the micro-morphology down to individual trabeculae in the spine that follow phases of mineral-tissue rearrangement in the growing lumbar vertebra on a micrometer length scale. Phantom data is provided to facilitate mineral density calibration. Conventional histomorphometry matched with our micro-CT data on selected samples confirms the validity and accuracy of our 3D scans. The data may thus serve as a reference for modeling normal bone growth and can be used to benchmark other experiments assessing the effects of biomaterials, tissue growth, healing, and regeneration.
Tashina Petersson, , , Marta Antonelli, Katarzyna Dembska, , Alessandra Varotto,
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00909-8

Abstract:
Informing and engaging citizens to adopt sustainable diets is a key strategy for reducing global environmental impacts of the agricultural and food sectors. In this respect, the first requisite to support citizens and actors of the food sector is to provide them with a publicly available, reliable and ready-to-use synthesis of environmental pressures associated with food commodities. Here we introduce the SU-EATABLE LIFE database, a multilevel database of carbon (CF) and water (WF) footprint values of food commodities, based on a standardized methodology to extract information and assign optimal footprint values and uncertainties to food items, starting from peer-reviewed articles and grey literature. The database and its innovative methodological framework for uncertainty treatment and data quality assurance provide a solid basis for evaluating the impact of dietary shifts on global environmental policies, including climate mitigation through greenhouse gas emission reductions. The database ensures repeatability and further expansion, providing a reliable science-based tool for managers and researchers in the food sector.
Kai Furuya, Tao Wu, Ai Orimoto, Eriko Sugano, , , Takahiro Kurose, Yoshihiro Takai,
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00908-9

Abstract:
Cellular immortalization enables indefinite expansion of cultured cells. However, the process of cell immortalization sometimes changes the original nature of primary cells. In this study, we performed expression profiling of poly A-tailed RNA from primary and immortalized corneal epithelial cells expressing Simian virus 40 large T antigen (SV40) or the combination of mutant cyclin-dependent kinase 4 (CDK4), cyclin D1, and telomere reverse transcriptase (TERT). Furthermore, we studied the expression profile of SV40 cells cultured in medium with or without serum. The profiling of whole expression pattern revealed that immortalized corneal epithelial cells with SV40 showed a distinct expression pattern from wild-type cells regardless of the presence or absence of serum, while corneal epithelial cells with combinatorial expression showed an expression pattern relatively closer to that of wild-type cells.
Correction
, , Guido Lemoine, , , Ian McCallum, Hadi, Florian Kraxner, Frédéric Achard, Steffen Fritz
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00917-8

, Thomas Wahl
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00906-x

Abstract:
Storm surges are among the deadliest coastal hazards and understanding how they have been affected by climate change and variability in the past is crucial to prepare for the future. However, tide gauge records are often too short to assess trends and perform robust statistical analyses. Here we use a data-driven modeling framework to simulate daily maximum surge values at 882 tide gauge locations across the globe. We use five different atmospheric reanalysis products for the storm surge reconstruction, the longest one going as far back as 1836. The data that we generate can be used, for example, for long-term trend analyses of the storm surge climate and identification of regions where changes in the intensity and/or frequency of storms surges have occurred in the past. It also provides a better basis for robust extreme value analysis, especially for tide gauges where observational records are short. The data are made available for public use through an interactive web-map as well as a public data repository.
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00905-y

Abstract:
Here, we describe a dataset with information about monogenic, rare diseases with a known genetic background, supplemented with manually extracted provenance for the disease itself and the discovery of the underlying genetic cause. We assembled a collection of 4166 rare monogenic diseases and linked them to 3163 causative genes, annotated with OMIM and Ensembl identifiers and HGNC symbols. The PubMed identifiers of the scientific publications that first described the rare diseases, and of the publications that identified the genes causing the diseases, were added using information from OMIM, PubMed, Wikipedia, whonamedit.com, and Google Scholar. The data are available under a CC0 license as a spreadsheet and as RDF in a semantic model modified from DisGeNET, and were added to Wikidata. This dataset relies on publicly available data and publications with a PubMed identifier, but through our effort to make the data interoperable and linked, we can now analyse this data. Our analysis revealed the timeline of rare disease and causative gene discovery and links them to developments in methods.
Correction
, , , Marwan Cheikh Albassatneh, Juan Arroyo, Gianluigi Bacchetta, Francesca Bagnoli, Zoltán Barina, Manuel Cartereau, , et al.
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00911-0

, Milan Kilibarda, Dragutin Protić,
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00901-2

Abstract:
We produced the first daily gridded meteorological dataset at a 1-km spatial resolution across Serbia for 2000–2019, named MeteoSerbia1km. The dataset consists of five daily variables: maximum, minimum and mean temperature, mean sea-level pressure, and total precipitation. In addition to daily summaries, we produced monthly and annual summaries, and daily, monthly, and annual long-term means. Daily gridded data were interpolated using the Random Forest Spatial Interpolation methodology, based on using the nearest observations and distances to them as spatial covariates, together with environmental covariates to make a random forest model. The accuracy of the MeteoSerbia1km daily dataset was assessed using nested 5-fold leave-location-out cross-validation. All temperature variables and sea-level pressure showed high accuracy, although accuracy was lower for total precipitation, due to the discontinuity in its spatial distribution. MeteoSerbia1km was also compared with the E-OBS dataset with a coarser resolution: both datasets showed similar coarse-scale patterns for all daily meteorological variables, except for total precipitation. As a result of its high resolution, MeteoSerbia1km is suitable for further environmental analyses.
Dheeraj Rathee, , Sujit Roy,
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00899-7

Abstract:
Recent advancements in magnetoencephalography (MEG)-based brain-computer interfaces (BCIs) have shown great potential. However, the performance of current MEG-BCI systems is still inadequate, and one of the main reasons for this is the unavailability of open-source MEG-BCI datasets. MEG systems are expensive, and hence MEG datasets are not readily available for researchers to develop effective and efficient BCI-related signal processing algorithms. In this work, we release a 306-channel MEG-BCI dataset recorded at a 1 kHz sampling frequency during four mental imagery tasks (i.e. hand imagery, feet imagery, subtraction imagery, and word generation imagery). The dataset contains two sessions of MEG recordings performed on separate days from 17 healthy participants using a typical BCI imagery paradigm. To our knowledge, the current dataset will be the only publicly available MEG imagery BCI dataset. The dataset can be used by the scientific community towards the development of novel pattern recognition machine learning methods to detect brain activities related to motor imagery and cognitive imagery tasks using MEG signals.
Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, Moezedin Javad Rafiee, , Faranak Babaki Fard, Kaveh Samimi, Konstantinos N. Plataniotis,
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00900-3

Abstract:
Novel Coronavirus (COVID-19) has drastically overwhelmed more than 200 countries, affecting millions and claiming almost 2 million lives since its emergence in late 2019. This highly contagious disease can easily spread and, if not controlled in a timely fashion, can rapidly incapacitate healthcare systems. The current standard diagnosis method, the Reverse Transcription Polymerase Chain Reaction (RT-PCR), is time-consuming and subject to low sensitivity. Chest Radiograph (CXR), the first imaging modality to be used, is readily available and gives immediate results. However, it has notoriously lower sensitivity than Computed Tomography (CT), which can be used efficiently to complement other diagnostic methods. This paper introduces a new COVID-19 CT scan dataset, referred to as COVID-CT-MD, consisting of not only COVID-19 cases but also healthy participants and participants infected by Community Acquired Pneumonia (CAP). The COVID-CT-MD dataset, which is accompanied by lobe-level, slice-level and patient-level labels, has the potential to facilitate COVID-19 research; in particular, COVID-CT-MD can assist in the development of advanced Machine Learning (ML) and Deep Neural Network (DNN) based solutions.
, Zijing Dong, , Congyu Liao, Qiuyun Fan, W. Scott Hoge, Boris Keil, , Lawrence L. Wald, , et al.
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00904-z

Abstract:
We present a whole-brain in vivo diffusion MRI (dMRI) dataset acquired at 760 μm isotropic resolution and sampled at 1260 q-space points across 9 two-hour sessions on a single healthy participant. The creation of this benchmark dataset is possible through the synergistic use of advanced acquisition hardware and software including the high-gradient-strength Connectom scanner, a custom-built 64-channel phased-array coil, a personalized motion-robust head stabilizer, a recently developed SNR-efficient dMRI acquisition method, and parallel imaging reconstruction with advanced ghost reduction algorithm. With its unprecedented resolution, SNR and image quality, we envision that this dataset will have a broad range of investigational, educational, and clinical applications that will advance the understanding of human brain structures and connectivity. This comprehensive dataset can also be used as a test bed for new modeling, sub-sampling strategies, denoising and processing algorithms, potentially providing a common testing platform for further development of in vivo high resolution dMRI techniques. Whole brain anatomical T1-weighted and T2-weighted images at submillimeter scale along with field maps are also made available.
Uxue Ulanga, Matthew Russell, Stefano Patassini, Julie Brazzatti, Ciaren Graham, Anthony D. Whetton, Robert L. J. Graham
Scientific Data, Volume 8, pp 1-9; doi:10.1038/s41597-021-00896-w

Abstract:
Murine models are amongst the most widely used systems to study biology and pathology. Targeted quantitative proteomic analysis is a relatively new tool to interrogate such systems. Recently the need for relative quantification on hundreds to thousands of samples has driven the development of Data Independent Acquisition methods. One such technique is SWATH-MS, which in the main requires prior acquisition of mass spectra to generate an assay reference library. In stem cell research, it has been shown that pluripotency can be induced starting with a fibroblast population. In so doing, major changes in expressed proteins are inevitable. Here we have created a reference library to underpin such studies. This is inclusive of an extensively documented script to enable replication of library generation from the raw data. The documented script facilitates reuse of data and adaptation of the library to novel applications. The resulting library provides deep coverage of the mouse proteome. The library covers 29519 proteins (53% of the proteome) of which 7435 (13%) are supported by a proteotypic peptide.
Scientific Data, Volume 8, pp 1-8; doi:10.1038/s41597-021-00898-8

Abstract:
A critical shortage of ‘big’ agronomic data is placing an unnecessary constraint on the conduct of public agronomic research, imparting barriers to model development and testing. Here, we address this problem by providing a large non-relational database of agronomic trials, linked to intensive management and observational data, run under a unified experimental framework. The National Variety Trials (NVTs) represent a decade-long experimental trial network, conducted across thousands of Australian field sites using highly standardised randomised controlled designs. The NVTs contain over a million machine-measured phenotypic observations, aggregated from density-controlled populations containing hundreds of millions of plants and thousands of released plant varieties. These data are linked to hundreds of thousands of metadata observations including standardised soil tests, fertiliser and pesticide input data, crop rotation data, prior farm management practices, and in-field sensors. Finally, these data are linked to a suite of ground and remote sensing observations, arranged into interpolated daily- and ten-day aggregated time series, to capture the substantial diversity in vegetation and environmental patterns across the continent-spanning NVT network.
Scientific Data, Volume 8, pp 1-14; doi:10.1038/s41597-021-00890-2

Abstract:
Using 11 proteomics datasets, mostly available through the PRIDE database, we assembled a reference expression map for 191 cancer cell lines and 246 clinical tumour samples, across 13 lineages. We found unique peptides identified only in tumour samples despite a much higher coverage in cell lines. These were mainly mapped to proteins related to regulation of signalling receptor activity. Correlations between baseline expression in cell lines and tumours were calculated. We found these to be highly similar across all samples with most similarity found within a given sample type. Integration of proteomics and transcriptomics data showed median correlation across cell lines to be 0.58 (range between 0.43 and 0.66). Additionally, in agreement with previous studies, variation in mRNA levels was often a poor predictor of changes in protein abundance. To our knowledge, this work constitutes the first meta-analysis focusing on cancer-related public proteomics datasets. We therefore also highlight shortcomings and limitations of such studies. All data is available through PRIDE dataset identifier PXD013455 and in Expression Atlas.
Scientific Data, Volume 8, pp 1-11; doi:10.1038/s41597-021-00897-9

Abstract:
Human settlements are usually nucleated around manmade central points or distinctive natural features, forming clusters that vary in shape and size. However, population distribution in geo-sciences is often represented in the form of pixelated rasters. Rasters indicate population density at predefined spatial resolutions, but are unable to capture the actual shape or size of settlements. Here we suggest a methodology that translates high-resolution raster population data into vector-based population clusters. We use open-source data and develop an open-access algorithm tailored for low and middle-income countries with data scarcity issues. Each cluster includes unique characteristics indicating population, electrification rate and urban-rural categorization. Results are validated against national electrification rates provided by the World Bank and data from selected Demographic and Health Surveys (DHS). We find that our modeled national electrification rates are consistent with the rates reported by the World Bank, while the modeled urban/rural classification has 88% accuracy. By delineating settlements, this dataset can complement existing raster population data in studies such as energy planning, urban planning and disease response.
Correction
, , Camilla Ceccarani, Emily Fontana, Luigi A. Amoretti, Roberta J. Wright, , , Ying Taur, Miguel-Angel Perales, et al.
Scientific Data, Volume 8, pp 1-1; doi:10.1038/s41597-021-00903-0

Abstract:
A Correction to this paper has been published: https://doi.org/10.1038/s41597-021-00903-0.
, Undiagnosed Diseases Network, , Erika M. Zink, , Kent J. Bloodsworth, , , , , et al.
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00894-y

Abstract:
Every year individuals experience symptoms that remain undiagnosed by healthcare providers. In the United States, a rare disease is defined as a condition that affects fewer than 200,000 individuals. However, there are an estimated 7000 rare diseases, and an estimated 25–30 million Americans in total (7.6–9.2% of the population as of 2018) are affected by such disorders. The NIH Common Fund Undiagnosed Diseases Network (UDN) seeks to provide diagnoses for individuals with undiagnosed disease. Mass spectrometry-based metabolomics and lipidomics analyses could advance the collective understanding of individual symptoms and advance diagnoses for individuals with heretofore undiagnosed disease. Here, we report the mass spectrometry-based metabolomics and lipidomics analyses of blood plasma, urine, and cerebrospinal fluid from 148 patients within the UDN and their families, as well as from a reference population of over 100 individuals with no known metabolic diseases. The raw and processed data are available to the research community so that they might be useful in the diagnoses of current or future patients suffering from undiagnosed disorders.
Tae Woong Whon, Seung Woo Ahn, Sungjin Yang, Joon Yong Kim, Yeon Bee Kim, Yujin Kim, Ji-Man Hong, Hojin Jung, Yoon-E Choi, , et al.
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00895-x

Abstract:
ODFM is a data management system that integrates comprehensive omics information for microorganisms associated with various fermented foods, additive ingredients, and seasonings (e.g. kimchi, Korean fermented vegetables, fermented seafood, solar salt, soybean paste, vinegar, beer, cheese, sake, and yogurt). The ODFM archives genome, metagenome, metataxonome, and (meta)transcriptome sequences of fermented food-associated bacteria, archaea, eukaryotic microorganisms, and viruses; 131 bacterial, 38 archaeal, and 28 eukaryotic genomes are now available to users. The ODFM provides both the Basic Local Alignment Search Tool search-based local alignment function as well as average nucleotide identity-based genetic relatedness measurement, enabling gene diversity and taxonomic analyses of an input query against the database. Genome sequences and annotation results of microorganisms are directly downloadable, and the microbial strains registered in the archive library will be available from our culture collection of fermented food-associated microorganisms. The ODFM is a comprehensive database that covers the genomes of an entire microbiome within a specific food ecosystem, providing basic information to evaluate microbial isolates as candidate fermentation starters for fermented food production.
, Ellen M. Considine, Melissa M. Maestas, Gina Li
Scientific Data, Volume 8, pp 1-15; doi:10.1038/s41597-021-00891-1

Abstract:
We created daily concentration estimates for fine particulate matter (PM2.5) at the centroids of each county, ZIP code, and census tract across the western US, from 2008–2018. These estimates are predictions from ensemble machine learning models trained on 24-hour PM2.5 measurements from monitoring station data across 11 states in the western US. Predictor variables were derived from satellite, land cover, chemical transport model (just for the 2008–2016 model), and meteorological data. Ten-fold spatial and random CV R2 were 0.66 and 0.73, respectively, for the 2008–2016 model and 0.58 and 0.72, respectively for the 2008–2018 model. Comparing areal predictions to nearby monitored observations demonstrated overall R2 of 0.70 for the 2008–2016 model and 0.58 for the 2008–2018 model, but we observed higher R2 (>0.80) in many urban areas. These data can be used to understand spatiotemporal patterns of, exposures to, and health impacts of PM2.5 in the western US, where PM2.5 levels have been heavily impacted by wildfire smoke over this time period.
, Roberta Bardelli,
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00888-w

Abstract:
The Atlantic blue crab Callinectes sapidus is a portunid native to the western Atlantic, from New England to Uruguay. The species was introduced in Europe in 1901 where it has become invasive; additionally, a significant northward expansion has been emphasized in its native range. Here we present a harmonized global compilation of C. sapidus occurrences from native and non-native distribution ranges derived from online databases (GBIF, BISON, OBIS, and iNaturalist) as well as from unpublished and published sources. The dataset consists of 40,388 geo-referenced occurrences, 39,824 from native and 564 from non-native ranges, recorded in 53 countries. The implementation of quality controls imposed a severe reduction, in particular from online databases, of the records selected for inclusion in the dataset. In addition, a technical validation procedure was used to flag entries showing identical coordinates but different year of record, in-land occurrences and those located close to the coast. Similarly, a flagging system identified entries outside the known distribution of the species, or associated with unsuccessful introductions.
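The flagging of entries with identical coordinates but different years of record, as described in this abstract, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' validation procedure: the record fields `lat`, `lon` and `year`, the function name `flag_repeated_coordinates`, and the exact-match comparison of coordinates are all hypothetical.

```python
from collections import defaultdict


def flag_repeated_coordinates(records):
    """Return indices of records whose coordinates also occur with another year.

    records: list of dicts with hypothetical keys "lat", "lon" and "year".
    Assumption: coordinates are compared for exact equality.
    """
    years_by_coord = defaultdict(set)
    for rec in records:
        years_by_coord[(rec["lat"], rec["lon"])].add(rec["year"])
    # A record is flagged when its coordinate pair is tied to more than one year.
    return [
        i
        for i, rec in enumerate(records)
        if len(years_by_coord[(rec["lat"], rec["lon"])]) > 1
    ]


# Hypothetical occurrence records, not taken from the dataset.
records = [
    {"lat": 40.0, "lon": -74.0, "year": 1998},
    {"lat": 40.0, "lon": -74.0, "year": 2005},
    {"lat": 36.5, "lon": -6.3, "year": 2012},
]
flagged = flag_repeated_coordinates(records)
```

Flagged entries are kept rather than deleted, which mirrors the abstract's description of flagging and leaves the decision about exclusion to the data user.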
Scientific Data, Volume 8, pp 1-12; doi:10.1038/s41597-021-00893-z

Abstract:
Deep learning approaches for tomographic image reconstruction have become very effective and have been demonstrated to be competitive in the field. Comparing these approaches is a challenging task as they rely to a great extent on the data and setup used for training. With the Low-Dose Parallel Beam (LoDoPaB)-CT dataset, we provide a comprehensive, open-access database of computed tomography images and simulated low photon count measurements. It is suitable for training and comparing deep learning methods as well as classical reconstruction approaches. The dataset contains over 40000 scan slices from around 800 patients selected from the LIDC/IDRI database. The data selection and simulation setup are described in detail, and the generating script is publicly accessible. In addition, we provide a Python library for simplified access to the dataset and an online reconstruction challenge. Furthermore, the dataset can also be used for transfer learning as well as sparse and limited-angle reconstruction scenarios.
, Catherine A. Garcia, Nathan Garcia, , , , , , Rolf E. Sonnerup, , et al.
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00889-9

Abstract:
Detailed descriptions of microbial communities have lagged far behind physical and chemical measurements in the marine environment. Here, we present 971 globally distributed surface ocean metagenomes collected at high spatio-temporal resolution. Our low-cost metagenomic sequencing protocol produced 3.65 terabases of data, where the median number of base pairs per sample was 3.41 billion. The median distance between sampling stations was 26 km. The metagenomic libraries described here were collected as a part of a biological initiative for the Global Ocean Ship-based Hydrographic Investigations Program, or “Bio-GO-SHIP.” One of the primary aims of GO-SHIP is to produce high spatial and vertical resolution measurements of key state variables to directly quantify climate change impacts on ocean environments. By similarly collecting marine metagenomes at high spatiotemporal resolution, we expect that this dataset will help answer questions about the link between microbial communities and biogeochemical fluxes in a changing ocean.
, Eric C. Fields, Elizabeth A. Kensinger
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00886-y

Abstract:
While there was a necessary initial focus on physical health consequences of the COVID-19 pandemic, it is becoming increasingly clear that many have experienced significant social and mental health repercussions as well. It is important to understand the effects of the pandemic on well-being, both as the world continues to recover from the lasting impact of COVID-19 and in the eventual case of future pandemics. On March 20, 2020, we launched an online daily survey study tracking participants’ sleep and mental well-being. Repeated reports of sleep and mental health metrics were collected from participants ages 18–90 during the initial wave of the pandemic (March 20 – June 23, 2020). Given both the comprehensive nature and early start of this assessment, open access to this dataset will allow researchers to answer a range of questions regarding the psychiatric impact of the COVID-19 pandemic and the fallout left in its wake.
, Edit Herczog, , Keith Russell,
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00892-0

Abstract:
As big data, open data, and open science advance to increase access to complex and large datasets for innovation, discovery, and decision-making, Indigenous Peoples’ rights to control and access their data within these data environments remain limited. Operationalizing the FAIR Principles for scientific data with the CARE Principles for Indigenous Data Governance enhances machine actionability and brings people and purpose to the fore to resolve Indigenous Peoples’ rights to and interests in their data across the data lifecycle.
, Roberta Bottarin
Scientific Data, Volume 8, pp 1-6; doi:10.1038/s41597-021-00887-x

Abstract:
The present dataset contains information about aquatic macroinvertebrates and environmental variables collected before and after the implementation of a small “run-of-river” hydropower plant on the Saldur stream, a glacier-fed stream located in the Italian Central-Eastern Alps. Between 2015 and 2019, with two sampling events per year, we collected and identified 34,836 organisms in 6 sampling sites located within a 6 km stretch of the stream. Given the current boom of the hydropower sector worldwide, and the growing contribution of small hydropower plants to energy production, data here included may represent an important – and long advocated – baseline to assess the effects that these kinds of powerplants have on the riverine ecosystem. Moreover, since the Saldur stream is part of the International Long Term Ecological Research network, this dataset also constitutes part of the data gathered within this research programme. All samples are preserved at Eurac Research facilities.
Kevin F. Garrity,
Scientific Data, Volume 8, pp 1-10; doi:10.1038/s41597-021-00885-z

Abstract:
Wannier tight-binding Hamiltonians (WTBH) provide a computationally efficient way to predict electronic properties of materials. In this work, we develop a computational workflow for high-throughput Wannierization of density functional theory (DFT) based electronic band structure calculations. We apply this workflow to 1771 materials (1406 3D and 365 2D), and we create a database with the resulting WTBHs. We evaluate the accuracy of the WTBHs by comparing the Wannier band structures to directly calculated spin-orbit coupling DFT band structures. Our testing includes k-points outside the grid used in the Wannierization, providing an out-of-sample test of accuracy. We illustrate the use of WTBHs with a few example applications. We also develop a web-app that can be used to predict electronic properties on-the-fly using WTBH from our database. The tools to generate the Hamiltonian and the database of the WTB parameters are made publicly available through the websites https://github.com/usnistgov/jarvis and https://jarvis.nist.gov/jarviswtb.
Kangkang Tong, Ajay Singh Nagpure,
Scientific Data, Volume 8, pp 1-13; doi:10.1038/s41597-021-00853-7

Abstract:
India is the third-largest contributor to global energy-use and anthropogenic carbon emissions. India’s urban energy transitions are critical to meet its climate goals due to the country’s rapid urbanization. However, no baseline urban energy-use dataset covers all Indian urban districts in ways that align with national totals and integrate social-economic-infrastructural attributes to inform such transitions. This paper develops a novel bottom-up plus top-down approach, comprehensively integrating multiple field surveys and utilizing machine learning, to model All Urban areas’ Energy-use (AllUrE) across all 640 districts in India, merged with social-economic-infrastructural data. Energy use estimates in this AllUrE-India dataset are evaluated by comparing with reported energy-use at three scales: nation-wide, state-wide, and city-level. Spatially granular AllUrE data aggregated nationally show good agreement with national totals (<2% difference). The goodness-of-fit ranged from 0.78–0.95 for comparison with state-level totals, and 0.90–0.99 with city-level data for different sectors. The relatively strong alignment at all three spatial scales demonstrates the value of AllUrE-India data for modelling urban energy transitions consistent with national energy and climate goals.