Uniform genomic data analysis in the NCI Genomic Data Commons
Open Access
- 22 February 2021
- journal article
- research article
- Published by Springer Science and Business Media LLC in Nature Communications
- Vol. 12 (1), 1-11
- https://doi.org/10.1038/s41467-021-21254-9
Abstract
The goal of the National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine. The initial GDC dataset includes genomic, epigenomic, proteomic, clinical and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June 2015 using an OpenStack-based private cloud. By June 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build, GRCh38, the GDC generated a variety of data types from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).