A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing
Open Access
- 4 June 2021
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 31 (10), 1767-1780
- https://doi.org/10.1101/gr.275569.121
Abstract
Single-cell genomics is rapidly advancing our knowledge of the diversity of cell phenotypes, including both cell types and cell states. Driven by single-cell/-nucleus RNA sequencing (scRNA-seq), comprehensive cell atlas projects characterizing a wide range of organisms and tissues are currently underway. As a result, it is critical that the transcriptional phenotypes discovered are defined and disseminated in a consistent and concise manner. Molecular biomarkers have historically played an important role in biological research, from defining immune cell types by surface protein expression to defining diseases by their molecular drivers. Here, we describe a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, which leverages the nonlinear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that optimally capture the cell type identity represented in complete scRNA-seq transcriptional profiles. The marker genes selected provide an expression barcode that serves as both a useful tool for downstream biological investigation and the necessary and sufficient characteristics for semantic cell type definition. The use of NS-Forest to identify marker genes for human brain middle temporal gyrus cell types reveals the importance of cell signaling and noncoding RNAs in neuronal cell type identity.
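To make the binary expression scoring idea in the abstract concrete, the following is a minimal Python sketch, not the NS-Forest implementation. It assumes one plausible form of such a score: a gene scores near 1 for a target cell type when its median expression is high in that type and near zero in every other type, and near 0 when other types express it at similar levels. The random forest ranking step mentioned in the abstract is left out. The function names `binary_score` and `rank_markers`, the gene names, and the toy expression values are all hypothetical.

```python
from statistics import median


def binary_score(expr_by_cluster, target):
    """Score how 'on in target, off elsewhere' one gene is.

    expr_by_cluster maps a cluster label to a list of expression values
    for a single gene. This scoring formula is an assumption made for
    illustration; the abstract does not define the score.
    """
    target_median = median(expr_by_cluster[target])
    others = [c for c in expr_by_cluster if c != target]
    if target_median <= 0 or not others:
        return 0.0
    total = 0.0
    for cluster in others:
        ratio = median(expr_by_cluster[cluster]) / target_median
        # Clip at zero so clusters expressing more than the target
        # contribute nothing rather than a negative value.
        total += max(0.0, 1.0 - ratio)
    return total / len(others)


def rank_markers(genes, target):
    """Return gene names sorted from most to least binary for target."""
    return sorted(genes, key=lambda g: binary_score(genes[g], target), reverse=True)


# Hypothetical toy data: two genes measured in three cell types.
genes = {
    "gene_a": {"T1": [8, 9, 10], "T2": [0, 0, 1], "T3": [0, 0, 0]},
    "gene_b": {"T1": [5, 6, 7], "T2": [3, 3, 3], "T3": [6, 6, 6]},
}

print(binary_score(genes["gene_a"], "T1"))  # 1.0
print(binary_score(genes["gene_b"], "T1"))  # 0.25
print(rank_markers(genes, "T1"))            # ['gene_a', 'gene_b']
```

In this toy case gene_a behaves like an on/off marker for T1, while gene_b is expressed at comparable levels in T3 and so scores low. A score of this kind favors genes that can serve as a simple expression barcode over genes that are merely differentially expressed.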
Funding Information
- National Institutes of Health (R21-AI122100, U19-AI118626)
- California Institute for Regenerative Medicine (GC1R-06673-B)
- Wellcome Trust (208379/Z/17/Z)
- Chan Zuckerberg Initiative DAF
- Silicon Valley Community Foundation (2018-182730)
- Nederlandse Organisatie voor Wetenschappelijk Onderzoek
- BRAINSCAPES: A Roadmap from Neurogenetics to Neurobiology (NWO: 024.004.012)
- NWO TTW project 3DOMICS (NWO: 17126)