Microarray data mining using landmark gene-guided clustering

Open Access

11 February 2008

journal article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 9 (1), 92
https://doi.org/10.1186/1471-2105-9-92

Abstract

Background Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters, independent of the biological context of the analysis. This is often inadequate for exploring data from different biological perspectives and gaining new insights. We propose a new clustering model that can generate multiple different sets of clusters from a single dataset, each of which highlights a different aspect of that dataset.

Results By applying our SigCalc algorithm to three yeast Saccharomyces cerevisiae datasets we show two results. First, we show that different sets of clusters can be generated from the same dataset using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, we show that many of these newfound biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in gene clusters.

Conclusion We have used the SigCalc algorithm to project the microarray data onto a completely new subspace whose coordinates are genes (called landmark genes) known to belong to a Biological Process. The projected space is not a true vector space in mathematical terms. However, we use the term subspace to refer to one of the virtually unlimited number of projected spaces that our proposed method can produce. By changing the biological process, and thus the landmark genes, we can change this subspace. We have shown how clustering on this subspace reveals new, biologically meaningful clusters that were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL license.
The source code is available [see Additional File 1] as additional material, and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analysis.
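
The idea of projecting expression data onto landmark-gene coordinates and then clustering can be illustrated with a minimal sketch. This is not the SigCalc algorithm, whose details are not given in the abstract. It assumes, purely for illustration, that each gene's coordinate along a landmark gene is the Pearson correlation between their expression profiles, and it uses a small deterministic k-means as a stand-in for whatever clustering method is applied. The function names `pearson`, `project_onto_landmarks`, and `kmeans` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def project_onto_landmarks(expression, landmarks):
    """Map each non-landmark gene to one coordinate per landmark gene.

    ASSUMPTION: the coordinate is a Pearson correlation; the published
    method may define its gene signatures differently.
    """
    return {
        gene: [pearson(profile, expression[lm]) for lm in landmarks]
        for gene, profile in expression.items()
        if gene not in landmarks
    }


def kmeans(points, k, iterations=20):
    """Tiny deterministic k-means: initial centers are the first k genes by name."""
    names = sorted(points)
    centers = [list(points[name]) for name in names[:k]]
    labels = {}
    for _ in range(iterations):
        # Assign every gene to its nearest center (squared Euclidean distance).
        labels = {
            name: min(
                range(k),
                key=lambda c: sum(
                    (a - b) ** 2 for a, b in zip(points[name], centers[c])
                ),
            )
            for name in names
        }
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [points[name] for name in names if labels[name] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Toy expression profiles over four conditions (made-up numbers).
    toy = {
        "L1": [1, 2, 3, 4],
        "L2": [1, -1, 1, -1],
        "g1": [2, 4, 6, 8],
        "g2": [1, 2, 3, 5],
        "g3": [5, 1, 5, 1],
        "g4": [3, 0, 4, 1],
    }
    projected = project_onto_landmarks(toy, ["L1", "L2"])
    print(kmeans(projected, k=2))
```

Passing a different `landmarks` list changes the coordinates of every gene and therefore can change the resulting clusters, which is the behaviour the abstract describes when the biological process is changed.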
