Using uncorrelated discriminant analysis for tissue classification with gene expression data
- 1 October 2004
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE/ACM Transactions on Computational Biology and Bioinformatics
- Vol. 1 (4), 181-190
- https://doi.org/10.1109/tcbb.2004.45
Abstract
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear discriminant analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable for undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called uncorrelated linear discriminant analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the generalized singular value decomposition method to handle undersampled data and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.
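The two ideas in the abstract (the scatter matrices are singular when genes outnumber samples, and the reduced features can be made mutually uncorrelated) can be illustrated with a small sketch. This is not the paper's exact procedure: it assumes a formulation that whitens the total scatter with one ordinary SVD and then diagonalizes the between-class scatter with a second SVD, in place of a dedicated GSVD routine. The function name `ulda_sketch`, the tolerance, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def ulda_sketch(X, y, tol=1e-10):
    """Illustrative ULDA-style projection (an assumption, not the paper's code).

    X: (n_samples, n_features) data matrix, y: integer class labels.
    Returns G of shape (n_features, q) such that the projected, centered
    data has identity total scatter, i.e. uncorrelated features.
    """
    classes = np.unique(y)
    mean = X.mean(axis=0)
    # H_t holds the centered samples as columns, so S_t = H_t @ H_t.T
    Ht = (X - mean).T
    # H_b holds size-weighted class-mean offsets, so S_b = H_b @ H_b.T
    Hb = np.stack(
        [np.sqrt((y == c).sum()) * (X[y == c].mean(axis=0) - mean) for c in classes],
        axis=1,
    )
    # Reduced SVD of H_t; keep only directions with nonzero singular values
    U, s, _ = np.linalg.svd(Ht, full_matrices=False)
    t = int((s > tol * s[0]).sum())
    W = U[:, :t] / s[:t]          # whitening map: W.T @ S_t @ W = I
    # Diagonalize the between-class scatter in the whitened space
    P, sb, _ = np.linalg.svd(W.T @ Hb, full_matrices=False)
    q = int((sb > tol).sum())     # at most (number of classes - 1)
    return W @ P[:, :q]

# Synthetic undersampled data: 30 samples, 200 "genes", 3 classes
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(30, 200)) + 0.5 * y[:, None]

# The within-class scatter matrix is 200 x 200 but cannot have full rank,
# which is why classical LDA (which inverts it) breaks down here
Sw = sum(
    (X[y == c] - X[y == c].mean(axis=0)).T @ (X[y == c] - X[y == c].mean(axis=0))
    for c in np.unique(y)
)
print("rank of S_w:", np.linalg.matrix_rank(Sw), "of", X.shape[1])

G = ulda_sketch(X, y)
Z = (X - X.mean(axis=0)) @ G      # reduced features
print("total scatter of reduced features:\n", np.round(Z.T @ Z, 6))
```

In this sketch the total scatter of the reduced features comes out as the identity matrix, which is one way of stating that the extracted features are uncorrelated. The rank cutoffs are simple thresholds and would need tuning on real expression data.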