Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles

Open Access

11 June 2008

journal article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 9 (1), 275
https://doi.org/10.1186/1471-2105-9-275

Abstract

Pancreatic cancer is the fourth leading cause of cancer death in the United States, so clinically relevant biomarkers for its early detection are urgently needed. In recent years, proteomic profiling techniques combined with various data analysis methods have been used successfully to gain critical insights into the processes and mechanisms underlying pathologic conditions, particularly cancer. However, the high dimensionality of proteomics data, combined with relatively small sample sizes, poses a significant challenge to current data mining methodology, because many standard methods cannot be applied directly. Here, we propose a novel machine learning framework in which decision-tree-based classifier ensembles, coupled with feature selection methods, are applied to proteomics data generated from premalignant pancreatic cancer. This study explores the utility of three feature selection schemes (Student t test, Wilcoxon rank sum test and genetic algorithm) for reducing the high dimensionality of a pancreatic cancer proteomic dataset. Using the top features selected by each method, we compared the prediction performance of a single decision tree algorithm, C4.5, with that of six decision-tree-based classifier ensembles (Random forest, Stacked generalization, Bagging, Adaboost, Logitboost and Multiboost). We show that the ensemble classifiers always outperform the single decision tree classifier, with greater accuracies and smaller prediction errors, when applied to a pancreatic cancer proteomics dataset. In our cross-validation framework, classifier ensembles generally have better classification accuracies than a single decision tree on this dataset, which suggests their utility in future proteomics data analysis. Additionally, the feature selection methods allow us to select biomarkers with potentially important roles in cancer development, which highlights the validity of this approach.
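
The pipeline described above (rank features with a two-sample test, keep the top few, then compare a single tree with an ensemble of trees) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the data are synthetic Gaussian values rather than mass spectra, a Welch t statistic stands in for the Student t test, a depth-1 decision stump stands in for C4.5, bagging is the only ensemble shown, and a single held-out split replaces cross validation. All function names (`welch_t`, `top_features`, `fit_stump`, `fit_bagging`, `make_data`) are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for two samples (unequal variances allowed)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_features(X, y, k):
    """Rank features by |t| between the two classes and keep the top k."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_stump(X, y, features):
    """Depth-1 tree: pick the (feature, threshold, polarity) with the best
    training accuracy; fall back to a constant majority-class prediction."""
    ones = sum(y)
    majority = 1 if 2 * ones >= len(y) else 0
    best_correct = max(ones, len(y) - ones)
    best = (features[0], -math.inf, majority)  # always predicts `majority`
    for j in features:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            hits = sum((row[j] > thr) == (label == 1) for row, label in zip(X, y))
            for polarity, correct in ((1, hits), (0, len(y) - hits)):
                if correct > best_correct:
                    best_correct, best = correct, (j, thr, polarity)
    return best


def predict_stump(stump, row):
    j, thr, polarity = stump
    return int((row[j] > thr) == bool(polarity))


def fit_bagging(X, y, features, n_trees, rng):
    """Bagging: fit one stump per bootstrap resample of the training set."""
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return stumps


def predict_bagging(stumps, row):
    """Majority vote over the bagged stumps."""
    votes = sum(predict_stump(s, row) for s in stumps)
    return int(2 * votes > len(stumps))


def make_data(n_per_class, n_features, informative, shift, rng):
    """Synthetic stand-in for spectra: Gaussian noise, with the informative
    features shifted upward in class 1."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            if label == 1:
                for j in informative:
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


rng = random.Random(0)
X_train, y_train = make_data(20, 50, (0, 1), 4.0, rng)
X_test, y_test = make_data(20, 50, (0, 1), 4.0, rng)

selected = top_features(X_train, y_train, k=2)  # few samples, many features
single = fit_stump(X_train, y_train, selected)
ensemble = fit_bagging(X_train, y_train, selected, n_trees=25, rng=rng)

single_acc = accuracy(lambda r: predict_stump(single, r), X_test, y_test)
ensemble_acc = accuracy(lambda r: predict_bagging(ensemble, r), X_test, y_test)
print("selected features:", sorted(selected))
print("single stump accuracy:", single_acc)
print("bagged stumps accuracy:", ensemble_acc)
```

On a toy problem this easy, the single stump and the bagged ensemble are likely to score similarly, so the sketch shows the mechanics of the comparison and says nothing about the results reported above. In a faithful cross-validation setup, the feature ranking would be recomputed inside each training fold so that held-out samples do not influence which features are selected.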

