Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies
Open Access
- 8 July 2011
- journal article
- research article
- Published by Springer Science and Business Media LLC in Metabolomics
- Vol. 8 (S1), 3-16
- https://doi.org/10.1007/s11306-011-0330-3
Abstract
Partial Least Squares-Discriminant Analysis (PLS-DA) is a PLS regression method with a special binary ‘dummy’ y-variable, and it is commonly used for classification and biomarker selection in metabolomics studies. Several statistical approaches are currently in use to validate the outcomes of PLS-DA analyses, e.g. double cross-validation procedures or permutation testing. However, there is great inconsistency in the optimization and performance assessment of PLS-DA models, because many different diagnostic statistics are currently employed in metabolomics data analyses. In this paper, the properties of four diagnostic statistics of PLS-DA are discussed: the number of misclassifications (NMC), the Area Under the Receiver Operating Characteristic curve (AUROC), Q² and Discriminant Q² (DQ²). All four diagnostic statistics are used in the optimization and performance assessment of PLS-DA models of three metabolomics data sets of different sizes, obtained with two different types of analytical platforms and with different levels of known differences between two groups: control and case. The statistical significance of the obtained PLS-DA models was evaluated with permutation testing. PLS-DA models obtained with NMC and AUROC are more powerful in detecting very small differences between groups than models obtained with Q² and DQ². The reproducibility of the PLS-DA model outcomes, model complexity and permutation test distributions are also investigated to explain this phenomenon. DQ² and Q² (in contrast to NMC and AUROC) prefer PLS-DA models with lower complexity and require a higher number of permutation tests and submodels to accurately estimate the statistical significance of model performance. NMC and AUROC seem to be more efficient and more reliable diagnostic statistics and should be recommended in two-group discrimination metabolomics studies.
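To make the four diagnostic statistics concrete, the following is a minimal sketch in Python (NumPy only). It is not taken from the paper and rests on several assumptions: the class labels are coded 0 (control) and 1 (case), the NMC decision threshold is 0.5, and the DQ² definition is the commonly cited one in which predictions that overshoot their own class label (below 0 for controls, above 1 for cases) contribute no residual. The function names and the permutation p-value helper are illustrative. In practice `y_hat` would be the cross-validated PLS-DA predictions.

```python
import numpy as np


def nmc(y, y_hat, threshold=0.5):
    """Number of misclassifications, assuming 0/1 labels and a fixed threshold."""
    y = np.asarray(y)
    predicted_class = (np.asarray(y_hat, dtype=float) > threshold).astype(int)
    return int(np.sum(predicted_class != y))


def auroc(y, y_hat):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    y = np.asarray(y)
    y_hat = np.asarray(y_hat, dtype=float)
    pos, neg = y_hat[y == 1], y_hat[y == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))


def q2(y, y_hat):
    """Q2 = 1 - PRESS / TSS, with TSS taken around the mean of y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    press = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1.0 - press / tss)


def dq2(y, y_hat):
    """Discriminant Q2: like Q2, but overshoot beyond the class label is not penalized."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Cases (label 1): cap predictions at 1. Controls (label 0): floor them at 0.
    clipped = np.where(y == 1, np.minimum(y_hat, 1.0), np.maximum(y_hat, 0.0))
    pressd = np.sum((y - clipped) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1.0 - pressd / tss)


def permutation_p_value(observed, permuted, higher_is_better=True):
    """Share of permuted statistics at least as good as the observed one.

    The +1 terms count the observed value as one member of the null set.
    Use higher_is_better=False for NMC, where smaller values are better.
    """
    permuted = np.asarray(permuted, dtype=float)
    if higher_is_better:
        as_good = np.sum(permuted >= observed)
    else:
        as_good = np.sum(permuted <= observed)
    return float((as_good + 1) / (permuted.size + 1))


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    y = np.array([0, 0, 0, 1, 1, 1])
    y_hat = np.array([-0.2, 0.1, 0.6, 0.4, 0.9, 1.3])
    print("NMC  :", nmc(y, y_hat))
    print("AUROC:", auroc(y, y_hat))
    print("Q2   :", q2(y, y_hat))
    print("DQ2  :", dq2(y, y_hat))
```

Under these definitions DQ² can never be smaller than Q² for the same predictions, because clipping only removes residual contributions. NMC and AUROC depend only on the thresholded class or the ranking of the predictions, whereas Q² and DQ² depend on their numerical values.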