Phenotype prediction based on genome-wide DNA methylation data
Open Access
- 17 June 2014
- journal article
- Published by Springer Science and Business Media LLC in BMC Bioinformatics
- Vol. 15 (1), 193
- https://doi.org/10.1186/1471-2105-15-193
Abstract
DNA methylation (DNAm) has important regulatory roles in many biological processes and diseases. It is the only epigenetic mark with a clear mechanism of mitotic inheritance and the only one easily available on a genome scale. Aberrant cytosine-phosphate-guanine (CpG) methylation has been discussed in the context of disease aetiology, especially cancer. CpG hypermethylation of promoter regions is often associated with silencing of tumour suppressor genes, and hypomethylation with activation of oncogenes.

Supervised principal component analysis (SPCA) is a popular machine learning method. However, in a recent application to phenotype prediction from DNAm data, SPCA was inferior to the specific method EVORA. We present Model-Selection-SPCA (MS-SPCA), an enhanced version of SPCA. MS-SPCA applies several models that perform well in the training data to the test data and selects the very best models for final prediction based on parameters of the test data.

We have applied MS-SPCA for phenotype prediction from genome-wide DNAm data. CpGs used for prediction are selected based on the quantification of three features of their methylation (average methylation difference, methylation variation difference and methylation-age correlation). We analysed four independent case-control datasets that correspond to different stages of cervical cancer: (i) cases that are currently cytologically normal but will later develop neoplastic transformations, (ii, iii) cases showing neoplastic transformations and (iv) cases with confirmed cancer. The first dataset was split into several smaller case-control datasets (samples either human papillomavirus (HPV) positive or negative). We demonstrate that cytologically normal HPV+ and HPV- samples contain DNAm patterns which are associated with later neoplastic transformations. We present evidence that DNAm patterns exist in cytologically normal HPV- samples that (i) predispose to neoplastic transformations after HPV infection and (ii) predispose to HPV infection itself.

MS-SPCA performs significantly better than EVORA. MS-SPCA can be applied to many classification problems. Additional improvements could include usage of more than one principal component (PC), with automatic selection of the optimal number of PCs. We expect that MS-SPCA will be useful for analysing recent larger DNAm data to predict future neoplastic transformations.
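To make the general idea concrete, the following is a minimal sketch of a plain single-component SPCA-style classifier, not the authors' MS-SPCA. It assumes a samples-by-CpGs matrix of methylation values and a 0/1 case label, ranks CpGs by only one of the three features named in the abstract (average methylation difference), and scores samples on the first principal component of the selected CpGs. The model-selection step that distinguishes MS-SPCA is not reproduced. All function names, parameters and the synthetic data are hypothetical and chosen for illustration only.

```python
import numpy as np


def make_toy_data(seed, n_per_class=30, n_cpgs=200, n_informative=10):
    """Synthetic stand-in for methylation data (assumption, not real DNAm)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(0.5, 0.05, size=(2 * n_per_class, n_cpgs))
    # Shift the first few CpGs upwards in cases only.
    X[y == 1, :n_informative] += 0.1
    return X, y


def select_features(X, y, k):
    """Indices of the k CpGs with the largest absolute mean difference."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(-np.abs(diff))[:k]


def fit_spca(X, y, k):
    """Select k CpGs, then take the first principal component of them."""
    idx = select_features(X, y, k)
    mu = X[:, idx].mean(axis=0)
    Xc = X[:, idx] - mu
    # Rows of vt are principal directions; vt[0] is the first PC loading.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    w = vt[0]
    # PC sign is arbitrary: orient it so that cases score higher on average.
    scores = Xc @ w
    if scores[y == 1].mean() < scores[y == 0].mean():
        w = -w
    return idx, mu, w


def predict_scores(X, model):
    """Project new samples onto the fitted first principal component."""
    idx, mu, w = model
    return (X[:, idx] - mu) @ w


def auc(scores, y):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    pos, neg = scores[y == 1], scores[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A typical use would fit on one dataset with `fit_spca(X_train, y_train, k=10)` and evaluate `auc(predict_scores(X_test, model), y_test)` on an independent one. Under the abstract's description, MS-SPCA would go further by keeping several such models and choosing among them using properties of the test data.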