A supervised clustering MCMC methodology for large categorical feature spaces
- 1 June 2021
- journal article
- research article
- Published by SAGE Publications in Statistical Methods in Medical Research
- Vol. 30 (7), 1708-1724
- https://doi.org/10.1177/09622802211009258
Abstract
There is a well-established tradition within the statistics literature that explores different techniques for reducing the dimensionality of large feature spaces. The problem is central to machine learning and has been explored largely under the unsupervised learning paradigm. We introduce a supervised clustering methodology that uses a Metropolis-Hastings algorithm to optimize the partition structure of a large categorical feature space, with the aim of minimizing the test error of a learning algorithm. This is a general methodology that can be applied to any supervised learning problem with a large categorical feature space. We show the benefits of the algorithm by applying it to the problem of risk adjustment in competitive health insurance markets. We use a large claims data set that records ICD-10 codes, which form a large categorical feature space. We aim to improve risk adjustment by clustering diagnostic codes into risk groups suitable for health expenditure prediction. We test the performance of our methodology against common alternatives using panel data from a representative sample of twenty-three million citizens in the Colombian Healthcare System. Our methodology outperforms common alternatives, which suggests that it has the potential to improve risk adjustment.
This publication has 14 references indexed in Scilit.
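The abstract describes a Metropolis-Hastings search over partitions of a categorical feature space, scored by the test error of a learner. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration, not the authors' implementation: the learner is a per-cluster mean predictor, the proposal moves one randomly chosen code to a randomly chosen cluster (a symmetric proposal), the target is proportional to exp(-error / temperature), and the names `validation_mse`, `mh_partition_search`, `n_clusters` and `temperature` are hypothetical.

```python
import math
import random


def validation_mse(assign, train, valid):
    """Fit a per-cluster mean on `train`; return mean squared error on `valid`.

    `assign` maps each categorical code to a cluster index; `train` and
    `valid` are lists of (code, outcome) pairs.
    """
    sums, counts = {}, {}
    for code, y in train:
        g = assign[code]
        sums[g] = sums.get(g, 0.0) + y
        counts[g] = counts.get(g, 0) + 1
    overall = sum(y for _, y in train) / len(train)
    err = 0.0
    for code, y in valid:
        g = assign[code]
        # Fall back to the overall mean for clusters unseen in training.
        pred = sums[g] / counts[g] if g in counts else overall
        err += (y - pred) ** 2
    return err / len(valid)


def mh_partition_search(codes, train, valid, n_clusters,
                        n_iter=2000, temperature=0.05, seed=0):
    """Metropolis-Hastings over code-to-cluster assignments (illustrative)."""
    rng = random.Random(seed)
    assign = {c: rng.randrange(n_clusters) for c in codes}
    current = validation_mse(assign, train, valid)
    best_assign, best = dict(assign), current
    for _ in range(n_iter):
        code = rng.choice(codes)
        old, new = assign[code], rng.randrange(n_clusters)
        if new == old:
            continue  # proposal leaves the partition unchanged
        assign[code] = new
        proposed = validation_mse(assign, train, valid)
        # Symmetric proposal: the acceptance ratio is exp(-(delta) / T).
        delta = proposed - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = proposed
            if current < best:
                best_assign, best = dict(assign), current
        else:
            assign[code] = old  # reject: restore the previous assignment
    return best_assign, best


if __name__ == "__main__":
    # Toy data with hypothetical codes: odd-numbered codes are "high cost".
    rng = random.Random(1)
    codes = [f"code_{i}" for i in range(8)]

    def draw(n):
        rows = []
        for _ in range(n):
            c = rng.choice(codes)
            mean = 10.0 if int(c.split("_")[1]) % 2 else 1.0
            rows.append((c, mean + rng.gauss(0, 0.1)))
        return rows

    groups, err = mh_partition_search(codes, draw(300), draw(300), n_clusters=2)
    print(groups, err)
```

Scoring on a held-out set rather than on the training set is what makes the clustering supervised in the sense of the abstract. A lower `temperature` makes the chain greedier, while a higher one lets it accept more error-increasing moves and explore further.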