A supervised clustering MCMC methodology for large categorical feature spaces

1 June 2021

journal article
research article
Published by SAGE Publications in Statistical Methods in Medical Research

Vol. 30 (7), 1708-1724
https://doi.org/10.1177/09622802211009258

Abstract

There is a well-established tradition within the statistics literature that explores different techniques for reducing the dimensionality of large feature spaces. The problem is central to machine learning and has mostly been explored under the unsupervised learning paradigm. We introduce a supervised clustering methodology that uses a Metropolis-Hastings algorithm to optimize the partition structure of a large categorical feature space, with the partition tailored to minimize the test error of a learning algorithm. This is a general methodology that can be applied to any supervised learning problem with a large categorical feature space. We show the benefits of the algorithm by applying it to the problem of risk adjustment in competitive health insurance markets. We use a large claims data set that records ICD-10 codes, which form a large categorical feature space. We aim to improve risk adjustment by clustering diagnostic codes into risk groups suitable for health expenditure prediction. We test the performance of our methodology against common alternatives using panel data from a representative sample of twenty-three million citizens in the Colombian healthcare system. Our methodology outperforms common alternatives, and the results suggest that it has the potential to improve risk adjustment.
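The abstract does not spell out the proposal distribution, the target distribution, or the learning algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a single categorical feature whose levels are assigned to a fixed number of groups, a group-mean predictor as the learning algorithm, a symmetric proposal that moves one level to another group, and a target proportional to exp(-error / temperature). The function names, the `temperature` parameter, and the synthetic data are all hypothetical.

```python
import math
import random


def validation_mse(assign, train, valid, n_groups):
    """Fit a group-mean predictor on train and return its MSE on valid."""
    sums = [0.0] * n_groups
    counts = [0] * n_groups
    for level, y in train:
        g = assign[level]
        sums[g] += y
        counts[g] += 1
    overall = sum(y for _, y in train) / len(train)
    # Empty groups fall back to the overall training mean.
    means = [sums[g] / counts[g] if counts[g] else overall
             for g in range(n_groups)]
    return sum((y - means[assign[level]]) ** 2
               for level, y in valid) / len(valid)


def mh_partition_search(train, valid, n_levels, n_groups,
                        n_iter=2000, temperature=0.05, seed=0):
    """Metropolis-Hastings walk over partitions (requires n_groups >= 2)."""
    rng = random.Random(seed)
    current = [rng.randrange(n_groups) for _ in range(n_levels)]
    current_err = validation_mse(current, train, valid, n_groups)
    best, best_err = list(current), current_err
    for _ in range(n_iter):
        # Symmetric proposal: move one random level to a different group.
        level = rng.randrange(n_levels)
        new_group = (current[level] + rng.randrange(1, n_groups)) % n_groups
        proposal = list(current)
        proposal[level] = new_group
        prop_err = validation_mse(proposal, train, valid, n_groups)
        # Accept with probability min(1, exp(-(new - old) / temperature)).
        accept = prop_err <= current_err or rng.random() < math.exp(
            (current_err - prop_err) / temperature)
        if accept:
            current, current_err = proposal, prop_err
            if current_err < best_err:
                best, best_err = list(current), current_err
    return best, best_err


def make_data(n, n_levels, rng):
    """Synthetic data (assumption): level k has true mean 5 * (k % 3)."""
    levels = [rng.randrange(n_levels) for _ in range(n)]
    return [(lvl, 5.0 * (lvl % 3) + rng.gauss(0.0, 1.0)) for lvl in levels]


data_rng = random.Random(42)
train = make_data(600, 12, data_rng)
valid = make_data(300, 12, data_rng)
best, best_err = mh_partition_search(train, valid, n_levels=12, n_groups=3)
print("best partition:", best)
print("validation MSE:", best_err)
```

Because the proposal is symmetric, the Hastings correction cancels and only the change in held-out error enters the acceptance ratio. In practice the error used for the search should come from data separate from the final test set, so that the reported test error is not optimistically biased.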
