Can We Train Machine Learning Methods to Outperform the High-dimensional Propensity Score Algorithm?
The use of retrospective health care claims datasets is frequently criticized for the lack of complete information on potential confounders. By utilizing patients' health status–related information from claims datasets as surrogates or proxies for mismeasured and unobserved confounders, the high-dimensional propensity score algorithm enables us to reduce bias. Using a previously published cohort study of postmyocardial infarction statin use (1998–2012), we compare the performance of the algorithm with several popular machine learning approaches for confounder selection in high-dimensional covariate spaces: random forest, least absolute shrinkage and selection operator (lasso), and elastic net. Our results suggest that, when the data analysis is done with epidemiologic principles in mind, machine learning methods perform as well as the high-dimensional propensity score algorithm. Using a plasmode simulation framework that mimicked the empirical data, we also show that a hybrid of the machine learning and high-dimensional propensity score algorithms generally performs slightly better than either in terms of mean squared error, when a bias-based analysis is used. This talk is based on joint work with Menglan Pang and Robert W Platt from McGill University [Epidemiology 2018;29(2):191–198]. The talk should be accessible to epidemiologists as well.
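To make the lasso-based confounder selection concrete: the idea is to fit a penalized logistic regression of treatment on a large set of claims-derived covariates, keep the covariates with nonzero coefficients, and use the fitted model's predicted probabilities as propensity scores. The following is only a minimal illustrative sketch on synthetic data, not the pipeline used in the paper; the covariate dimensions, tuning values (`lam`, `lr`), and variable names are all assumptions.

```python
import numpy as np

def fit_l1_logistic(X, y, lam=0.05, lr=0.05, n_iter=2000):
    """L1-penalized logistic regression fit by proximal gradient
    descent (ISTA). Returns (intercept, coefficient vector)."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(n_iter):
        pr = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))  # P(treated | X)
        b0 -= lr * np.mean(pr - y)                # gradient step, intercept
        b = b - lr * (X.T @ (pr - y) / n)         # gradient step, coefficients
        # soft-thresholding enforces the lasso penalty (intercept unpenalized)
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)
    return b0, b

rng = np.random.default_rng(0)
n, p = 2000, 20
X = rng.normal(size=(n, p))          # stand-in for claims-derived covariates
# only the first 3 covariates truly drive treatment assignment
true_b = np.zeros(p)
true_b[:3] = [1.0, -0.8, 0.6]
treat = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_b))))

b0, b = fit_l1_logistic(X, treat)
ps = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))          # estimated propensity scores
# inverse-probability-of-treatment weights for a downstream outcome model
w = np.where(treat == 1, 1.0 / ps, 1.0 / (1.0 - ps))
selected = np.flatnonzero(np.abs(b) > 1e-8)        # lasso-selected covariates
```

In practice one would tune the penalty (e.g. by cross-validation) and, as the abstract stresses, apply epidemiologic judgment about which covariates are plausible confounders rather than relying on the penalty alone.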