Understanding the impact of heteroscedasticity on the predictive ability of modern regression methods
As the size and complexity of modern data sets grow, more and more prediction methods are developed. Despite the growing sophistication of these methods, there is not a well-developed literature on how heteroscedasticity affects modern regression methods. We aim to understand the impact of heteroscedasticity on the predictive ability of modern regression methods. We accomplish this by reviewing the visualization and diagnosis of heteroscedasticity, as well as developing a measure for quantifying it. These methods are applied to 42 real data sets in order to understand the prevalence and magnitude of heteroscedasticity ``typical'' of real data. We use the knowledge from this analysis to develop a simulation study that explores the predictive ability of nine regression methods. We vary a number of factors to determine how they influence prediction accuracy in conjunction with, and separately from, heteroscedasticity. These factors include data linearity, the number of explanatory variables, the proportion of unimportant explanatory variables, and the signal-to-noise ratio. We compare prediction accuracy with and without a variance-stabilizing log-transformation. The predictive ability of each method is compared using the mean squared error, which is a popular measure of regression accuracy, and the mean absolute standardized deviation, a measure that accounts for the potential for heteroscedasticity.
Keywords: Heteroscedasticity; regression; regression trees; random forests; Bayesian additive regression trees; artificial neural networks; multivariate adaptive regression splines
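To make the two accuracy measures named in the abstract concrete, the following is a minimal sketch, not the study's implementation. The abstract does not define the mean absolute standardized deviation, so the function `masd` below assumes one plausible reading: the mean of the absolute residuals, each divided by the noise standard deviation at that observation. The simulated data, the linear fit, and all names (`mse`, `masd`, `sigma`) are illustrative assumptions.

```python
import numpy as np


def mse(y, y_hat):
    """Mean squared error: the average squared residual."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))


def masd(y, y_hat, sigma):
    """ASSUMED definition (not given in the abstract): the mean of
    |residual| / sigma_i, where sigma_i is the noise standard deviation
    at observation i. High-variance points are down-weighted."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(np.mean(np.abs(y - y_hat) / sigma))


# Simulate heteroscedastic data: the noise sd grows linearly with x.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(1.0, 10.0, size=n)
sigma = 0.2 * x
y = 2.0 + 3.0 * x + rng.normal(0.0, sigma)

# Ordinary least squares fit as a stand-in for any regression method.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

print("MSE :", mse(y, y_hat))
print("MASD:", masd(y, y_hat, sigma))
```

Under this assumed definition, the MSE is dominated by observations with large `x`, where the noise is largest, while the standardized measure weights every observation on a comparable scale. In a simulation the true `sigma` is known; with real data it would have to be estimated.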