Simultaneous heteroscedasticity and heavy tails in a regression model

Question

I'm trying to create a prediction model using regression. These are the diagnostic plots for the model that I get from using lm() in R:

[Figure: diagnostic plots from R]

What I read from the Q-Q plot is that the residuals have a heavy-tailed distribution, and the Residuals vs Fitted plot seems to suggest that the variance of the residuals is not constant. I can tame the heavy tails of the residuals by using a robust model:

fitRobust = rlm(formula, method = "MM", data = myData)

But that's where things come to a stop. The robust model assigns a weight of 0 to several points. After I remove those points, this is what the residuals and the fitted values of the robust model look like:

[Figure: Residuals vs Fitted for the robust model]

The heteroscedasticity seems to be still there. Using

logtrans(model, alpha)

from the MASS package, I tried to find an $\alpha$ such that

rlm(formula, method = "MM")

with formula being $\log(Y + \alpha) \sim X_1+\cdots+X_n$ has residuals with constant variance. Once I find the $\alpha$, the resulting robust model obtained for the above formula has the following Residuals vs Fitted plot:

[Figure: Residuals vs Fitted for log-transformed response]

It looks to me as if the residuals still do not have constant variance. I've tried other transformations of response (including Box-Cox), but they don't seem like an improvement either. I am not even sure that the second stage of what I'm doing (i.e. finding a transformation of the response in a robust model) is supported by any theory. I'd very much appreciate any comments, thoughts, or suggestions.

I think you're being a bit picky about the non-constant variance. It appears ok to me. What is the purpose of the regression? Explanation/hypothesis testing or prediction? — probabilityislogic, Sep 25 '12 at 07:38
@probabilityislogic, thank you for your comment. I very much appreciate it. My goal is prediction. You're right. I'm probably being too picky. Is there a measure for heteroscedasticity that I can look at? I thought of plotting variance vs fitted values but there aren't many points for each predicted value to calculate variance. I'm also curious to understand what is the solution to this problem in general. Are Box-Cox and log transforms applicable to robust models as well? — user765195, Sep 25 '12 at 12:32
You can do pairwise test for equality of variances using the F test for a model with Gaussian error terms or if they have a non-Gaussian distribution there are robust tests for dispersion such as Levene's test. — Michael R. Chernick, Sep 25 '12 at 15:30
Thank you @MichaelChernick. I very much appreciate your comment. I finally used Koenker's generalization of Breusch-Pagan's test for heteroscedasticity as implemented in lmtest package in R (http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/lmtest/html/bptest.html). — user765195, Sep 26 '12 at 03:15

AdamO · Answer 1 · 2017-12-22T19:07:36.500

Heteroscedasticity and leptokurtosis are easily conflated in data analysis. Take a data model that generates a Cauchy error term. This meets the criteria for homoscedasticity, since every error shares the same scale, yet the Cauchy distribution has infinite variance. A Cauchy error is a simulator's way of including an outlier-sampling process.

With these heavy-tailed errors, even when you fit the correct mean model, an outlier leads to a large residual. A test of heteroscedasticity has greatly inflated type I error under this model. A Cauchy distribution also has a scale parameter. Generating error terms with a linear increase in scale produces heteroscedastic data, but the power to detect such effects is practically null, so the type II error is inflated as well.
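A claim like this can be checked by simulation for whichever test is in use. The sketch below is an illustration, not part of the original answer: it re-implements the studentized (Koenker) Breusch-Pagan statistic in base R as n times the R-squared from regressing the squared residuals on the regressors, then estimates how often the test rejects at the 5% level when the errors have constant scale. The helper names `koenker_bp` and `rejection_rate`, the sample size, and the number of replications are all assumptions chosen for the example.

```r
# Studentized (Koenker) Breusch-Pagan statistic, re-implemented in base R:
# regress the squared OLS residuals on the regressors and use n * R^2.
koenker_bp <- function(fit) {
  e2   <- residuals(fit)^2
  X    <- model.matrix(fit)
  aux  <- lm.fit(X, e2)
  r2   <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Share of simulated data sets in which the test rejects, when the errors
# are i.i.d. draws from `rerror` (so the scale is constant by construction).
rejection_rate <- function(rerror, n = 100, n_sim = 500, level = 0.05) {
  rejections <- replicate(n_sim, {
    x <- rexp(n)
    y <- 10 * x + rerror(n)
    koenker_bp(lm(y ~ x))[["p.value"]] < level
  })
  mean(rejections)
}

set.seed(1)
rate_normal <- rejection_rate(rnorm)    # light-tailed reference case
rate_cauchy <- rejection_rate(rcauchy)  # heavy-tailed, still homoscedastic
print(c(normal = rate_normal, cauchy = rate_cauchy))
```

Comparing the two rates against the nominal 5% shows how far the heavy tails push this particular test away from its advertised size. Other tests, or other designs for x, may behave differently.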

Let me suggest, then, that the proper data-analytic approach isn't to become mired in tests. Statistical tests are primarily misleading. Nowhere is this more obvious than in tests intended to verify secondary modeling assumptions. They are no substitute for common sense. For your data, you can plainly see two large residuals. Their effect on the trend is minimal, as few if any residuals are offset in a linear departure from the 0 line in the plot of residuals vs. fitted. That is all you need to know.

What is desired, then, is a means of estimating a flexible variance model that will allow you to create prediction intervals over a range of fitted responses. Interestingly, this approach is capable of handling most sane forms of both heteroscedasticity and kurtosis. Why not, then, use a smoothing spline approach to estimate the mean squared error?

Take the following example:

set.seed(123)
x <- sort(rexp(100))
y <- rcauchy(100, 10*x)  # Cauchy errors centered on 10*x

f <- lm(y ~ x)
p <- predict(f)
r <- residuals(f)^2

# smooth the squared residuals against the fitted values
s <- smooth.spline(x=p, y=r)
v <- predict(s, p)$y  # estimated variance, aligned with p

# sqrt() returns NaN wherever the spline dips below zero
phi <- p + 1.96*sqrt(v)
plo <- p - 1.96*sqrt(v)

par(mfrow=c(2,1))
plot(p, r, xlab='Fitted', ylab='Squared-residuals')
lines(s, col='red')
legend('topleft', lty=1, col='red', legend="predicted variance")

plot(x, y, ylim=range(c(plo, phi), na.rm=TRUE))
abline(f, col='red')
lines(x, plo, col='red', lty=2)
lines(x, phi, col='red', lty=2)

This gives a prediction interval that "widens up" to accommodate the outlier. It is still a consistent estimator of the variance and usefully tells people, "Hey, there's this big, wonky observation around X=4 and we can't predict values very usefully there."
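Since the stated goal is prediction, the same idea can be wrapped so that it returns bands at new predictor values. The following is a minimal sketch under stated assumptions rather than a definitive implementation. The function name `spline_variance_bands` is hypothetical, and the `pmax()` clamp at zero is an addition not present in the example above, included because a smoothing spline fitted to squared residuals can dip below zero. The 1.96 multiplier is carried over from the example and is only a rough choice when the errors are far from normal.

```r
# Hypothetical helper: fit the mean by OLS, smooth the squared residuals
# against the fitted values, and return approximate bands for new data.
spline_variance_bands <- function(formula, data, newdata, z = 1.96) {
  fit  <- lm(formula, data = data)
  vfit <- smooth.spline(x = fitted(fit), y = residuals(fit)^2)
  mu   <- as.numeric(predict(fit, newdata = newdata))
  # Assumption: clamp negative variance estimates to zero instead of
  # letting sqrt() return NaN.
  v    <- pmax(predict(vfit, mu)$y, 0)
  data.frame(fit = mu, lwr = mu - z * sqrt(v), upr = mu + z * sqrt(v))
}

# Same simulated data as in the example above
set.seed(123)
d <- data.frame(x = sort(rexp(100)))
d$y <- rcauchy(100, 10 * d$x)

bands <- spline_variance_bands(y ~ x, d, newdata = data.frame(x = c(0.5, 1, 2)))
print(bands)
```

New predictor values whose fitted mean falls outside the range of the original fitted values rely on the spline's extrapolation, so bands there should be treated with extra caution.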

Would this work for other types of lms, such as gls? — user2974951, Feb 27 '20 at 08:13
