I am looking for a simple method that captures the relationship between predictor variables and the variance of the outcome. As a simple reproducible example, consider this code, where I intentionally scale the standard deviation of the response variable according to the value of a single predictor:
set.seed(1)
df <- data.frame(y = c(rnorm(50000), 2*rnorm(50000), 3*rnorm(50000)),
                 x = c(rep('a', 50000), rep('b', 50000), rep('c', 50000)))
fit <- lm(y ~ x, data = df)
head(predict(fit, newdata=df[df$x=='a',], interval = 'predict'),1)
# fit lwr upr
# 1 -0.002440456 -4.248412 4.243531
head(predict(fit, newdata=df[df$x=='b',], interval = 'predict'),1)
# fit lwr upr
# 50001 -0.004095422 -4.250067 4.241876
head(predict(fit, newdata=df[df$x=='c',], interval = 'predict'),1)
# fit lwr upr
# 100001 0.01299273 -4.232979 4.258964
In my data above, the variance of y depends on the value of x. I was thinking the prediction interval might capture this relationship, but clearly it does not: the interval has essentially the same width regardless of the value of x.
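As a quick check that the heteroscedasticity really is there in the data (and is just not reflected in the intervals), the within-group standard deviations come out close to 1, 2 and 3 by construction:

# empirical SD of y within each level of x
tapply(df$y, df$x, sd)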
My two questions are:
What is the statistical reasoning behind the prediction interval that explains why its width is the same for all observations? (My rough guess at the relevant formula is below.)
Are there simple methods that could surface this relationship for me? That is, when the variance of y depends on the values of several predictors, what are recommended methods for discovering those relationships in a multivariate way? Ideally, I could end up with a model where I feed in any set of predictor values and get back a prediction interval specific to those values (a rough sketch of what I mean is below).
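To make question 1 concrete: my (possibly mistaken) understanding is that the OLS prediction interval at a new point $x_0$ is
$$\hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\,\hat{\sigma}\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0},$$
with a single pooled residual variance estimate $\hat{\sigma}^2 = \frac{1}{n-p}\sum_i \hat{e}_i^2$. If that is right, the single pooled $\hat{\sigma}$ would explain the nearly constant width, but I would appreciate confirmation or correction on that point.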
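For question 2, here is a rough sketch of the kind of workflow I am imagining, using nlme::gls with a varIdent() variance structure purely as a guess at one possible tool (it may well be the wrong one, which is part of what I am asking):

library(nlme)

# One guess at a model that lets the residual variance differ by level of x
fit_gls <- gls(y ~ x, data = df, weights = varIdent(form = ~ 1 | x))

# Estimated residual SD multipliers per level of x (reference level = 1),
# times the overall sigma, giving a per-group residual SD
rel_sd <- coef(fit_gls$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)
group_sd <- fit_gls$sigma * rel_sd   # print rel_sd first to check the level names

# Rough 95% prediction intervals that now vary with x
# (large-sample normal approximation; ignores uncertainty in the fitted means)
newdat <- data.frame(x = c('a', 'b', 'c'))
pred <- predict(fit_gls, newdata = newdat)
cbind(newdat,
      fit = pred,
      lwr = pred - 1.96 * group_sd[as.character(newdat$x)],
      upr = pred + 1.96 * group_sd[as.character(newdat$x)])

Is something along these lines a reasonable direction, or is there a more standard approach, especially one that extends to several predictors (including continuous ones)?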