I am looking for a simple method that captures the relationship between predictor variables and the variance of the outcome. As a simple reproducible example, consider this code, where I intentionally scale the standard deviation of the response variable according to the value of a single predictor:
set.seed(1)
df <- data.frame(y = c(rnorm(50000), 2*rnorm(50000), 3*rnorm(50000)),
                 x = c(rep('a', 50000), rep('b', 50000), rep('c', 50000)))
fit <- lm(y ~ x, data = df)
head(predict(fit, newdata=df[df$x=='a',], interval = 'predict'),1)
# fit lwr upr
# 1 -0.002440456 -4.248412 4.243531
head(predict(fit, newdata=df[df$x=='b',], interval = 'predict'),1)
# fit lwr upr
# 50001 -0.004095422 -4.250067 4.241876
head(predict(fit, newdata=df[df$x=='c',], interval = 'predict'),1)
# fit lwr upr
# 100001 0.01299273 -4.232979 4.258964
In my data above, the variance of y depends on the value of x. I was thinking the prediction interval might capture this relationship, but clearly it does not: the interval has essentially the same width regardless of the value of x.
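As a quick check that the heteroscedasticity really is there in the data (and is just not reflected in the intervals), the within-group standard deviations come out close to 1, 2 and 3 by construction:

# empirical SD of y within each level of x
tapply(df$y, df$x, sd)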
My two questions are:
What is the statistical reasoning behind the prediction interval that explains why its width is the same for all observations? (My rough guess at the relevant formula is below.)
Are there simple methods that could surface this relationship for me? That is, when the variance of y depends on the values of several predictors, what are recommended methods for discovering those relationships in a multivariate way? Ideally, I could end up with a model where I feed in any set of predictor values and get back a prediction interval specific to those values (a rough sketch of what I mean is below).
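To make question 1 concrete: my (possibly mistaken) understanding is that the OLS prediction interval at a new point $x_0$ is
$$\hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\,\hat{\sigma}\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0},$$
with a single pooled residual variance estimate $\hat{\sigma}^2 = \frac{1}{n-p}\sum_i \hat{e}_i^2$. If that is right, the single pooled $\hat{\sigma}$ would explain the nearly constant width, but I would appreciate confirmation or correction on that point.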
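For question 2, here is a rough sketch of the kind of workflow I am imagining, using nlme::gls with a varIdent() variance structure purely as a guess at one possible tool (it may well be the wrong one, which is part of what I am asking):

library(nlme)

# One guess at a model that lets the residual variance differ by level of x
fit_gls <- gls(y ~ x, data = df, weights = varIdent(form = ~ 1 | x))

# Estimated residual SD multipliers per level of x (reference level = 1),
# times the overall sigma, giving a per-group residual SD
rel_sd <- coef(fit_gls$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)
group_sd <- fit_gls$sigma * rel_sd   # print rel_sd first to check the level names

# Rough 95% prediction intervals that now vary with x
# (large-sample normal approximation; ignores uncertainty in the fitted means)
newdat <- data.frame(x = c('a', 'b', 'c'))
pred <- predict(fit_gls, newdata = newdat)
cbind(newdat,
      fit = pred,
      lwr = pred - 1.96 * group_sd[as.character(newdat$x)],
      upr = pred + 1.96 * group_sd[as.character(newdat$x)])

Is something along these lines a reasonable direction, or is there a more standard approach, especially one that extends to several predictors (including continuous ones)?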