Is the variation in the residual standard deviation (on sample) accounted for when one builds a prediction interval (PI)?

Question

This question is somehow related to Is the residual, e, an estimator of the error, $\epsilon$?

I also found some information here: Confidence interval of RMSE

Let's say, I got a model that explains sampled X by mean(X).

Using suggestions from Mr. @whuber I reproduced calculations of the PI from predict.lm in R.

## lm PI

x_dat <- data.frame(x = rnorm(100, 0, 1))

lm_model <- lm(data = x_dat, x ~ 1)

summary(lm_model)

lm_preds <- predict.lm(lm_model
                       , x_dat
                       , interval = "prediction"
                       , level = 0.95
                       )

beta.hat <- lm_model$coefficients

# var-covariance matrix

V <- vcov(lm_model)

# mean prediction

Xp <- model.matrix(~ 1, x_dat)
pred <- as.numeric(Xp %*% beta.hat)[1]

# mean prediction variance:

pred_var <- unname(rowSums((Xp %*% V) * Xp))[1]

# confidence

t_stat <- qt((1 - 0.95)/2, df = lm_model$df.residual)

# residual MSE

res_mse <- sum(lm_model$residuals ^ 2) / lm_model$df.residual

# PI

PI <- pred + c(t_stat, -t_stat) * sqrt(pred_var + res_mse)

print(PI)

print(lm_preds[1, ])

> print(PI)
[1] -2.043175  2.572508
> 
> print(lm_preds[1, ])
       fit        lwr        upr 
 0.2646666 -2.0431747  2.5725079

I only have 2 questions.

Is it right that we assume that the true model error variance is that of sampled residual variance in order to make an unbiased estimate?

Given that we don't know what the true error variance is, does it mean we make a biased estimate of PI, in particular, by not adjusting for an additional dispersion of the residual variance? If so, can we assume that variance of the residual variance follows chi-square distribution in order to get an upper quantile of the value that can be supplied inside for an exact calculation?

predict.lm(...,
        pred.var = 
)

Let me add a clarification to my question. Assume my sample random variable X is strongly not normal. I use the sample mean of X to model X, assuming I want to understand the X on a population. Do I immediately stumble upon the problem with calculating the prediction interval, since my sample residuals follow the same non-normal distribution with zero mean? Do I face the lack of math to make a valid estimate of the dispersion of X given X-hat if the residuals are heavily skewed? If residuals are normal, is the prediction interval besomes effectively the estimation of quantiles of X given X-hat — Alexey Burnakov, Mar 15 '18 at 14:08
The confidence interval for $y(x)$ is a function of the sample. As such, it "includes variation in" every conceivable statistic computed from the sample. In light of this could you clarify your question? Your comment sounds so different from the question as posted that it would behoove you to modify the question substantially. — whuber, Mar 15 '18 at 14:08
@whuber, I have just added a clarifying comment. I mean the prediction interval, not the CI of y-hat. — Alexey Burnakov, Mar 15 '18 at 14:09
Either way, the comment seems to muddy the situation rather than clarify it. The PI is a statistic, just as the CI of $y(x)$ is. — whuber, Mar 15 '18 at 14:11
@whuber, OK, does prediction interval of y obtained on a limited-size sample contain a dispersion in the residual RMSE w.r.t. the model's errors on population? I tried to explain I wanted to understand that when we incorporate the residual RMSE on sample, does it mean we account for the fact it is an estimate of the expected standard deviation of the model's errors on population, and thus the PI we calculate is "wider"? — Alexey Burnakov, Mar 15 '18 at 14:20
Can I say that upper bound of PI = sample model estimate (mean) + SE of mean * t + (residual RMSE + RMSE dispersion) * t? Sorry for my barbaric language — Alexey Burnakov, Mar 15 '18 at 14:24
Why not just consult the formula? It shows you explicitly what is involved. — whuber, Mar 15 '18 at 14:39
@whuber, after reading help for **predict: **"The prediction intervals are for a single observation at each case in newdata (or by default, the data used for the fit) with error variance(s) pred.var. This can be a multiple of res.var, the estimated value of σ^2: the default is to assume that future observations have the same error variance as those used for fitting."** It seems that model's error variance is assumed to match residual variance. I guess that **pred.var** should be supplied for unbiased estimate. Question was what to do we do if we had no idea about this value? — Alexey Burnakov, Mar 15 '18 at 15:03
Might I suggest that the `R` help would be one of the last places you want to rely on for information about statistical procedures? Consider searching our site instead: https://stats.stackexchange.com/search?q=regression+prediction+interval. — whuber, Mar 15 '18 at 16:04
@whuber, thank you. I did lots of search on this site. Let me prepare a more detailed explanation. I think I got the idea, just need an expert verification. — Alexey Burnakov, Mar 15 '18 at 16:10
Here's a search that will turn up the relevant formulas in the top hits: https://stats.stackexchange.com/search?tab=votes&q=regression%20prediction%20interval%20formula. — whuber, Mar 15 '18 at 16:15
Your language belies a misunderstanding about statistical intervals, so you might want to focus on this. Intervals aren't "estimating" any properties. The concept of "bias" therefore scarcely applies. The defining characteristic of a prediction interval $\mathcal{I}(Y)$ for a random variable $X$ based on random data $Y$ is that $\Pr(X\in\mathcal{I}(Y))$ should equal a specified value. No estimate of any property of the probability law of $(X,Y)$ is necessarily involved--and even when estimates are used to construct PIs, what we care about is that the PIs satisfy this defining characteristic. — whuber, Mar 15 '18 at 17:13
@whuber, **"No estimate of any property of the probability law of (X,Y) is necessarily involved..."** I understand that. What I wanted to double check is that given that residual variance is an estimate of error variance, how come we accurately define PI for the **new** X coming from unseen population? **"...--and even when estimates are used to construct PIs, what we care about is that the PIs satisfy this defining characteristic."** Is not P(X belongs to interval around X-hat) overestimated if population variance of error is only estimated? — Alexey Burnakov, Mar 15 '18 at 17:32
It means I should revise my statistical reasoning. Thank you! — Alexey Burnakov, Mar 15 '18 at 19:20
true model error variance is that of sampled residual variance in order to make an unbiased estimate? If I understand the question correctly then no. The error observed will be a blend of error sources including sampling error as you note, but also measurement error (both in predictors and in reference value, and which are themselves a blend of many sources of error). This means that the residual will be a compound of true model error plus errors of sampling and measurement. The residuals will help estimate measurement errors, you need sample reps to predict sampling variance. — ReneBt, Mar 16 '18 at 09:57
@ReneBt, thank you. Does it mean intuitively that the sampling variance in model's parameters (mean, slope) help increase the width of the prediction interval such that is matches the true interval generated by the true model errors? — Alexey Burnakov, Mar 16 '18 at 10:11
PI width should reflect the sum total of all variances under the condition that any future predictions follow exactly the same sampling and analysis protocols, with same measurement error achieved. This is to say that it should approximate the true interval in the same way as experimentally calculated values approximate the true underlying characteristics of a population. If no sources of variation are biased then PI should be unbiased estimate of true model error. However, if any are biases this will not apply (e.g. if volume available to sample is partially correlated with true value of Y). — ReneBt, Mar 16 '18 at 10:37
@ReneBt, what if my residuals are not normally shaped? Imagine any sample distribution which is strongly not normal, and the residuals are differences from the sample mean... Can I go on with PI calculation? BTW, you can give a longer **answer** to it. — Alexey Burnakov, Mar 16 '18 at 10:43

Gregg H · Accepted Answer · 2018-03-22T10:32:56.380

Referencing the initial post, for Q1: yes, the error variance is estimated (without bias if the model is correct) by an appropriate adjustment to the sample residuals. The bias is corrected with division by a value smaller than $df=n-1$, based on the parameters in the model. (E.g., this is a question students ask a lot: ¿why do we have to divide by $n-2$ instead of $n-1$ for the variance/standard deviation of the residuals for a bivariate model?) Thus, we have a “reasonable” estimate for the variance

for Q2: First, if there were bias, then the entire interval would be shifted by the bias...so adjustment via chi-square bounds probably is not appropriate. Second, to account for the amount of uncertainty, the prediction interval (PI) is slightly elongated. Again, this references back to the smaller df used to obtain the critical value for this margin of error.

I hope this answers the query...happy to provide more detail if it seems appropriate.

Addendum #1
The standard error for a predicted value from a regression (using bivariate example here) is $$\hat\sigma \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{(n-1)s_x^2}}$$ where $\hat\sigma^2$ is the estimated residual variance, $n$ is the sample size, $x_0$ is the value at which you want the prediction, and $s_x^2$ is the variance of the $x$ values. But to clarify, this is the standard error for the predicted value $\hat{y}$ at $x_0$. As mentioned in Kleinbaum et al. (2014), this can be used to calculate the confidence interval for the parameter (which in this case is the average predicted value at $x_0$). To account for the fact that we often have to use an estimate for $\sigma$ ($\hat\sigma$ in this case), to obtain the margin of error, we multiple this by a slightly larger multiplication factor to accommodate that extra element of uncertainty. (In this case, we use a smaller $df=n-2$ which in turn makes $t_{c.v.}$ slightly larger.)

When we move to the next level of estimation, and ask for an interval in which we might observe values of $y$ at $x_0$, then we need to account for the variance of the parameter estimate and the variance of the distribution of errors. However, by doing so, we are no longer talking about a confidence interval, as we are talking about a random variable and not a parameter; we call this a prediction interval. Essentially, the variance used in calculating the “margin of error” would be the sum of the variance associated with the parameter estimate and the variance associated with the residuals. This is captured by adding a one under the radical: $$\hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{(n-1)s_x^2}}$$ Conceptually, the 1 “adds” the estimate for $\sigma$ to the standard error.

Could you, please, elaborate more on the connection between the variance in linear combination of inputs multiplied by model weights, and the broadening of the PI? On population we don't have this source of variance. We only have the model error variance. While on a sample we have this source of variance (sampling variance) atop of the residual variance. So it comes out in the end that the CI of the prediction (not counting residuals) needs to be taken in the account when talk about the best estimate for the PI (which is related to model errors only). — Alexey Burnakov, Mar 21 '18 at 15:37
I'll attempt to address this in Addendum #1 to my answer above. — Gregg H, Mar 21 '18 at 21:21
Full reference for Addendum #1: Kleinbaum, Kupper, Nizam & Rosenberg (2014). Applied regression analysis and other multivariable methods, 5th ed. Cengage. A more detailed explanation can be found on pp.69ff. — Gregg H, Mar 21 '18 at 21:42
thank you a lot. In the last equation, should not the σ be σ-hat? And also I wonder if this equation is for the intercept-only case, as there is no other coefficients involved? — Alexey Burnakov, Mar 22 '18 at 10:02
Good catch...that would be an estimated value (with hat). This example is for a bivariate (single variable, with slope & intercept). Happy to provide the matrix form for multiple regression models. — Gregg H, Mar 22 '18 at 10:34
thank you. I will try to apply this equation to my dummy data. I try to simplify model by using intercept only. — Alexey Burnakov, Mar 22 '18 at 10:46

score 0 · Answer 2 · answered Mar 16 '18 at 10:23

I think the truth is that I thought incorrectly.

The following simulation shows that under normal error assumption (normal X actually) the proportion of examples falling out of a 95% PI on a population is not different from 5%.

If I make something stupid, please correct / post an answer.

start_time <- Sys.time()

popul <- 1000000
sam_size <- 25
sim_num <- 10000

big_x <- rnorm(popul, 0, 1)

outofpi <- numeric()

for(i in 1:sim_num)
{

     sam_ind <- sample(popul, sam_size, replace = F)

     x_dat <- data.frame(x = big_x[sam_ind])

     lm_model <- lm(data = x_dat, x ~ 1)

     lm_preds <- predict.lm(lm_model
                            , x_dat
                            , interval = "prediction"
                            , level = 0.95
                            )[1, ]

     newd <- big_x[-sam_ind]

     outofpi[i] <- length(newd[newd < lm_preds[2] | newd > lm_preds[3]]) / length(newd)

}

Sys.time() - start_time

hist(outofpi, breaks = 'fd')

t.test(x = outofpi
       , alternative = "greater"
       , mu = 0.05
       , conf.level = 0.95
       )


    One Sample t-test

data:  outofpi
t = 0.54362, df = 9999, p-value = 0.2934
alternative hypothesis: true mean is greater than 0.05
95 percent confidence interval:
 0.04961823        Inf
sample estimates:
 mean of x 
0.05018844

Is the variation in the residual standard deviation (on sample) accounted for when one builds a prediction interval (PI)?

2 Answers2