
If the best linear approximation (using least squares) of my data points is the line $y=mx+b$, how can I calculate the approximation error? If I compute the standard deviation of the differences between observations and predictions, $e_i=real(x_i)-(mx_i+b)$, can I later say that a real (but not observed) value $y_r=real(x_0)$ belongs to the interval $[y_p-\sigma, y_p+\sigma]$ (where $y_p=mx_0+b$) with probability ~68%, assuming a normal distribution?

To clarify:

I made observations regarding a function $f(x)$ by evaluating it at some points $x_i$. I fit these observations to a line $l(x)=mx+b$. For an $x_0$ that I did not observe, I'd like to know how large $f(x_0)-l(x_0)$ can be. Using the method above, is it correct to say that $f(x_0) \in [l(x_0)-\sigma, l(x_0)+\sigma]$ with probability ~68%?
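To make this concrete, here is a minimal Python sketch of the procedure I have in mind (the data are made up purely for illustration):

```python
import numpy as np

# Made-up observations (x_i, f(x_i)) standing in for real data.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Least-squares line l(x) = m*x + b.
m, b = np.polyfit(x, y, deg=1)

# Residuals e_i = y_i - l(x_i) and their standard deviation.
residuals = y - (m * x + b)
sigma = residuals.std(ddof=2)  # ddof=2: two parameters were estimated

# Naive ~68% interval at an unobserved point x0.
x0 = 4.2
y_pred = m * x0 + b
print(y_pred - sigma, y_pred + sigma)
```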

bmx
  • I think you are asking about prediction intervals. Note, however, that you use "$x_i$" instead of "$y_i$". Is this a typo? We *don't* predict $x$s. – gung - Reinstate Monica Jul 31 '12 at 19:29
  • @gung: I use $x$ to denote, for example, time, and $y$ the value of some variable at that time, so $y=f(x)$ means that I made an observation $y$ at time $x$. I want to know how far the fitting function's predictions can be from the real values of $y$. Does that make sense? The function $real(x_i)$ returns the "correct" value of $y$ at $x_i$, and my data points consist of $\{(x_i, real(x_i))\}$. – bmx Jul 31 '12 at 19:36
  • That seems perfectly reasonable. The parts I'm focusing on are, e.g., "$e_i=real(x_i)-(mx_i+b)$"; usually we think of the errors / residuals in a reg model as "$e_i=y_i-(mx_i+b)$". The SD of the residuals *does* play a role in calculating prediction intervals. It's that "$x_i$" that's weird to me; I'm wondering if it's a typo, or you're asking about something I don't recognize. – gung - Reinstate Monica Jul 31 '12 at 19:43
  • I think I see; I missed your edit. This suggests that the system is perfectly deterministic & if you had access to the *real* underlying function, you could always predict $y_i$ perfectly w/o error. That's not the way we typically think about reg models. – gung - Reinstate Monica Jul 31 '12 at 19:46
  • @gung: Does this question make sense if I'm observing a nondeterministic algorithm, instead of a function? I'm not sure how to reformulate my question so that it makes sense for statisticians :(. I want to estimate how well the linear model approximates the observed thing, assuming that this thing is linear but deviates from a line with a random error. – bmx Jul 31 '12 at 19:56
  • I think you may have a fundamental misunderstanding about what prediction intervals and confidence intervals are. I feel that way based on the initial phrasing of the problem. Gung is doing a good job trying to explain things to you. If you could be more precise about your question we may be able to give you a clearer answer and correct any confusion if it exists. – Michael R. Chernick Jul 31 '12 at 22:40
  • bmx, it looks to me like you have a clear idea of your question and a good awareness of some of the issues. You might be interested to review three closely related threads. http://stats.stackexchange.com/questions/17773 describes prediction intervals in nontechnical terms; http://stats.stackexchange.com/questions/26702 gives a more mathematical description; and in http://stats.stackexchange.com/questions/9131, Rob Hyndman provides the formula you seek. If these don't fully answer your question, at least they may give you a standard notation and vocabulary for clarifying it. – whuber Aug 01 '12 at 15:20

1 Answer


@whuber has pointed you to three good answers, but perhaps I can still write something of value. Your explicit question, as I understand it, is:

Given my fitted model, $\hat y_i=\hat m x_i + \hat b$ (notice I added 'hats'), and assuming my residuals are normally distributed, $\mathcal N(0, \hat\sigma^2_e)$, can I predict that an as-yet-unobserved response, $y_{new}$, with a known predictor value, $x_{new}$, will fall within the interval $(\hat y_{new} - \hat\sigma_e,\ \hat y_{new} + \hat\sigma_e)$ with probability 68%?

Intuitively, the answer seems like it should be 'yes', but the true answer is *maybe*. It would be 'yes' if the parameters (i.e., $m$, $b$, and $\sigma$) were known without error; since you estimated them, we need to take their uncertainty into account.

Let's first think about the standard deviation of your residuals. Because this is estimated from your data, there can be some error in the estimate. As a result, the distribution you should use to form your prediction interval should be $t_\text{df error}$, not the normal. However, since the $t$ converges rapidly to the normal, this is less likely to be a problem in practice.
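To see how quickly that convergence happens, here is a small sketch (Python with SciPy; the degrees of freedom shown are arbitrary) comparing $t$ and normal critical values:

```python
from scipy import stats

# Two-sided 95% critical values: t approaches the normal as df grows.
for df in (5, 10, 30, 100):
    print(f"t, df={df}: {stats.t.ppf(0.975, df):.4f}")
print(f"normal:      {stats.norm.ppf(0.975):.4f}")
```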

So, can we just use $\hat y_\text{new}\pm t_{(1-\alpha/2,\ \text{df error})}s$, instead of $\hat y_\text{new}\pm z_{(1-\alpha/2)}s$, and go about our merry way? Unfortunately, no. The bigger issue is that there is uncertainty about your estimate of the conditional mean of the response at that location due to the uncertainty in your estimates $\hat m$ & $\hat b$. Thus, the standard deviation of your predictions needs to incorporate more than just $s_\text{error}$. Because variances add, the estimated variance of the predictions will be:
$$
s^2_\text{predictions(new)}=s^2_\text{error}+\text{Var}(\hat mx_\text{new}+\hat b)
$$
Notice that the "$x$" is subscripted to represent the specific value for the new observation, and that the "$s^2$" is correspondingly subscripted. That is, your prediction interval is contingent on the location of the new observation along the $x$ axis. The standard deviation of your predictions can be more conveniently estimated with the following formula:
$$
s_\text{predictions(new)}=\sqrt{s^2_\text{error}\left(1+\frac{1}{N}+\frac{(x_\text{new}-\bar x)^2}{\sum(x_i-\bar x)^2}\right)}
$$

As an interesting side note, we can infer a few facts about prediction intervals from this equation. First, prediction intervals will be narrower the more data we had when we built the prediction model (because there's less uncertainty in $\hat m$ & $\hat b$). Second, predictions will be most precise if they are made at the mean of the $x$ values you used to develop your model, as the numerator of the third term will then be $0$. The reason is that, under normal circumstances, there is no uncertainty about the estimated slope at the mean of $x$, only some uncertainty about the true vertical position of the regression line. Thus, some lessons for building prediction models are: more data helps, not with finding 'significance', but with improving the precision of future predictions; and you should center your data collection efforts on the interval where you will need to make predictions in the future (to minimize that numerator), but spread the observations as widely from that center as you can (to maximize that denominator).
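As a concrete illustration, that last formula transcribes almost directly into code. This is a minimal sketch (the function name `s_pred` and its argument names are mine, not a standard API):

```python
import numpy as np

def s_pred(x, residuals, x_new):
    """Estimated standard deviation of a prediction at x_new.

    Direct transcription of the formula above; s^2_error uses
    N - 2 degrees of freedom, since two parameters were estimated.
    """
    n = x.size
    s2_error = np.sum(residuals**2) / (n - 2)
    # 1/N + (x_new - x_bar)^2 / sum((x_i - x_bar)^2)
    extra = 1.0 / n + (x_new - x.mean())**2 / np.sum((x - x.mean())**2)
    return np.sqrt(s2_error * (1.0 + extra))
```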

Having calculated the correct value in this manner, we can then use it with the appropriate $t$ distribution as noted above.
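Continuing the sketches above (this reuses the made-up `x`, `residuals`, `x0`, and `y_pred` from the question's example, and `s_pred` from the previous block), the interval would then be:

```python
from scipy import stats

# Prediction interval: y_hat +/- t_crit * s_pred, with df = N - 2.
df_error = x.size - 2
for conf in (0.68, 0.95):
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df_error)
    half_width = t_crit * s_pred(x, residuals, x0)
    print(f"{conf:.0%}: ({y_pred - half_width:.3f}, {y_pred + half_width:.3f})")
```

In practice you would usually let a library do this bookkeeping; for instance, statsmodels' OLS results provide a `get_prediction` method that reports such intervals.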

gung - Reinstate Monica
  • In this case, for the variances to add, don't you have to assume the residuals are uncorrelated with $\hat{m}x + \hat{b}$? Otherwise you'll need to include a covariance term. But that assumption seems incorrect, since the residuals are not independent of the prediction. – 24n8 Jul 11 '20 at 16:57