
In my work, I am comparing "predicted" values to "theoretically true" values. To calculate one predicted value, I take some $N_{R_i}$ samples from an area in space, $R_i$ (i.e. the samples come from different locations within a region). I run some calculations on each sample to produce a per-sample prediction, then average those predictions to get a single predicted value for the region. There are around 10 regions in total, so 10 predicted values.

The theoretically true values are back-calculated using a very different methodology. Theoretically true values are only available many years after the original samples are taken (that's why we bother to make predictions). There can only be one theoretically true value per region $R_i$. That's why I averaged samples from within each region to compare to the single theoretically true value per region.

I hope you are with me so far.

So, what I have is a small data set of about 10 predicted and "true" values. The correlation between them is strong.

Now, what if we expand into a new region? I can take some new samples and make a prediction for that region.

I want to know how to calculate the uncertainty in my prediction.

I think I am after "Uncertainty in the mean" since my prediction is an average of $n$ samples.

$$ SE_x=\frac{s}{\sqrt{n}} $$

So, I think I can say that if I calculate a predicted value, $x$, there is a 68% chance for the "true value" to be within $1\,SE_x$ of $x$. Is this correct?
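For concreteness, here is a minimal sketch of that calculation (the sample values are made up for illustration):

```python
import numpy as np

# Made-up per-sample predictions for one region R_i.
samples = np.array([4.2, 3.9, 4.5, 4.1, 4.8, 3.7, 4.3])

n = len(samples)
prediction = samples.mean()       # the region's predicted value
s = samples.std(ddof=1)           # sample standard deviation
se_mean = s / np.sqrt(n)          # standard error of the mean

print(f"predicted value = {prediction:.3f} +/- {se_mean:.3f} (1 SE)")
```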

My colleague thinks we should be interested in the standard error for a predicted value:

$$ s_{y_p}=s_e\sqrt{1+\frac{1}{n}+\frac{(x-\overline{x})^2}{\sum(x_i-\overline{x})^2}} $$

(she found that equation here: http://courses.ncssm.edu/math/Talks/PDFS/Standard%20Errors%20for%20Regression%20Equations.pdf)
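If I've read her formula correctly, applying it to our setting would look something like this minimal sketch (the calibration numbers are invented, and `x_new` is a hypothetical new region):

```python
import numpy as np

# Invented calibration data: average predicted value per region (x)
# against the back-calculated "true" value (y), for ~10 regions.
x = np.array([2.1, 3.4, 2.9, 4.0, 3.1, 2.5, 3.8, 3.3, 2.7, 3.6])
y = np.array([2.3, 3.5, 2.8, 4.2, 3.0, 2.6, 3.9, 3.4, 2.9, 3.7])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)                # slope, intercept of y on x
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))   # residual standard error
Sxx = np.sum((x - x.mean())**2)

x_new = 3.0                                 # prediction for a new region
se_pred = s_e * np.sqrt(1 + 1/n + (x_new - x.mean())**2 / Sxx)
print(f"prediction SE at x = {x_new}: {se_pred:.3f}")
```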

Who is right? Or maybe a better question is: what will the 2nd formula give me that the first will not?

  • The second formula takes into account that a difference between the sample slope and the population slope will have its biggest impact on predictions based on X values very far from the mean of X. – David Lane Jul 17 '17 at 21:56

2 Answers


Let's start with:

So, I think I can say that if I calculate a predicted value, $x$, there is a 68% chance for the "true value" to be within $1\,SE_x$ of $x$. Is this correct?

No, that is not correct unless you are using a Bayesian analysis. Once the interval is computed, there is either a zero percent or a one hundred percent chance that the true value lies inside it, and there is no way to tell which. Confidence intervals are not measures of the precision of your work, and they do not measure uncertainty about the parameter; they describe the role chance plays in the sampling distribution.

What they do tell you is that if you were to repeat the experiment an infinite number of times, then at least 68% of the intervals constructed as the estimate $\pm 1\,SE$ would contain the true value of the parameter. They provide no probability that the parameter lies inside any one realized interval. Moreover, there are infinitely many valid intervals for any given $\alpha$ level of confidence, not just the one you gave; other standard constructions exist as well, such as nonparametric error bars.
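To see that coverage interpretation concretely, here is a minimal simulation sketch (all parameters are arbitrary choices for illustration). The empirical coverage of $\pm 1\,SE$ intervals comes out near 0.68, and equals 0.683 only in the large-$n$ normal limit:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 10.0, 2.0, 30, 100_000   # arbitrary illustration values

data = rng.normal(mu, sigma, (trials, n))
means = data.mean(axis=1)
ses = data.std(axis=1, ddof=1) / np.sqrt(n)

# Fraction of +/-1 SE intervals that contain the true mean.
covered = np.mean((means - ses <= mu) & (mu <= means + ses))
print(f"coverage of +/-1 SE intervals: {covered:.3f}")   # close to 0.68
```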

Your colleague is correct, provided the conditions of linearity, independent errors, normally distributed errors, and homoscedasticity are met, along with the requirement that the prediction falls inside the range where the scientific model is valid. If not, then it is the wrong form for the prediction interval.

The prediction interval is, of course, far wider than the confidence interval: it contains both the impact of chance on your parameter estimate and the effect of chance in nature itself. Nature doesn't know you made a prediction.
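As a quick illustration of that width difference, here is a sketch using statsmodels, whose `get_prediction` reports both intervals side by side (the data are simulated purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 10)                        # ~10 regions, as in the question
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)    # simulated "true" values

res = sm.OLS(y, sm.add_constant(x)).fit()

x_new = sm.add_constant(np.array([5.0, 9.5]))     # near and far from mean(x)
frame = res.get_prediction(x_new).summary_frame(alpha=0.32)  # ~68% intervals

# obs_ci (prediction interval) is visibly wider than mean_ci (confidence
# interval), and both widen as x_new moves away from mean(x).
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```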

There is a chapter in Decision Theory: Principles and Approaches by Giovanni Parmigiani on scoring rules for predictions. It is generic rather than applied, though; if you are looking for "give me the formula to plug in," it is the wrong book. The lecture notes you linked also provide a discussion of predictions.

As to your question, "I want to know how to calculate the uncertainty in my prediction": for that you need a Bayesian method. Bayesian methods are built around uncertainty; Frequentist methods are built around chance. They are subtly different things. In Neyman-Pearson methods, the null hypothesis and the sample space drive the math; in Bayesian methods, information and a model of nature drive the math.
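To make that concrete, here is a minimal sketch of a Bayesian credible interval for a region mean, using a conjugate normal-normal model with the sampling standard deviation assumed known; all numbers, including the prior, are illustrative assumptions rather than recommendations:

```python
import numpy as np

# Made-up per-sample predictions for one region.
samples = np.array([4.2, 3.9, 4.5, 4.1, 4.8, 3.7, 4.3])
sigma = 0.5                      # sampling sd, assumed known for simplicity
prior_mu, prior_sd = 4.0, 2.0    # weakly informative prior on the region mean

n = len(samples)
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)   # conjugate update
post_mu = post_var * (prior_mu / prior_sd**2 + samples.sum() / sigma**2)
post_sd = np.sqrt(post_var)

# Unlike a confidence interval, this supports a direct probability statement:
print(f"P({post_mu - post_sd:.2f} <= mu <= {post_mu + post_sd:.2f} | data) ~ 0.68")
```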

Rather than just switch to Bayesian analysis, however, I recommend you learn a bit about both and ask yourself what you really want to know. They provide different things using the same data; they are not two different ways to solve one problem, but solutions to two different problems.

There is a good discussion here on Stack Exchange of the difference between Bayesian and Frequentist intervals: see "What's the difference between a confidence interval and a credible interval?"

It may give you some clarity about what you really want to ask and know, and about the limits of what either method can provide.

Dave Harris
    +1. [Here](http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/) is an article by Jake VanderPlas that has examples of the frequentist and Bayesian approaches to solving different kinds of statistical problems, including regression. – LmnICE Jul 19 '17 at 10:56

As Dave Harris pointed out, it is wrong to think there is a 68% chance the true value lies in the interval. That aside, I think the standard error of the mean is what you are after for your problem. The difference between the two equations, put simply, is that the standard error of the mean estimates the variability in the distribution of the sample mean: you collected one sample and recorded its mean, but you could have collected any number of different samples with different means. What is the variability in those means?

In contrast, the standard error for a predicted value estimates the variability in trying to predict a particular new element of the population. Put another way: you have recorded the mean of your sample; how much uncertainty is there in predicting a new value with that mean?

Mathematically, it is the difference between $\mathrm{Var}(\hat{\mu})$ and $\mathrm{Var}(y^* - \hat{\mu})$. The former includes only the variability of the sample mean; the latter includes that plus the variability in $y$ itself.
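A quick simulation sketch makes the gap between those two variances concrete (all parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 0.0, 1.0, 10, 200_000   # arbitrary illustration values

means = rng.normal(mu, sigma, (trials, n)).mean(axis=1)   # sampled mu-hats
y_new = rng.normal(mu, sigma, trials)                     # new observations

print(f"Var(mu_hat)      ~ {means.var():.4f} (theory: {sigma**2 / n:.4f})")
print(f"Var(y* - mu_hat) ~ {(y_new - means).var():.4f} "
      f"(theory: {sigma**2 * (1 + 1/n):.4f})")
```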

kjetil b halvorsen