I've fitted a multiple linear regression (one response, several predictors). The results give each coefficient estimate and its 95% confidence interval. I did this using Python and StatsModels (not that it matters), and the results look like this, for example:

                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   4.971e+04   1575.998     31.541      0.000      4.66e+04  5.28e+04
hdd          163.1509     35.301      4.622      0.000        93.350   232.951
cdd          879.7969     76.879     11.444      0.000       727.784  1031.810
occ          177.8679     20.619      8.627      0.000       137.099   218.637

Based on this, the best fitting result is:

y = 4.971e+04  +  163.1509 * hdd  +  879.7969 * cdd  +  177.8679 * occ

My question is, if I were to write an equation for the upper bound and one for the lower bound based on the confidence interval described above, would it be simply:

y_max = 5.28e+04  +  232.951 * hdd  +  1031.810 * cdd  +  218.637 * occ
y_min = 4.66e+04  +   93.350 * hdd  +   727.784 * cdd  +  137.099 * occ

So, do I just take all the coefficients from the 95% confidence section and plug them into the equation?

EDIT: A little clarification: I'm trying to write the equations that allow me to say, "with 95% probability, the data points lie between equation A and equation B".

Heliodor
  • No, that's not how to do it. In fact the curves defining the (pointwise) upper and lower confidence intervals for the conditional mean aren't even straight lines (/planes/hyperplanes). See the explanation [here](http://stats.stackexchange.com/questions/85560/shape-of-confidence-interval-for-predicted-values-in-linear-regression/85565#85565) for some simple intuition as to why. – Glen_b Apr 09 '15 at 00:49
  • @Glen_b, the link only talks about why the CI curves, not whether it represents the 95% confidence interval. In fact, if the tails are wider but the center is tighter, you could still maintain the 95% confidence interval... It's just that it wouldn't be spread evenly over the data points... Do you agree? – justanotherbrain Apr 09 '15 at 01:14
  • No, since the intervals are *pointwise* intervals. At each given value for $x$, the intervals have 95% coverage (if the assumptions hold); the link gives the *intuition* as to why a pointwise interval must curve. If you're not after pointwise intervals you should be more explicit. – Glen_b Apr 09 '15 at 01:29
  • See [here](http://stackoverflow.com/a/17560456/330679) for discussion of how to get pointwise CIs (i.e. the usual kind) when doing multiple regression in python – Glen_b Apr 09 '15 at 01:30
  • @Glen_b, thanks for the links. The explanation is very intuitive once I give it a minute to sink in. And yes, I'm looking to get the pointwise 95%. It's just like in the graphs at that link, but with three independent variables instead of one. I thought I was being pretty specific with my question, but you guys take it to the next level! Awesome. – Heliodor Apr 09 '15 at 01:36
  • Heliodor ... whoa, wait, if you want an interval for *data* (rather than means), that's *not* a confidence interval. Do you want a prediction interval (i.e. an interval for an unseen/future observation at a known set of x's)? Or do you want some other kind of interval? See [here](http://stats.stackexchange.com/questions/70410/confidence-intervals-for-regression-interpretation). Can you clarify more precisely what the situation is with these points you want an interval for? – Glen_b Apr 09 '15 at 02:06
  • OK, it's clear now that I'm looking for the prediction interval. I'm modeling how building energy usage varies with temperature and occupancy (leases). The building is now operating under new terms intended to reduce energy use, so we're taking subsequent data points to see what energy savings the data indicate. The R-squared is only 0.78, so I'm trying to measure how far outside the prediction interval we are, though it looks like we're far inside it. – Heliodor Apr 09 '15 at 02:37
  • @Glen_b *Pointwise* ah, yes, I agree with you - I'm embarrassed to say that I had my definition of confidence interval confused with prediction interval. – justanotherbrain Apr 09 '15 at 17:54
  • Heliodor; if you want to compare more than one point after an intervention with the points before, one thing you might do is fit a model to both (an indicator for `after` will pick up a shift up or down - and if you have lots of after points and expect something other than a plain shift - interactions of it with other predictors will pick up changes in the linear coefficients). Then an overall test of all coefficients involving `after` being zero gives a test of whether there has been any change across the intervention. However, one thing concerns me, which is possible dependence over time. – Glen_b Apr 09 '15 at 23:10
  • the confidence intervals for the prediction are given on slide 5/17 on this link http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf –  Aug 28 '16 at 10:08
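
The before/after comparison suggested in the comments can be sketched as follows. This is a minimal illustration on synthetic data, not the asker's: the sample sizes, coefficients, noise level and the size of the shift are all made up, and only a plain shift (a single `after` indicator, no interactions) is tested. The F statistic is computed by hand with NumPy, and the test assumes independent errors, which is exactly the concern raised about dependence over time.

```python
import numpy as np

# Synthetic stand-in for the building data (every number here is hypothetical).
rng = np.random.default_rng(0)
n_before, n_after = 36, 12
n = n_before + n_after
hdd = rng.uniform(0, 40, n)
cdd = rng.uniform(0, 20, n)
occ = rng.uniform(50, 100, n)
after = np.concatenate([np.zeros(n_before), np.ones(n_after)])  # 1 = post-intervention
y = (50000 + 160 * hdd + 880 * cdd + 180 * occ
     - 3000 * after + rng.normal(0, 1500, n))


def fit_rss(X, y):
    """Least-squares fit; returns (residual sum of squares, coefficients)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), beta


# Restricted model: no intervention effect. Full model: adds the `after` shift.
X_restricted = np.column_stack([np.ones(n), hdd, cdd, occ])
X_full = np.column_stack([X_restricted, after])

rss_restricted, _ = fit_rss(X_restricted, y)
rss_full, beta_full = fit_rss(X_full, y)

q = X_full.shape[1] - X_restricted.shape[1]  # number of `after` terms tested
df_resid = n - X_full.shape[1]               # residual df of the full model

# F statistic for H0: all coefficients involving `after` are zero.
f_stat = ((rss_restricted - rss_full) / q) / (rss_full / df_resid)

print(f"estimated shift after intervention: {beta_full[-1]:.1f}")
print(f"F = {f_stat:.2f} on ({q}, {df_resid}) degrees of freedom")
```

The statistic is compared with an F distribution on `(q, df_resid)` degrees of freedom. Adding interaction columns such as `after * hdd` to `X_full` increases `q` and tests for changes in the slopes as well.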

2 Answers

There are two issues that you need to understand. First, your equations do not take into account the correlation between the coefficient estimates. Second, a confidence interval is not the same thing as a prediction interval.

The confidence interval tells you where the mean response is likely to be for a given set of x-values. Far fewer than 95% of your observations (and future observations) will fall within the confidence bands. What you are asking for is a prediction interval, which tells you where a single future value is likely to fall.
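
For reference, the standard textbook forms of the two intervals at a new point $x_0$ (a vector with a leading 1 for the intercept) are shown below. The notation is an assumption introduced here, not taken from the question: $X$ is the design matrix, $n$ the number of observations, $p$ the number of coefficients including the intercept, and $\text{RSS}$ the residual sum of squares. The formulas rely on the usual assumptions of independent, constant-variance, normal errors.

$$\hat{y}_0 = x_0^\top \hat{\beta}, \qquad s^2 = \frac{\text{RSS}}{n - p}$$

$$\text{95\% confidence interval for the mean:} \quad \hat{y}_0 \pm t_{0.975,\,n-p}\; s \sqrt{x_0^\top (X^\top X)^{-1} x_0}$$

$$\text{95\% prediction interval for a new observation:} \quad \hat{y}_0 \pm t_{0.975,\,n-p}\; s \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}$$

Here $s^2 (X^\top X)^{-1}$ is the full covariance matrix of the coefficient estimates, off-diagonal terms included, which is what plugging in the individual coefficient limits ignores. The extra $1$ under the square root is the variance of a single new observation around its mean, which is why the prediction interval is always the wider of the two.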

There are many resources that discuss prediction and confidence intervals. Regression textbooks give the formulas to use along with a more detailed explanation. One online resource (one of many) is https://onlinecourses.science.psu.edu/stat501/node/315; it refers to a formula on the previous page (section 7.1), which you may need to open to see the formula in full.
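
As a rough sketch of how those textbook formulas play out, the NumPy code below computes a confidence interval for the mean and a prediction interval at one point, and compares them with the band obtained by plugging each coefficient's own 95% limit into the equation. Everything in it is an assumption for illustration: the data are synthetic (not the asker's), the new point `x0` is hypothetical, and the critical value 2.086 is the tabulated $t_{0.975}$ for 20 degrees of freedom, hardcoded to avoid a SciPy dependency.

```python
import numpy as np

# Synthetic data standing in for the building-energy set (all values hypothetical).
rng = np.random.default_rng(42)
n = 24
hdd = rng.uniform(0, 40, n)
cdd = rng.uniform(0, 20, n)
occ = rng.uniform(50, 100, n)
y = 50000 + 160 * hdd + 880 * cdd + 180 * occ + rng.normal(0, 3000, n)

X = np.column_stack([np.ones(n), hdd, cdd, occ])  # design matrix with intercept
p = X.shape[1]
df = n - p                                        # residual degrees of freedom (20)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                          # least-squares estimates
resid = y - X @ beta
s2 = resid @ resid / df                           # residual variance estimate
cov_beta = s2 * XtX_inv                           # covariance of estimates, off-diagonals included
se_beta = np.sqrt(np.diag(cov_beta))              # the "std err" column of the summary

# t_{0.975, 20} from a standard t table; scipy.stats.t.ppf(0.975, df) in general.
t_crit = 2.086

# Hypothetical new point: intercept term, then hdd, cdd, occ.
x0 = np.array([1.0, 20.0, 10.0, 75.0])
y_hat = x0 @ beta

se_mean = np.sqrt(x0 @ cov_beta @ x0)             # uncertainty of the fitted mean
se_pred = np.sqrt(s2 + x0 @ cov_beta @ x0)        # adds the noise of one new observation

ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

# Band from the question: plug every coefficient's own lower/upper limit in at once.
# (All entries of x0 are positive here, so this gives the low and high ends.)
naive = (x0 @ (beta - t_crit * se_beta), x0 @ (beta + t_crit * se_beta))

print(f"fitted value:        {y_hat:.0f}")
print(f"95% CI for the mean: {ci[0]:.0f} to {ci[1]:.0f}")
print(f"95% prediction int.: {pi[0]:.0f} to {pi[1]:.0f}")
print(f"plug-in band:        {naive[0]:.0f} to {naive[1]:.0f}")
```

The plug-in band is never narrower than the confidence interval for the mean, because it treats all the coefficients as sitting at their limits simultaneously, and it has no fixed relationship to the prediction interval. In StatsModels, the results object's `get_prediction(...).summary_frame(alpha=0.05)` reports both kinds of interval (`mean_ci_*` and `obs_ci_*` columns) without the manual algebra.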

Greg Snow

Are you referring to an upper bound in terms of the loss function that you used (i.e., a general bound)? Or do you mean an upper bound in terms of your data (which, in the real world, is not usually representative of the population)?

If you mean in terms of your data: By definition of an upper and lower bound, you cannot just take the 95% confidence interval UNLESS you say something like "with probability at least 97.5%, the upper bound is 5.28e4..." and the same for the lower bound. (Addendum: This is for the prediction interval over your data.)

The reason is that a hard upper (or lower) bound says that no point can ever lie above (or below) it. A 95% interval, by contrast, is built so that about 2.5% of points are expected to fall above it and another 2.5% below it.

If you mean a theoretical one (in terms of the loss function): then you will have to provide more information: the loss function you used, the kernel you used (if any), distributional assumptions that you're making, etc.

Make sense?