19

I am reading this note.

On page 2, it states:

"How much of the variance in the data is explained by a given regression model?"

"Regression interpretation is about the mean of the coefficients; inference is about their variance."

I have read such statements numerous times. Why would we care about "how much of the variance in the data is explained by the given regression model?"... and more specifically, why "variance"?

Luna
  • "[V]ariance" as opposed to what, the standard deviation? What is it that you think we ought to care about in regression? What are your typical goals in building a regression model? – gung - Reinstate Monica Jul 21 '12 at 16:20
  • The variance has different units than the quantity being modeled, so I've always found it hard to interpret the "proportion of variance explained by the model". – flies Nov 21 '16 at 23:14

2 Answers

18

why would we care about "how much of the variance in the data is explained by the given regression model?"

To answer this it is useful to think about exactly what it means for a certain percentage of the variance to be explained by the regression model.

Let $Y_{1}, ..., Y_{n}$ be the observed values of the outcome variable. The usual sample variance of the dependent variable in a regression model is $$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 $$ Now let $\widehat{Y}_i \equiv \widehat{f}({\boldsymbol X}_i)$ be the prediction of $Y_i$ based on a least squares linear regression model with predictor values ${\boldsymbol X}_i$. As proven here, the variance above can be partitioned as:
$$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 = \underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2}_{{\rm residual \ variance}} + \underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (\widehat{Y}_i - \overline{Y})^2}_{{\rm explained \ variance}} $$

In least squares regression, the average of the predicted values is $\overline{Y}$, therefore the total variance is equal to the averaged squared difference between the observed and the predicted values (residual variance) plus the sample variance of the predictions themselves (explained variance), which are only a function of the ${\boldsymbol X}$s. Therefore the "explained" variance may be thought of as the variance in $Y_i$ that is attributable to variation in ${\boldsymbol X}_i$. The proportion of the variance in $Y_i$ that is "explained" (i.e. the proportion of variation in $Y_i$ that is attributable to variation in ${\boldsymbol X}_i$) is sometimes referred to as $R^2$.
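
Here is a minimal numerical sketch of the decomposition (my own made-up data and variable names, using NumPy): fit a least squares model with an intercept and check that the total sample variance splits exactly into the residual and explained pieces.

```python
# Minimal sketch: verify the variance decomposition for a least squares fit.
# The data-generating process and names (X, y, beta, ...) are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                                   # two predictors
y = 1.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # outcome with noise

# Least squares fit; the constant column (intercept) is what makes the identity hold.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

y_bar = y.mean()
total_var     = np.sum((y - y_bar) ** 2) / (n - 1)      # sample variance of y
residual_var  = np.sum((y - y_hat) ** 2) / (n - 1)      # "residual variance"
explained_var = np.sum((y_hat - y_bar) ** 2) / (n - 1)  # "explained variance"

print(np.isclose(total_var, residual_var + explained_var))  # True
print("R^2 =", explained_var / total_var)                    # proportion of variance explained
```

(As the comments below point out, the identity depends on the fit being an orthogonal projection onto a space containing the constant vector, so dropping the intercept breaks it.)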

Now we use two extreme examples to make it clear why this variance decomposition is important:

  • (1) The predictors have nothing to do with the responses. In that case, the best unbiased predictor (in the least squares sense) for $Y_i$ is $\widehat{Y}_i = \overline{Y}$. Therefore the total variance in $Y_i$ is just equal to the residual variance and is unrelated to the variance in the predictors ${\boldsymbol X}_i$.

  • (2) The predictors are perfectly linearly related to the responses. In that case, the predictions are exactly correct and $\widehat{Y}_i = Y_i$. Therefore there is no residual variance and all of the variance in the outcome is the variance in the predictions themselves, which are only a function of the predictors. Therefore all of the variance in the outcome is simply due to variance in the predictors ${\boldsymbol X}_i$.

Situations with real data will often lie between the two extremes, as will the proportion of variance that can be attributed to these two sources. The more "explained variance" there is - i.e. the more of the variation in $Y_i$ that is due to variation in ${\boldsymbol X}_i$ - the better the predictions $\widehat{Y}_{i}$ are performing (i.e. the smaller the "residual variance" is), which is another way of saying that the least squares model fits well.
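
A small follow-up sketch (again with invented data) makes the two extremes, and a typical in-between case, concrete by computing $R^2$ for a least squares fit in each situation:

```python
# Sketch of R^2 in the two extreme cases and an intermediate one (invented data).
import numpy as np

def r_squared(X, y):
    """R^2 from a least squares fit with an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ beta
    return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 1))

y_unrelated = rng.normal(size=n)                        # case (1): X tells us nothing about y
y_perfect   = 3.0 + 2.0 * X[:, 0]                       # case (2): y is an exact linear function of X
y_typical   = 3.0 + 2.0 * X[:, 0] + rng.normal(size=n)  # real data usually sits in between

print(r_squared(X, y_unrelated))  # close to 0
print(r_squared(X, y_perfect))    # essentially 1
print(r_squared(X, y_typical))    # somewhere in between
```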

Macro
  • This is like my answer but perhaps a little bit better explained. Also, I see a possible critique that could have been mentioned: I should have written the variation relative to the mean of Y. – Michael R. Chernick Jul 07 '12 at 05:32
  • 1
  • @MichaelChernick, yes but in least squares regression (which I think the OP is talking about based on the linked slides), the mean of the predicted values equals the mean of the $Y$s, so you can just call it the sample variance of the predictions. – Macro Jul 07 '12 at 05:34
  • I made the edit to my answer because $\overline{Y}$ is needed for the variance decomposition to work properly. – Michael R. Chernick Jul 07 '12 at 05:40
  • Yes it was clear to me that she was referring to least squares regression. Still a lot of what you wrote is just repeating what I said slightly differently. I still gave you a +1. – Michael R. Chernick Jul 07 '12 at 05:47
  • Did you mean "(e.g., least squares)" or "(*i.e.*, least squares)"? The former gives the, perhaps somewhat mistaken, impression that least-squares is but one of many approaches that leads to such a variance decomposition. – cardinal Jul 07 '12 at 13:49
  • @cardinal, the decomposition of the variance into those two parts is still possible without using least squares regression but most of the subsequent interpretation I give only applies when $\widehat{Y}_i$ is a least squares prediction (since $\overline{Y}$ will not necessarily be the mean of the predictions in general models), so that's why I called it an "e.g." – Macro Jul 07 '12 at 14:15
  • 1
  • Macro, my point was that this decomposition occurs only if $\langle \mathbf y - \hat {\mathbf y}, \hat{\mathbf{y}} - \bar{y} \mathbf{1} \rangle = 0$ and so the "regression" inherently involves an orthogonal projection onto a space containing the constant vector. Note that we can easily "break" this decomposition by simply removing the constant vector from our model, which seems in conflict with your most recent comment. – cardinal Jul 07 '12 at 14:29
  • @cardinal, you're right. For some reason I had the idea that this decomposition makes sense in greater generality but after looking again at its proof I can see that it doesn't hold in general. I've edited a bit to make it clear I'm talking about least squares regression. – Macro Jul 07 '12 at 15:43
  • It is not unusual to still consider $1 - \|\mathbf y - \hat{\mathbf y}\|^2/\|\mathbf y - \bar y \mathbf 1\|^2$ as a measure of variation explained in more general models (e.g., [here](http://stats.stackexchange.com/q/7357/2970)), even though it doesn't have quite as nice of an interpretation as in this case. :) – cardinal Jul 07 '12 at 16:06
  • @Macro, I think this is quite good, but I have 2 questions. (1) In your top equation, called "sample variance" you divide by $n$, shouldn't that be $n-1$? (2) I always understood $R^2$ to be related to the decomposition of the *sum of squares* (eg, that's what the linked proof is discussing), but you appear to be working w/ *mean squares* in your second equation, does this equality hold in general w/ mean squares? – gung - Reinstate Monica Jul 21 '12 at 16:59
  • @gung, thanks for the comments. (1) What I've written there is the maximum likelihood estimator of the variance. You may be right that the convention is that 'sample variance' refers to the unbiased estimator. If I replace all of the $n$s with $(n-1)$s, nothing would change. (2) All I've done is divide through the equality by $n$ so that the decomposition can be thought of in terms of decomposing variance rather than sums of squares. – Macro Jul 21 '12 at 17:02
  • @gung, re: (1) note that is the MLE of the variance in the normal case. In general, though, dividing through by $n$ rather than $n-1$ gives a smaller MSE in estimating the variance (whether or not MSE is the proper measure of performance is another question entirely). – Macro Jul 21 '12 at 17:08
  • @Macro, I can see you're right. I was thrown off by the last sentence in the paragraph after the 2nd equation. At that point in the standard lecture that this answer emulates, I note that $R^2=SS_{reg}/SS_{tot}$, & reading quickly, I was left w/ the impression you were saying $R^2=MS_{reg}/MS_{tot}$, which students always assume & which you need to explicitly point out is *not* true. I'm also used to seeing / stressing that *sample* variance divides by $n-1$. The situation w/ the OP, or other future readers, may be similar. I wonder if a clarifying note or word of warning is in order? – gung - Reinstate Monica Jul 21 '12 at 17:20
  • @gung, I'm not totally clear what you're defining as $MS_{reg}$. If it is $MS_{reg} = SS_{reg}/n$ (and analogously for $tot$) then it seems that identity is true, since you're just multiplying by $n/n=1$. Can you clarify? – Macro Jul 21 '12 at 17:24
  • @Macro, when you make an ANOVA table for (eg) a simple regression model, you have 3 rows: reg, res, & tot; & 5 columns: SS, df, MS, F, & p. SS is calculated as your equation w/o dividing, df is 1 for each regressor ($N-1-p$ for res, etc), MS is $SS/df$, F is $MS_{reg}/MS_{res}$, p is $1-cdf(F_{df_{reg},df_{res}})$. Students always assume that $R^2=MS_{reg}/MS_{tot}$, since we say that $R^2$ is the % of *variance* & MS's are variances of a sort, but you have to stress that $R^2=SS_{reg}/SS_{tot}$ *only*. I think the reason you're OK is because $n\neq df$. – gung - Reinstate Monica Jul 21 '12 at 17:42
  • Macro, I do not understand your most recent edit: *Therefore the "explained" variance may be thought of as the proportion of variance in $Y_i$ that is attributable to variation in $X_i$, sometimes referred to as $R^2$.* The explained variance, as you have denoted it in display equation is neither a proportion nor sometimes referred to as $R^2$. – cardinal Jul 22 '12 at 00:03
  • @cardinal, some words got lost in that edit - I've fixed it. Thanks. (btw, the way it was phrased before that was technically incorrect also, so an edit was needed). – Macro Jul 22 '12 at 03:57
  • @gung, Thanks for clarifying. In this case, the mention of $R^2$ was only meant to place the variance decomposition in a familiar context but I think that going into a discussion of what $R^2$ _isn't_ may be an unnecessary digression - I appreciate the comments though. – Macro Jul 22 '12 at 04:13
  • 3 of my top answers un-upvoted within seconds.. To this very mature person: If your goal is to punish me, you're wasting your time- I really don't care. My reputation here has been established by positive contributions to the site in various capacities (giving statistical advice, editing posts/tag wikis, taking out the trash with flags/close votes), not by points. All you accomplish by this is lowering the answer score, potentially taking attention away from authoritative answers that could be useful to others. This can't possibly be a good use of your time and certainly doesn't help the site. – Macro Aug 04 '12 at 05:44
9

I can't run with the big dogs of statistics who have answered before me, and perhaps my thinking is naive, but I look at it this way...

Imagine you're in a car and you're going down the road and turning the wheel left and right and pressing the gas pedal and the brakes frantically. Yet the car is moving along smoothly, unaffected by your actions. You'd immediately suspect that you weren't in a real car, and perhaps if we looked closely we'd determine that you're on a ride in Disney World. (If you were in a real car, you would be in mortal danger, but let's not go there.)

On the other hand, if you were driving down the road and turning the wheel just slightly left or right immediately resulted in the car changing direction, tapping the brakes resulted in a strong deceleration, and pressing the gas pedal threw you back into the seat, you might suspect that you were in a high-performance sports car.

In general, you probably experience something between those two extremes. The degree to which your inputs (steering, brakes, gas) directly affect the car's motion gives you a clue about the quality of the car: the more of the variance in the car's motion that is related to your actions, the better the car; the more the car moves independently of your control, the worse the car.

In a similar manner, you're talking about creating a model for some data (let's call this data $y$), based on some other sets of data (let's call them $x_1, x_2, ..., x_i$). If $y$ doesn't vary, it's like a car that's not moving and there's really no point in discussing if the car (model) works well or not, so we'll assume $y$ does vary.

Just like the car, a good-quality model will have a good relationship between the results $y$ varying and the inputs $x_i$ varying. Unlike a car, the $x_i$ do not necessarily cause $y$ to change, but if the model is going to be useful the $x_i$ need to change in a close relationship to $y$. In other words, the $x_i$ explain much of the variance in $y$.

P.S. I wasn't able to come up with a Winnie The Pooh analogy, but I tried.

P.P.S. [EDIT:] Note that I'm addressing this particular question. Don't be confused into thinking that if you account for 100% of the variance your model will perform wonderfully. You also need to think about over-fitting, where your model is so flexible that it fits the training data very closely -- including its random quirks and oddities. To use the analogy, you want a car that has good steering and brakes, but you want it to work well out on the road, not just in the test track you're using.
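
To make the over-fitting caveat concrete, here is a rough sketch with made-up data (not part of the original answer): a very flexible polynomial "explains" nearly all of the variance in the training data while doing much worse on fresh data from the same process.

```python
# Sketch of over-fitting: high R^2 on training data, much lower on held-out data.
# The data, the true signal, and the polynomial degree are all invented for illustration.
import numpy as np

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 30
x_train = rng.uniform(-1, 1, size=n)
x_test  = rng.uniform(-1, 1, size=n)
signal = lambda x: np.sin(3 * x)                       # the "road", i.e. the true relationship
y_train = signal(x_train) + rng.normal(scale=0.3, size=n)
y_test  = signal(x_test)  + rng.normal(scale=0.3, size=n)

# Very flexible model: a degree-15 polynomial fit by least squares.
coefs = np.polyfit(x_train, y_train, deg=15)
print("train R^2:", r_squared(y_train, np.polyval(coefs, x_train)))  # near 1
print("test  R^2:", r_squared(y_test,  np.polyval(coefs, x_test)))   # much lower, possibly negative
```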

Wayne