Why is the variance around the mean line used in the definition of $R^2$? What dictates this particular choice?

Question

Suppose we have a linear regression and we calculate $R^2 = 0.81$. That means there is $81\%$ less variance around the regression line than around the mean line, since $R^2 = \frac{\mathrm{Var}(\text{mean line}) - \mathrm{Var}(\text{regression line})}{\mathrm{Var}(\text{mean line})}$.

Question: why do we compare to the variance around the mean line specifically? Is it just the most natural choice, since we need to compare the variance around our fitted line to something? Why not compare to the variance around something like $f(n) = 2n + 10$ instead (I've just invented it)?

Dave · Accepted Answer · 2021-09-28T15:11:25.523


The linear regression is predicting the conditional mean. Therefore, the natural choice for a naïve guess is just to predict the marginal/pooled mean of $y$, regardless of what the features are.

Therefore, the interpretation of $R^2$ involves how your model tightens up the guessing, compared to doing the most naïve sensible type of predicting.

As an analogy, if you had to predict the probability of a coin coming up heads, the naïve model would be to guess $0.5$. Then if you can examine the coin for weight distribution or if it has heads/tails on both sides, you can tighten up the guessing. Until you have access to that additional information, however, the only reasonable guess is $0.5$.

(I have heard that coins are not actually 50/50. Let’s ignore that for now.)
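To make the comparison with the naïve guess concrete, here is a minimal sketch in Python with NumPy. The data are synthetic and the names (`sse`, `sse_mean`, `sse_arbitrary`, and so on) are made up for illustration; nothing here comes from the question's actual data. It computes $R^2$ as the relative reduction in squared error versus always predicting the pooled mean, checks that no other constant guess does better than the pooled mean, and shows what happens if an arbitrary line like $2x + 10$ is used as the reference instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): linear trend plus noise.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)


def sse(y_true, y_pred):
    """Sum of squared errors between observations and predictions."""
    return float(np.sum((y_true - y_pred) ** 2))


# Ordinary least-squares line with an intercept.
slope, intercept = np.polyfit(x, y, deg=1)
sse_model = sse(y, slope * x + intercept)

# Naive baseline: predict the pooled mean of y, ignoring x entirely.
sse_mean = sse(y, np.full_like(y, y.mean()))

# R^2 as the relative reduction in squared error versus the baseline.
r_squared = 1.0 - sse_model / sse_mean

# Among constant guesses, the pooled mean gives the smallest squared error.
candidates = np.linspace(y.min(), y.max(), 201)
sse_constants = np.array([sse(y, np.full_like(y, c)) for c in candidates])
best_constant = candidates[np.argmin(sse_constants)]

# An arbitrary reference line (the invented 2x + 10 from the question).
# It already uses x, so it is not a "know nothing about the features" guess,
# and the resulting ratio changes with whichever line one happens to invent.
sse_arbitrary = sse(y, 2.0 * x + 10.0)
ratio_vs_arbitrary = 1.0 - sse_model / sse_arbitrary

print(f"R^2 versus pooled mean:       {r_squared:.3f}")
print(f"best constant on the grid:    {best_constant:.3f}")
print(f"pooled mean of y:             {y.mean():.3f}")
print(f"ratio versus arbitrary line:  {ratio_vs_arbitrary:.3f}")
```

The grid search over constants is only a numerical sanity check; the fact that the sample mean minimizes the sum of squared deviations is a standard result. The ratio against the arbitrary line is still a number, but it has no fixed meaning, because a different invented line would give a different value for the same fitted model.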

edited Sep 28 '21 at 15:11

answered Sep 27 '21 at 03:04


Thank you for the answer! Could you clarify what is meant by "conditional mean" and "pooled mean"? (in your first paragraph) – mathgeek Sep 27 '21 at 03:16

Conditional mean is the mean value of $y$ for a given set of values of the $x$ features. Pooled mean is what you get if you add up all of the $y$ values and divide by the sample size (sample mean of $y$). These are pretty fundamental ideas in linear regression and the extensions to generalized linear models, machine learning, and predictive modeling, so if they are not clear to you, you might want to consult a textbook on regression like Agresti’s “Foundations of Linear and Generalized Linear Models”. – Dave Sep 27 '21 at 03:25
I'm just learning calculus. Does your book require a lot of knowledge? – mathgeek Sep 27 '21 at 20:27
Yes, then Agresti is beyond where you would want to operate. Unfortunately, I did not take statistics between high school and grad school, so I don't know of a more elementary text. – Dave Sep 27 '21 at 20:34
