
If you look at basic resources on R-squared, such as https://en.wikipedia.org/wiki/Coefficient_of_determination, they all tend to say the same thing: that it is a measure "about the goodness of fit of a model." However, when dealing with linear regressions that have a low absolute slope, that doesn't seem to (entirely) be the case.

Here's a dataset with 100 points drawn from a standard normal distribution. The values have been slightly tweaked so that the linear fit is exactly $y=0$:

[Demonstration graph: the original data with its fit $y = 0$, and a second part showing the biased data described below]

$R^2$ for the linear regression will be 0, which follows directly from the definition: because the fitted model is identical to the mean (both are 0), the residual sum of squares equals the total sum of squares. However, both from eyeballing the result and from knowing how the data was generated, we can see that the linear regression is actually a good fit, despite the worst-possible R-squared of 0.
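A minimal sketch of this setup using NumPy; the seed and the $x$-grid of $0,\dots,99$ are my assumptions, since the post doesn't specify them:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, not from the original post
x = np.arange(100.0)             # assumed x-grid (0..99); the post doesn't give one
y = rng.standard_normal(100)     # 100 points from a standard normal

# "Tweak" the draw so the OLS fit is exactly y = 0: subtract the fitted line.
slope, intercept = np.polyfit(x, y, 1)
y -= slope * x + intercept

def r_squared(x, y):
    """R^2 of the OLS line through (x, y): 1 - SS_res / SS_tot."""
    residuals = y - np.polyval(np.polyfit(x, y, 1), x)
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared(x, y))  # 0, up to floating-point rounding
```

Because the fitted line and the mean coincide, SS_res equals SS_tot no matter how tightly the points hug the line.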

In contrast, if we bias the data with a component of $\frac{x}{10}$, we get the second part of the graph. The linear fit to this biased data has $R^2 \approx 0.89303$, despite it fitting no better (or worse) than the original data: all that has happened is the addition of a small slope.
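Continuing the sketch above, the biased version is one line; the exact value depends on the random draw, but with the assumed $0,\dots,99$ grid it lands near the figure quoted:

```python
# Same points, same residuals; only a small slope term is added.
y_biased = y + x / 10
print(r_squared(x, y_biased))  # roughly 0.89 for this x-grid and draw
```

The residuals around the fitted line are exactly the same in both cases (OLS is linear in $y$, so the fit absorbs the added $x/10$ term); the jump in $R^2$ comes entirely from the slope term enlarging SS_tot.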

So it seems clear that $R^2$ is actually very sensitive to the slope of the fit, not just the goodness of the fit. Is this well-known/discussed anywhere? Or am I missing something in this analysis?
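For reference (my notation, not from the post): in simple linear regression $R^2 = r^2$ and $r = \hat\beta \, s_x / s_y$, so the slope enters directly:

$$R^2 = \frac{\hat\beta^2 s_x^2}{s_y^2} = \frac{\hat\beta^2 s_x^2}{\hat\beta^2 s_x^2 + s_e^2}$$

where $\hat\beta$ is the fitted slope, $s_x^2$ the sample variance of $x$, and $s_e^2$ the residual variance. With $\hat\beta = 0.1$, $s_e^2 \approx 1$, and an assumed $x$-grid of $0,\dots,99$ (so $s_x^2 \approx 833$), this gives $R^2 \approx 0.893$, consistent with the number above.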

D0SBoots
  • $R^2$ compares the amount of variability explained by the model compared to a naïve model that always predicts the mean of $y$. – Dave Apr 15 '21 at 04:00
  • @Dave: Yes, I know what $R^2$ is and what it is defined to measure; what I'm curious about is its use as a measure of goodness-of-fit. Most places that discuss it seem to just assume that if $R^2$ is low, i.e. none of the variance has been accounted for, then it must be a bad fit. But the slope-dependence seems to throw that into doubt. – D0SBoots Apr 15 '21 at 04:07
  • 2
  • You seem to have a different concept of what "goodness of fit" means than your references do. I'm not disagreeing with you, but only trying to suggest that therein lies a simple resolution to your question. (My opinions about how $R^2$ might be related to GoF are partly expressed at https://stats.stackexchange.com/a/13317/919.) – whuber Apr 15 '21 at 14:04
  • That answer you linked to was great, but it also left me a little confused. You wrote, "Normally it tells us nothing about 'linearity' or 'strength of relationship' or even 'goodness of fit' for comparing a sequence of models." That was the conclusion I was starting to come to as well, but it seems to contradict the sources I've read. – D0SBoots Apr 16 '21 at 20:07
  • " correlation only makes sense if the relationship is indeed linear. Second, the slope of the regression line is proportional to the correlation coefficient: slope = r*(SD of y)/(SD of x) Sometimes students will equate a steep slope with a high value of the correlation coefficient. This is an easy mistake to make, because the slope does depend directly on the correlation coefficient. However, the ratio of the standard deviations of y to x plays an equal role, and so one should not think "steep slope == high r"." - http://inspire.stat.ucla.edu/unit_02/teaching_tips.php – Polisetty Nov 20 '21 at 12:26

0 Answers