
I ran a linear regression with two independent variables on a dataset and got an R-squared of approximately 40%. I then divided the dataset into two clusters and ran the same linear regression, with the same independent variables, on each of the two clusters.

I was surprised to get lower multiple R-squared values of about 30% and 20% on the two clusters.

Intuitively, I didn't think this was possible, given that I'm now fitting to a smaller dataset in each cluster with the same independent variables.

Is this possible or must I be making a mistake somewhere?

Nick Cox
Will T-E
  • This phenomenon was one of the reasons I gave at http://stats.stackexchange.com/a/13317/919 for why $R^2$ is frequently useless or misleading. Pay attention instead to the *residuals* and their dispersion. – whuber Aug 06 '15 at 02:19
  • @whuber thanks for that. So I'm starting to get my head around an explanation that makes sense to me now. Given that R-squared compares the variance around the regression line with the variance around the mean, if you subset the data into smaller groups (based on the mean), then the mean is going to offer a much better estimate of the observations in those groups. Therefore, there won't be as much difference between the variance around the regression line and the variance around the mean, and the R-squared will be smaller, even though you are now making better predictions of the observations within each group (a quick simulation of this is sketched below these comments). – Will T-E Aug 10 '15 at 11:35
  • @whuber If you subsetted the data based on something other than the mean (or, more specifically, on something that is not correlated with the mean), then R-squared may not drop. The mean will offer no better estimate of the observations in the group, and the estimate based on the regression line may in fact have improved. This would lead to an increase in R-squared. Right?! – Will T-E Aug 10 '15 at 11:41
  • Those sound like good explanations--thank you for sharing them! – whuber Aug 10 '15 at 13:19
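To make the mechanism in these comments concrete, here is a minimal simulation (my own sketch, not the OP's data: one predictor instead of two, and an assumed cluster structure in which the cluster means differ on both variables). The within-cluster residual spread is the same as in the pooled fit, but the pooled mean is a much worse baseline, so the pooled R-squared comes out far higher than either within-cluster value.

```python
# Sketch: pooled R^2 vs within-cluster R^2 (illustrative, assumed data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Cluster 1 centred at x ~ 0; cluster 2 shifted up in both x and y.
x1 = rng.normal(0, 1, n)
y1 = 0.5 * x1 + rng.normal(0, 1, n)
x2 = rng.normal(5, 1, n)
y2 = 5 + 0.5 * x2 + rng.normal(0, 1, n)

def r2(x, y):
    """Fit a simple linear regression and return its R-squared."""
    X = x.reshape(-1, 1)
    return LinearRegression().fit(X, y).score(X, y)

print("pooled R^2:   ", r2(np.concatenate([x1, x2]), np.concatenate([y1, y2])))
print("cluster 1 R^2:", r2(x1, y1))
print("cluster 2 R^2:", r2(x2, y2))
```

The per-cluster fits predict the observations just as accurately (the noise standard deviation is identical), yet their R-squared values are far lower than the pooled one, which matches what the OP observed.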

1 Answer


If I understand your question correctly, perhaps this super awesome visualization will help your understanding of what's going on. This is an extreme case, but something similar may be happening with your data.

Looking at both groups together, we get a reasonably high $r^2$ value. However, looking at each one independently leads to an $r^2$ of approximately zero.

[Image: Super Awesome Visualization]
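For anyone who wants to reproduce the shape of that picture, a rough sketch with hypothetical data (two well-separated groups, no $x$–$y$ relationship inside either group) looks like this:

```python
# Sketch of the extreme case: pooled r^2 is high, within-group r^2 is near 0.
import numpy as np

rng = np.random.default_rng(1)
g1 = rng.normal([0, 0], 1, size=(100, 2))    # group 1: x and y independent
g2 = rng.normal([10, 10], 1, size=(100, 2))  # group 2: same, but shifted mean

def r_squared(data):
    x, y = data[:, 0], data[:, 1]
    return np.corrcoef(x, y)[0, 1] ** 2       # r^2 of a simple regression

print("pooled r^2: ", r_squared(np.vstack([g1, g2])))   # high (around 0.9)
print("group 1 r^2:", r_squared(g1))                     # roughly 0
print("group 2 r^2:", r_squared(g2))                     # roughly 0
```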


John Madden
  • Interesting, John, I guess this could be it. Still hard to get my head around, though! Thanks for your help! I guess this means clustering and then regressing will still offer more accurate predictions (vs. without clustering); it will just give a lower r-squared. Right?! – Will T-E Aug 06 '15 at 10:00
  • That depends on what you're trying to predict. If the cluster a case belongs to contains information about what you're trying to predict, then doing the regression separately would ignore that information. Also, how do you plan to use both regression models in your prediction? – John Madden Aug 06 '15 at 13:04
  • I'm afraid I don't really understand your comment. Regarding your question, right now I basically just use each regression to predict one observation ahead. I'm trying to predict shots in a soccer match, and basically just grouped players into two categories based on their propensity to shoot. I figured that some independent variables affected one group more than the other, and that I should therefore perform separate linear regressions. I am starting to think there are better ways to deal with this, however, perhaps introducing moderator variables (a sketch of that approach follows this thread). – Will T-E Aug 10 '15 at 15:45
  • That makes sense. I think the approach you're using will give good results. If you don't know which cluster future observations should belong to you can use some kind of classification algorithm. – John Madden Aug 10 '15 at 15:51
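As a footnote to this thread: a single-model alternative to fitting separate regressions per cluster is the moderator-variable (interaction) approach the OP mentions. A minimal sketch, with hypothetical column names and made-up data (not the OP's soccer dataset):

```python
# Sketch: one regression with a group indicator and interaction terms,
# giving each group its own intercept and slope within a single model.
# The data and column names here are invented for illustration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "shots":   [1, 3, 0, 2, 4, 6, 3, 5],
    "minutes": [60, 90, 30, 80, 90, 90, 70, 85],
    "group":   ["low", "low", "low", "low", "high", "high", "high", "high"],
})

# 'minutes * C(group)' expands to minutes + group + minutes:group.
model = smf.ols("shots ~ minutes * C(group)", data=df).fit()
print(model.summary())
```

Because everything sits in one model, the reported R-squared is computed against the overall mean, so the between-group variation is not discarded the way it is when R-squared is computed separately within each cluster.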