Fewer variables have higher R-squared value in logistic regression

Question

I am testing three modeling approaches for malnutrition in children. Theoretically, distal determinants (education, poverty) operate through proximal determinants (water, sanitation) in determining malnutrition rates. The three logistic models, where stunting is a binary indicator for malnutrition, are:

// Proximal determinants only: both binary indicators
stunting ~ water + sanitation

// Distal determinants only: both categorical indicators
stunting ~ i.education + i.poverty

// Both proximal and distal determinants
stunting ~ water + sanitation + i.education + i.poverty

I am surprised to find that the R-squared value of the second model is higher than that of the third model, where R-squared is calculated from the correlation between the predicted and actual values (Stata):

predict predicted, xb
corr predicted stunting
local rsq = r(rho)

While I expected the strength of the relationship and statistical significance of the more proximal causes to decrease (as they were soaked up by the distal causes), I expected the combined model to have higher explanatory power (as measured by r-squared). Does anyone have any explanation as to why the second model has the most explanatory power? Let me know if I can provide additional information for answering this question.

Regarding R-squared and logistic regression, you may want to take a look at this post: http://stats.stackexchange.com/a/3562 — Tim, Aug 24 '12 at 16:43
Just a quick check: do any of the independent variables have missing values? If so, that alone can cause the r-squared statistics to be incomparable. — whuber, Aug 24 '12 at 17:14

score 6 · Accepted Answer · answered Aug 24 '12 at 17:00

You should be careful about relying only on R^2 when interpreting fit in a non-linear regression. You may want to compare the log-likelihoods instead.

However, a decrease in R^2 when the number of variables increases generally means the variables are interacting in a way that does not provide additional explanation in the model. One cause may be, as you point out, that there are issues with intervening variables in the model. If this is the case, you may need to find an instrumental variable or use a structural model.
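
The log-likelihood comparison suggested above could look roughly like the following in Stata. This is only a sketch: the variable names come from the question, while `insample`, `full`, `proximal` and `distal` are names made up here for illustration. It also restricts all three models to the same observations, since missing values in any predictor would otherwise make the fit statistics incomparable.

```stata
// Fit the largest model first and flag the observations it actually uses
logit stunting water sanitation i.education i.poverty
generate byte insample = e(sample)   // hypothetical name for the sample flag
estimates store full

// Refit the smaller models on that same estimation sample
logit stunting water sanitation if insample
estimates store proximal

logit stunting i.education i.poverty if insample
estimates store distal

// Likelihood-ratio tests: each smaller model is nested in the full model
lrtest proximal full
lrtest distal full

// Side-by-side N, log-likelihood, AIC and BIC for all three models
estimates stats proximal distal full
```

When all three models are fit by maximum likelihood on identical observations, the log-likelihood of the combined model cannot be lower than that of either nested model. If the combined model still appears to fit worse by some other measure, that measure or a difference in samples is the likely explanation.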
