I am testing three modeling approaches for malnutrition in children. Theoretically, distal determinants (education, poverty) operate through proximal determinants (water, sanitation) in determining malnutrition rates. The three logistic models, where stunting is a binary indicator for malnutrition, are:
// Proximal determinants only: both binary indicators
logit stunting water sanitation
// Distal determinants only: both categorical indicators
logit stunting i.education i.poverty
// Both proximal and distal determinants
logit stunting water sanitation i.education i.poverty
I am surprised to find that the r-squared value of the second model is higher than that of the third model, where r-squared is calculated from the correlation between the predicted and actual values (Stata):
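// Linear prediction (log odds) from the most recently fitted model
predict predicted, xb
corr predicted stunting
// Square the correlation so that rsq holds an r-squared rather than r
local rsq = r(rho)^2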
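For reference, the calculation above could be repeated for all three models along the following lines. This is only a sketch. It assumes the variables are named as in the model formulas, and the names `nmiss`, `complete`, and `p1` to `p3` are made up for illustration. It also differs from my snippet in two ways: each model is fitted on the same complete-case sample, so that the three values refer to the same observations, and it uses predicted probabilities (`pr`) rather than the linear predictor (`xb`).

```stata
// Flag observations with no missing values on any variable used in any model
// (nmiss and complete are hypothetical helper variables)
egen nmiss = rowmiss(stunting water sanitation education poverty)
generate byte complete = (nmiss == 0)

// Right-hand sides of the three models
local m1 "water sanitation"
local m2 "i.education i.poverty"
local m3 "water sanitation i.education i.poverty"

forvalues k = 1/3 {
    // Fit model k on the common sample
    quietly logit stunting `m`k'' if complete
    // Predicted probability of stunting for the estimation sample
    quietly predict double p`k' if e(sample), pr
    // Squared correlation between predicted and observed outcome
    quietly corr p`k' stunting
    display "Model `k': N = " e(N) "  squared corr = " %6.4f r(rho)^2
}
```

If the reported `N` is the same for all three models, differences in the estimation sample can be ruled out as the reason for the ordering.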
I expected the strength and statistical significance of the proximal determinants to decrease in the combined model (as they were soaked up by the distal determinants), but I also expected the combined model to have the highest explanatory power, as measured by r-squared. Can anyone explain why the second model has the most explanatory power? Let me know if I can provide additional information to help answer this question.