Should I report the pseudo $R^2$ value for full or final logistic regression model after removing NA's & running stepwise selection?

Question

I'm working with a logistic regression model in R:

model <- glm(response~., family="binomial", data)

and I'm using DescTools::PseudoR2(model, which="Nagelkerke") to get an estimate of model fit.

My dataset has missing values, so if I want to do stepwise selection I have to remove these missing values using na.omit(), but doing so drops 37 out of 187 rows of data.

data2  <- na.omit(data) 
model2 <- glm(response~., family="binomial", data2)

Comparing the two models before using stepwise, I noticed a significant drop in my pseudo $R^2$ value:

DescTools::PseudoR2(model,  which="Nagelkerke")  # 0.6496515  
DescTools::PseudoR2(model2, which="Nagelkerke")  # 0.4652934

After running a backward AIC-based step on model2, my pseudo $R^2$ drops to around 0.35, which makes sense with fewer variables.

Which $R^2$ result should I trust when presenting the model?

Could I present the $R^2$ from model 1 and build a model with the variables kept by the step model, fitted to the data without the NA rows removed? Or would that be an inflated and disingenuous estimate?

You don't want to use stepwise selection, regardless of missing data. You might consider doing multiple imputation and then using some more sensible way of building a model. — Peter Flom, Mar 08 '19 at 19:05
You should not use stepwise selection. It may help you to read my answer here: [Algorithms for automatic model selection](https://stats.stackexchange.com/a/20856/7290). You should also be wary, or at least clear about, using pseudo-$R^2$ (or even 'regular' $R^2$) as a measure of fit. It may help you to read: [Which pseudo-$R^2$ measure is the one to report for logistic regression (Cox & Snell or Nagelkerke)?](https://stats.stackexchange.com/q/3559/), & [Is $R^2$ useful or dangerous?](https://stats.stackexchange.com/q/13314/) — gung - Reinstate Monica, Mar 08 '19 at 19:10

score 1 · Answer 1 · answered Mar 08 '19 at 19:57

The short answer to your question is that you cannot use the pseudo $R^2$ from the full model. Your intuition is correct that it is inflated and disingenuous.

That said, you should not use the pseudo $R^2$ from the final model, either. Part of the reason is that pseudo $R^2$ does not really mean what people think it means, and part of the reason is that any attempt to assess goodness of fit after stepwise selection will be irrevocably flawed.

Put simply, you should not use stepwise selection at all.

From there, decide how you want to assess your model. Generally, regression models are designed to pick out conditional means (i.e., the mean of $Y$ when $X$ equals some particular value). So goodness of fit would mean that the model's fitted values are approximately right. You could assess this in various ways, such as plots or tests against saturated models. Alternatively, you could see how well the model predicts out of sample, for example via cross-validation with Brier scores or calibration checks.
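As a rough illustration of the out-of-sample route, here is a minimal base-R sketch of a 5-fold cross-validated Brier score. The data are simulated and the names `dat`, `x1`, and `x2` are placeholders, not the asker's variables; the number of folds is an arbitrary choice.

```r
# Minimal sketch: k-fold cross-validated Brier score for a logistic model.
# Simulated data stand in for the real dataset (an assumption for illustration).
set.seed(1)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame(
  response = rbinom(n, 1, plogis(-0.5 + 1.2 * x1)),
  x1 = x1,
  x2 = x2
)

k     <- 5
folds <- sample(rep(seq_len(k), length.out = n))  # random fold labels 1..k
pred  <- numeric(n)

for (i in seq_len(k)) {
  test_idx <- folds == i
  # Fit on the other k - 1 folds only
  fit <- glm(response ~ x1 + x2, family = binomial, data = dat[!test_idx, ])
  # Predicted probabilities for rows the model has not seen
  pred[test_idx] <- predict(fit, newdata = dat[test_idx, ], type = "response")
}

# Brier score: mean squared difference between predicted probability and outcome
brier <- mean((pred - dat$response)^2)

# Rough benchmark: always predict the overall event rate
# (computed on the full sample, so it is not itself cross-validated)
brier_ref <- mean((mean(dat$response) - dat$response)^2)

c(cv_brier = brier, reference = brier_ref)
```

Lower is better, and a score close to the reference suggests the predictors add little. If any variable selection is part of the modelling, it would need to be repeated inside each fold; otherwise the cross-validated score is optimistic.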

The question of dropping the NA's is a different issue. That is called 'complete case analysis'. It is valid (but somewhat underpowered) under the assumption that the data are missing completely at random (MCAR), and often still valid even when they are only missing at random (MAR), depending on the specifics of the situation.
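To see what complete case analysis does to the sample, a small sketch like the following may help. It uses simulated data (the names `dat`, `x1`, `x2` are placeholders) and relies on R's default `na.action` option, `na.omit`, being in effect.

```r
# Minimal sketch with simulated data (assumed, not the asker's dataset).
set.seed(2)
n   <- 100
dat <- data.frame(
  response = rbinom(n, 1, 0.5),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
dat$x1[sample(n, 10)] <- NA   # knock out some predictor values at random
dat$x2[sample(n, 15)] <- NA

colSums(is.na(dat))                       # missing values per column
n_complete <- sum(complete.cases(dat))    # rows with no missing values

# glm() applies getOption("na.action"), which is na.omit by default, so rows
# with an NA in any model variable are dropped before fitting.
fit_default  <- glm(response ~ ., family = binomial, data = dat)
fit_complete <- glm(response ~ ., family = binomial, data = na.omit(dat))

# Number of rows each fit actually used
c(n_complete = n_complete,
  default    = nobs(fit_default),
  complete   = nobs(fit_complete))
```

Checking `nobs()` on each fitted model is a quick way to confirm how many rows a fit actually used before comparing fit statistics across models.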

Here are some threads to read that might help you:
