I have a logistic regression model with ten independent variables, two of which are included as controls. While their inclusion is necessary for correctly assessing the coefficients of the other variables, the two controls on their own produce an extremely high overall predictive accuracy. To complement the evaluation of the individual coefficients, I would like to find a summary statistic of the predictive accuracy achieved solely by the combined effects of the other eight predictors of the model, excluding the controls.

One of the answers here: https://www.researchgate.net/post/Questions_regarding_control_variables suggested first fitting a model using only the controls and noting its accuracy scores (AUC and Brier score, obtained by bootstrap resampling in this case), then fitting the full model and subtracting the scores achieved using only the controls. The difference is then taken as a measure of the predictive performance contributed by the remaining variables.
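To make the subtraction concrete, here is a minimal sketch of how I understand the procedure. The outcomes and predicted probabilities are made up for illustration and do not come from my data, and the bootstrap step is left out; in practice both models would be scored on the same resamples.

```python
def brier_score(y, p):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def auc(y, p):
    """Share of positive/negative pairs ranked correctly; ties count as 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [0, 0, 1, 1, 0, 1, 1, 0]
p_controls = [0.2, 0.4, 0.6, 0.5, 0.5, 0.7, 0.4, 0.3]  # controls-only model
p_full = [0.1, 0.3, 0.8, 0.6, 0.4, 0.9, 0.6, 0.2]      # full model

# Higher AUC is better, lower Brier score is better, so the signs differ.
auc_gain = auc(y, p_full) - auc(y, p_controls)
brier_reduction = brier_score(y, p_controls) - brier_score(y, p_full)
print(f"AUC gain: {auc_gain:.3f}, Brier reduction: {brier_reduction:.3f}")
```

Both differences are positive when the full model does better than the controls-only model.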

Is this a valid and recommended approach? I can't seem to find this anywhere on SE, probably due to my limited grasp of the relevant terminology or because the answer is glaringly obvious.

humperderp
  • You can do it, but the model with more variables will always have better within-sample performance. I think if you want to know how much x contributes, just look at its coefficients. – rep_ho Jan 21 '20 at 08:33
  • I should probably have specified that I intend to discuss each coefficient individually, but that I would like to complement this with a summary statistic of the predictive ability of these variables combined. – humperderp Jan 21 '20 at 10:52

2 Answers

This came down to extremely poor research on my part. A more appropriate wording of the question is that it concerns the added predictive value of including the variables other than the controls or, alternatively, a nested model comparison between the controls-only model and the full model.

Following this blog post https://www.fharrell.com/post/addvalue/ by Frank Harrell, the procedure I described above of comparing performance measures appears to be reasonable, apart from the choice of measure: the AUC is not sensitive enough for model comparison. A pseudo R-squared based on the log-likelihood, as mentioned by @Janosch, appears to be a good alternative. Although plenty more alternatives are presented in the blog post, a likelihood ratio test between the controls-only model and the full model would also appear to be a straightforward and sensible approach.
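As a rough sketch of the likelihood ratio test, assuming the maximized log-likelihoods of the two nested models are already available: the log-likelihood values below are hypothetical placeholders, and the chi-square tail probability uses a closed form that only holds for an even number of degrees of freedom (here 8, one per added predictor).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^k / k! for k = 0 .. df/2 - 1, building each term from the last.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(loglik_controls, loglik_full, df):
    """Return the LR statistic and its p-value for two nested models."""
    stat = 2.0 * (loglik_full - loglik_controls)
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical maximized log-likelihoods, for illustration only.
stat, p_value = likelihood_ratio_test(loglik_controls=-210.4, loglik_full=-198.7, df=8)
print(f"LR statistic: {stat:.2f}, p-value: {p_value:.4f}")
```

A small p-value would indicate that the eight predictors jointly improve the fit beyond what the controls achieve. For an odd number of added parameters, a general chi-square routine from a statistics library would be needed instead.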

humperderp

You can also use a pseudo R-squared for logistic regression: https://thestatsgeek.com/2014/02/08/r-squared-in-logistic-regression/

Then evaluate the R-squared for your null model with only the controls and for your alternative model with all variables.
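A minimal sketch of that comparison, assuming McFadden's definition (one of several pseudo R-squared variants, picked here only for illustration) with an intercept-only model as the baseline. The outcomes and predicted probabilities are hypothetical and are not output from fitted models.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p))


def mcfadden_r2(y, p):
    """McFadden's pseudo R-squared relative to an intercept-only model."""
    base_rate = sum(y) / len(y)  # intercept-only model predicts the base rate
    ll_baseline = log_likelihood(y, [base_rate] * len(y))
    return 1.0 - log_likelihood(y, p) / ll_baseline


# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [0, 0, 1, 1, 0, 1, 1, 0]
p_controls = [0.2, 0.4, 0.6, 0.5, 0.5, 0.7, 0.4, 0.3]  # controls-only model
p_full = [0.1, 0.3, 0.8, 0.6, 0.4, 0.9, 0.6, 0.2]      # full model

r2_controls = mcfadden_r2(y, p_controls)
r2_full = mcfadden_r2(y, p_full)
print(f"controls only: {r2_controls:.3f}, full: {r2_full:.3f}, "
      f"gain: {r2_full - r2_controls:.3f}")
```

The gain in pseudo R-squared from the controls-only model to the full model is then the quantity to report.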

Janosch
  • Thanks, but I think I would like to avoid pseudo R-squared, following from answers to this question for example: https://stats.stackexchange.com/questions/3559/which-pseudo-r2-measure-is-the-one-to-report-for-logistic-regression-cox-s – humperderp Jan 21 '20 at 10:43
  • It says in the citation "However, they may be helpful in the model building state as a statistic to evaluate competing models." – Janosch Jan 21 '20 at 11:52
  • Sure, but does it matter for the validity of the procedure I described whether I used the Brier score or pseudo R-squared as the overall performance measure? – humperderp Jan 21 '20 at 12:27