I have fitted two different binomial logistic regression models, A and B. Model A contains only one predictor variable. Model B contains a different set of predictor variables, none of which is the predictor included in model A. There is a notable degree of multicollinearity among the predictors in model B. I want to compare how well the two models can account for the variation in my data.

Discussions of the consequences of multicollinearity in linear regression models usually focus on its effect on the predictors. For example, Dormann et al. (2012) point out that the standard errors of collinear coefficients will be inflated, which leads to "inaccurate tests of significance for the predictors, meaning that important predictors may not be significant, even if they are truly influential" (p. 29).

However, the effect of multicollinearity on the overall performance of the model is less clear to me. This question asks whether multicollinearity affects the performance of the model as a classifier, with reference to the Wikipedia article on multicollinearity, which says that "multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set". In his answer to that question, @EdM seems to confirm that multicollinearity does not affect model reliability unless the model is used to predict a data set different from the one it was fitted on.

My case is somewhat different because I don't want to use the models as classifiers on new data. Instead, I want to compare how well they can explain my data set. So, that answer still leaves me with the following questions:

  • Is it valid to say that the explained variance of a model is invariant to the presence of multicollinearity?
  • Can I still use measures such as AIC or AUROC to compare the performance of my models A and B even though the predictors of model B are strongly correlated? (See the sketch after this list.)
  • Is there a quotable reference that discusses the effect of multicollinearity on the explained variance of models and on measures such as the AIC or AUROC?
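To make the first two bullets concrete, here is a minimal simulated sketch (the data, coefficients, and variable names are all invented for illustration): refitting the same logistic model after a nonsingular linear recombination of two collinear predictors leaves the fitted probabilities, maximized log-likelihood, AIC, and in-sample AUROC unchanged, because both design matrices span the same column space.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500

# Two strongly correlated predictors (invented stand-ins for model B).
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.95
eta = 1.0 * x1 - 0.5 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Collinear design matrix vs. a nonsingular recombination (sum and
# difference), which is far less correlated but spans the same space.
X = sm.add_constant(np.column_stack([x1, x2]))
Z = sm.add_constant(np.column_stack([x1 + x2, x1 - x2]))
fit_x = sm.Logit(y, X).fit(disp=0)
fit_z = sm.Logit(y, Z).fit(disp=0)

# The in-sample fit is identical up to numerical tolerance.
print(np.allclose(fit_x.predict(X), fit_z.predict(Z)))  # True
print(fit_x.llf, fit_z.llf)  # same maximized log-likelihood
print(fit_x.aic, fit_z.aic)  # same AIC (same number of parameters)
print(roc_auc_score(y, fit_x.predict(X)),
      roc_auc_score(y, fit_z.predict(Z)))  # same in-sample AUROC
```

If this reasoning is right, what multicollinearity inflates is the standard errors of the individual coefficients in model B, not the joint in-sample fit that AIC or AUROC summarizes.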
Schmuddi
  • What a dynamite question. btw I'll say Yes to the first 2 questions. I am curious, though, why within-sample performance is all you care about here. – rolando2 Feb 03 '17 at 18:42
  • @rolando2: There's a claim made by one author that his novel predictor is as good as the established predictors discussed by other authors. My data set comes from an experiment specifically designed to test the effect of the novel predictor against the combined effect of the established predictors. The point of the analysis is not to tease apart the effects of the established predictors, or to verify that they actually work; it is only to see whether the model using the novel predictor performs as well as the model using the established predictors. – Schmuddi Feb 03 '17 at 18:51
  • Thanks....A predictor requires a coefficient :-) How will anyone know that the within-sample results give the best, most reliable estimates of this novel predictor's coefficient...standardized coefficient...standard error...or, generally, performance? – rolando2 Feb 03 '17 at 19:18
  • The concept of explained variance has no clear application to logistic regression, the way it does to linear regression. Are you sure "explained variance" is what you mean? – Kodiologist Feb 03 '17 at 19:36
  • @Kodiologist: You're right, of course. What I mean in my first question is probably "Is it valid to say that the capability of a logistic regression model to correctly predict the probabilities of the values of the dependent variable is invariant to the presence of multicollinearity?" – Is there a more concise way of putting that? I'd gladly correct my question to be more accurate. – Schmuddi Feb 03 '17 at 19:42
  • @Schmuddi Yes, but then I don't understand what your concern is. The notion of predictive accuracy only makes sense when we consider performance on unseen data, and you've said that you only care about performance in a fixed sample. – Kodiologist Feb 03 '17 at 19:46
  • @Kodiologist: Let me try to rephrase my concern. When remapping the predicted probabilities of model A to a binary response, model A makes the correct predictions for say 70% of the observations. When doing the same for model B, model B makes correct predictions for say 75% of the observations. Is it still valid to say that model B can account better for the values of the observations than model A even though there is multicollinearity among the predictors used in model B? Or does the multicollinearity invalidate any comparison between A and B? – Schmuddi Feb 03 '17 at 19:51
  • @Schmuddi Multicollinearity indeed doesn't make the agreement between model outputs and actual observations any less useful as a measure of model fit, as far as I can see. I mean, I don't know why anybody would think that. – Kodiologist Feb 03 '17 at 20:10
  • @Kodiologist: "I don't know why anybody would think that" – unfortunately, I'm in a less than ideal position to ask that one particular reader of my manuscript who raised this issue what he was thinking, if you get my drift... – Schmuddi Feb 03 '17 at 20:21
  • @Schmuddi You mean a reviewer said that? It could be helpful to quote his or her comment exactly so we can help you decide on a reply. – Kodiologist Feb 04 '17 at 00:15
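As a hypothetical illustration of the accuracy comparison discussed in the comments above (simulated data; the model names, coefficients, and the 0.5 threshold are assumptions, not anything from the manuscript): threshold each model's fitted probabilities and count in-sample hits.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

# Invented data: one predictor for "model A", two collinear ones for "model B".
xa = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.2 * xa + 0.8 * x1 - 0.3 * x2)))
y = rng.binomial(1, p)

XA = sm.add_constant(xa)
XB = sm.add_constant(np.column_stack([x1, x2]))
fit_a = sm.Logit(y, XA).fit(disp=0)
fit_b = sm.Logit(y, XB).fit(disp=0)

# Remap fitted probabilities to a binary response at 0.5 and count hits.
acc_a = np.mean((fit_a.predict(XA) >= 0.5) == y)
acc_b = np.mean((fit_b.predict(XB) >= 0.5) == y)
print(acc_a, acc_b)  # in-sample fractions correct for A and B
```

Nothing in this calculation depends on how correlated model B's predictors are with one another, which is the point at issue in the comments.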

0 Answers