
I'm doing logistic regression in R with binary data (0s and 1s), sample size around 300, predicting one target variable (varp).

If I use one independent variable (varx), it is significant (p = 0.03, AIC = 200): glm(formula = varp ~ varx, family = binomial, data = mydata)

    Coefficients:
                Estimate Std. Error z value   Pr(>|z|)    
    (Intercept)  -1.0251     0.2215  -4.681 0.00000245 ***
    varx         -0.6551     0.3612  -2.118     0.0322 *  

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 211.06  on 205  degrees of freedom
    Residual deviance: 206.36  on 204  degrees of freedom
    AIC: 200

But when I use multiple independent variables of interest, the AIC drops to 170: glm(formula = varp ~ varx + varb + vargg + varkkk...., family = binomial, data = mydata)

How do I select the model (the one with one variable, or the group of variables) that best predicts varp?

  1. the model with one independent variable (varx), with AIC 200, or
  2. the model with a group of variables, with AIC 170; in this group varx becomes non-significant and another variable is significant instead.
kjetil b halvorsen
Den
1 Answer


First, the answer: based on your short explanation I would say 2, if you want to predict the dependent variable. If your goal is model parsimony, use AIC; if it is predictive power, use adjusted R2. Note the adjustment: in ordinary regression we tend to look at adjusted R2 rather than plain R2. You can maximize the predictive power of your model by evaluating prediction-error metrics (MAE, RMSE, etc.). And don't compare AIC to R2 directly; compare AIC with the change in adjusted R2 instead.
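To make the comparison concrete, here is a minimal sketch with simulated data (all variable names are made up for illustration, not from the question): fit the one-variable and multi-variable logistic models, then compare AIC and a simple out-of-sample prediction error.

```r
# Simulated data standing in for the question's (unavailable) dataset
set.seed(1)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.7 * x1 + 0.5 * x2))
mydata <- data.frame(y, x1, x2, x3)

train <- mydata[1:200, ]
test  <- mydata[201:300, ]

m1 <- glm(y ~ x1,           family = binomial, data = train)  # one predictor
m2 <- glm(y ~ x1 + x2 + x3, family = binomial, data = train)  # all predictors

AIC(m1, m2)  # lower AIC = better trade-off of fit vs. complexity

# Out-of-sample prediction error (Brier score: mean squared error of the
# predicted probabilities on held-out data); lower is better.
brier <- function(m) mean((predict(m, test, type = "response") - test$y)^2)
c(m1 = brier(m1), m2 = brier(m2))
```

If the two criteria disagree, the held-out prediction error is the more direct measure of predictive power.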

Second, some food for thought: I do not follow the reasoning behind your choice of models. Why run a model with only one independent variable when you have more potentially descriptive variables available? That makes little sense in my opinion. Include all the variables in your predictive model instead, and you will not have to compare it to anything else.

If it is because you want to know which variables to include, look into the feature-selection tag and perhaps the glmnet package, in which you can shrink the coefficients of uninformative independent variables to 0 and thereby get feature selection.
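A sketch of what that looks like with glmnet (assuming the glmnet package is installed; the data and variable names are simulated, not the asker's):

```r
library(glmnet)

# Simulated binary outcome driven by only 2 of 10 candidate predictors
set.seed(1)
n <- 300; p <- 10
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("var", 1:p)))
y <- rbinom(n, 1, plogis(-1 + 0.8 * X[, 1] + 0.6 * X[, 2]))

# Cross-validated lasso (alpha = 1) for a binomial outcome
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# Coefficients at the cross-validation-chosen penalty: uninformative
# variables are shrunk exactly to 0, which is the feature selection.
coef(cvfit, s = "lambda.min")
```

The variables whose coefficients survive the penalty are the ones the lasso keeps; there is no need to drop variables by p-value first.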

Thomas
  • Q: "I do not get the reasoning for your choice of models. Why do you run a model with only one independent variable when you are in possession of more potentially descriptive variables?" A: Because when I use the other descriptive variables, only one out of 10 is significant; the remaining 9 are not significant, with that AIC that seems better... – Den Aug 20 '20 at 14:29
  • Btw, I adjusted my answer. I meant go with 2, where the `AIC` is smaller and the variables are included. I would really recommend you read more about discarding non-significant variables; it is a commonly adopted but bad idea in statistics, [here](https://stats.stackexchange.com/a/476442/290826). And I still think a model with only one variable is not really a model, at most a very dubious one. Just because 9 of the 10 variables are not significant does not mean they are useless. – Thomas Aug 20 '20 at 21:50
  • Cont'd: And are you sure you have all the descriptive variables for your dependent variable? Maybe you are missing some! Have a look at the `glmnet` package and the opportunities with ridge, lasso & elastic-net models for binary outcomes [here](https://www.youtube.com/watch?v=ctmNq7FgbvI&t=12s) – Thomas Aug 20 '20 at 21:50
  • 1, 3, or all 10? If we speak about health data, getting all health variables is useless: if someone has a rib fracture, why include a variable 'tooth_pain'? That's the point, I'd like to narrow that information down, and since it's dichotomous data I don't see anything better than logistic regression. ... I'm getting confused about the 10 variables now, since I thought fewer variables are better and more explanatory when working with industry-specific data. Do you use 'Targets' or another package that can deal with that situation automatically? Any examples using lasso / glmnet? – Den Aug 21 '20 at 09:21
  • Of course, you also have to apply a certain constraint in picking out your variables. But in your case it just seems (I still have no idea what data you are actually using, as it has not been provided) that the 10 chosen variables are not all good predictors for your binary outcome, or maybe they are just better predictors for 0. I am not talking about including unintuitive variables, just questioning whether you really capture everything with the 10 variables. Where did you read that less is better than more? – Thomas Aug 21 '20 at 11:35
  • No, I have only conducted a binary prediction with `glmnet`. I followed [this](https://github.com/StatQuest/ridge_lasso_elastic_net_demo/blob/master/ridge_lass_elastic_net_demo.R). – Thomas Aug 21 '20 at 11:35
  • Thanks for this. 1) What about using BIC instead of AIC to select the right model? 2) What happens if the intercept becomes non-significant? Should we select only a model where the intercept is significant? Because if I include all 10 variables the intercept becomes non-significant... 3) By "predictors for 0" do you mean glm(target == 0 ~ ., data = mydata)? If so, it returns exactly the same as target == 1 ~ . – Den Aug 24 '20 at 10:08
  • Answer to first Q: You imply AIC and BIC try to answer the same question, which is not true. AIC tries to select the model that most adequately describes an unknown, high-dimensional reality. This means that reality is never in the set of candidate models being considered. By contrast, BIC tries to find the TRUE model among the set of candidates. But it is odd to assume that reality is instantiated in one of the models, and that is a real issue for BIC. My recommendation is to use both AIC and BIC. Most of the time they agree on the preferred model; when they don't, report both. – Thomas Aug 24 '20 at 10:27
  • Answer to second Q: There are many discussions about insignificant intercepts. [Here](https://stats.stackexchange.com/questions/102709/when-forcing-intercept-of-0-in-linear-regression-is-acceptable-advisable/102712#102712) and [here](https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model), and maybe the last one, [here](https://stats.stackexchange.com/questions/92903/intercept-term-in-logistic-regression), is better for your problem. – Thomas Aug 24 '20 at 10:29
  • Lastly, third Q: Well, I hoped you would get the same result. Sorry, I did not formulate it well enough before. But you understand that with a logistic regression you eventually find predictors for discriminating between 0 and 1, so the outcome variable is the same. – Thomas Aug 24 '20 at 10:31
  • Did it all help you? – Thomas Aug 25 '20 at 18:05
  • 1) Hi Thomas, a new question arises: the intercept is sometimes not significant. I've read that this is not important, that it's the explanatory variables that matter. How do I select between a model with a significant intercept (p < 0.05) and another where it is not significant (p = 0.69)? (Same intercept but in another model.) 2) I did not get the point where you say models with only one significant independent variable are not good. I found one model and selected the variable with a very small p; the AIC of that model is average, so there are models with better AIC but weaker p for the independent variable. – Den Aug 26 '20 at 15:31
  • A1: Well, then I would say we return to some of your original questions about choosing a model: if an insignificant intercept does not play a major role, then you choose your model based on the `AIC` and `BIC` (individually, as described before). I would choose the model with the lowest `AIC` (lower is better regardless of sign), and similarly for `BIC`. – Thomas Aug 27 '20 at 11:43
  • A2: Okay, now we are talking about one significant independent variable amongst many included variables. I understood it as you running a regression with only one dependent variable and one independent variable. If you have a model with multiple variables but only one is significant, go ahead and use that; then I must have misinterpreted your answer, and that is my bad. But don't discard the insignificant variables as noise. They tell you something important as well; you just have to interpret them correctly. – Thomas Aug 27 '20 at 11:46
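The AIC-versus-BIC comparison discussed in the comments can be sketched as follows (simulated data with made-up variable names; lower is better for both criteria):

```r
set.seed(2)
n <- 300
a <- rnorm(n); b <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.6 * a))
d <- data.frame(y, a, b)

m_one <- glm(y ~ a,     family = binomial, data = d)  # smaller model
m_all <- glm(y ~ a + b, family = binomial, data = d)  # extra predictor

# Report both criteria side by side, as suggested above
data.frame(AIC = c(one = AIC(m_one), all = AIC(m_all)),
           BIC = c(BIC(m_one), BIC(m_all)))
# BIC charges log(n) per parameter versus AIC's 2, so at n = 300 it
# penalizes the extra variable more heavily and tends to prefer the
# smaller model; when the two criteria disagree, report both.
```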