
I am building a logistic multiple regression with 5 candidate explanatory variables. I read this post, fit all 2^5 = 32 possible combinations of explanatory variables, and chose the best model by AIC. However, the 'best' model does not include any significant variable, whereas some other models (with higher AIC) do have significant variables. I am not a statistician and do not understand how this situation arises. Thank you for any comments, clarification, or guidance on another approach if need be.

user38000

1 Answer


This is just a personal opinion, so take it with a grain of salt.

There are a few ways to approach model selection. One is to construct all possible models and select the one that is 'the best' according to some criterion, e.g. AIC. This is data dredging, and it is frowned upon by some people because it incorporates no existing knowledge and inflates the chance of finding a 'significant' model by chance alone. This is basically what you are doing. That said, data dredging is probably fine if you are exploring an unknown area of research where there is little or no theoretical knowledge.

Another way would be to construct, based on what you know of the phenomenon you are trying to model, a few valid hypotheses, and then build a model to test each of them. When you compare these models (for example using AIC), you can say something along the lines of "based on what is already known, out of these hypotheses, this hypothesis (or hypotheses) appears to describe my data best". You can also take the parameters of a few of the best models and average them (a weighted average), et voilà, you have an "average" model that is a compromise between several good models. This is called the information-theoretic approach. In biology, the book by Burnham and Anderson (Model selection and multimodel inference) advocates this approach. My memory isn't the best, but I think the Gentle Introduction to MARK has a chapter on this (with a technical part on how to model-average in that specific program).
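The weighted averaging step above is usually done with Akaike weights. A small sketch, with made-up AIC values and coefficient estimates purely for illustration (not real fits):

```python
# Akaike weights and a model-averaged coefficient, in the spirit of
# Burnham & Anderson. All numbers below are invented for illustration.
import numpy as np

aic = np.array([100.0, 101.5, 104.2])   # AICs of three candidate models
delta = aic - aic.min()                 # AIC differences from the best model
weights = np.exp(-0.5 * delta)
weights /= weights.sum()                # Akaike weights; they sum to 1

# Model-averaged estimate of a coefficient shared by all three models:
# the weighted average of the per-model estimates.
betas = np.array([0.42, 0.39, 0.55])
beta_avg = float(np.dot(weights, betas))
print(weights, beta_avg)
```

Models close in AIC get similar weights, so no single "best" model dominates the averaged estimate.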

Roman Luštrik
  • Thanks @Roman for this comprehensive review. If I go with expert knowledge as in your second point, how should I interpret a best model with lower AIC but no significant variables, when another model with higher AIC does include significant variables? – user38000 Jan 27 '14 at 10:44
  • "it doesn't incorporate any existing knowledge and maximizes the chance to find a significant model by chance alone. This is basically what you're doing." That reads as if it were something bad. Computational costs permitting, isn't the approach used by the OP the best one for model selection? – user603 Jan 27 '14 at 10:50
  • @user603: thanks. Am I wrong if I then choose, among the subset of models first selected by expert knowledge, the model with the lowest AIC that has at least one significant variable (rather than the model with the lowest AIC even if it has no significant variables)? – user38000 Jan 27 '14 at 11:30
  • @user38000 you would conclude that that variable has no statistically significant effect on the outcome. – Roman Luštrik Jan 27 '14 at 12:16
  • @user603 that's why I prefaced the text indicating it's my opinion and others may disagree. I think that dredging is OK if you have no working hypothesis/hypotheses. If there's theory behind your problem, one should also use that. Thanks for pointing it out, I will add this to the original text. – Roman Luštrik Jan 27 '14 at 12:17
  • @RomanLuštrik: my comment was meant as a question, not as a criticism of your answer. – user603 Jan 27 '14 at 12:35
  • Just to confirm and to be sure I get your points @user603 and @Roman: if I have 3 models (M1, M2, and M3) based on expert knowledge, should I fit all three and select the one with the lowest AIC (for example M1) even if it has no significant variables, regardless of whether M2, say, includes one significant variable but has a higher AIC? – user38000 Jan 27 '14 at 13:05
  • 3
    This kind of dredging is *NOT* OK. There are no principles in statistics that back up the approach you have outlined. A few simulations would expose the damage that variable selection does. In general, variable selection without penalization is invalid. – Frank Harrell Jan 27 '14 at 13:22
  • @Frank, could you please elaborate more on penalization for logistic mixed regression? – user38000 Jan 27 '14 at 13:34
  • 2
    In my experience, if you are just interested in predictive performance, don't perform any feature selection, but use penalised regression (c.f. ridge regression), see my answer to the question referenced in the question above. It is all too easy to over-fit the feature selection criterion and end up with a model that performs worse than the one with all of the features. You might find this paper useful http://www.jstor.org/stable/2347628 – Dikran Marsupial Jan 27 '14 at 15:14
  • Thanks @Dikran. I will have a look and try to see how far it will be possible to implement it with sas glimmix – user38000 Jan 27 '14 at 16:06
  • But as a non-statistician, I am a bit confused. I understood that automated selection is to be avoided, but I also learned that AIC is still OK for non-nested models. What, then, is wrong with stipulating three models based on hypotheses from expert knowledge and selecting the best one by AIC? – user38000 Jan 27 '14 at 19:32
  • This is basically what I advocate in my answer. It can go wrong if one of the three models is trivial, thus giving an unfair advantage to the other two. – Roman Luštrik Jan 27 '14 at 20:45
  • Examples of penalization include the lasso, elastic net, and ridge regression. As to whether AIC is OK for aggressive model selection: AIC is just a restatement of P-values, so the severe problems of variable selection based on statistical significance are not solved by AIC. – Frank Harrell Jan 27 '14 at 21:23
  • I will take into consideration all your valuable inputs. Great – user38000 Jan 28 '14 at 07:32
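As a footnote to the penalization comments above, here is a minimal sketch of the alternative Dikran Marsupial and Frank Harrell suggest: ridge-penalised logistic regression that shrinks all coefficients rather than selecting a subset. It uses scikit-learn on simulated data (the OP's actual tool is SAS glimmix, so this is only an illustration of the idea):

```python
# Ridge (L2) penalised logistic regression with the penalty strength chosen
# by cross-validation, on simulated data. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
# Two of the five predictors truly matter in this simulation.
y = (rng.random(n) < 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))).astype(int)

# All 5 coefficients are kept but shrunk towards zero; Cs sets the grid of
# penalty strengths searched by 5-fold cross-validation.
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l2").fit(X, y)
print(model.coef_)
```

Swapping `penalty="l2"` for `"l1"` (with a compatible solver) would give the lasso, which sets some coefficients exactly to zero instead of merely shrinking them.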