
In a small data set ($n \sim 100$) that I am working with, several variables give me perfect prediction/separation. I thus use Firth logistic regression to deal with the issue.

If I select the best model by AIC or BIC, should I include the Firth penalty term in the likelihood when computing these information criteria?
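
For concreteness, Firth's method maximizes the Jeffreys-penalized log-likelihood $\ell(\beta) + \tfrac{1}{2}\log\left|X^\top W X\right|$, where $W = \operatorname{diag}\{\pi_i(1-\pi_i)\}$. Below is a minimal hand-rolled sketch of that objective in Python (illustrative only; the helper names and the use of `scipy.optimize` are my own choices, not any particular package's API):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def firth_penalized_loglik(beta, X, y):
    """Firth-penalized log-likelihood: l(beta) + 0.5 * log|X'WX|."""
    eta = X @ beta
    # numerically stable Bernoulli log-likelihood: y*eta - log(1 + exp(eta))
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    p = expit(eta)
    W = p * (1.0 - p)                      # diagonal of the logistic weight matrix
    info = X.T @ (W[:, None] * X)          # Fisher information X'WX
    _, logdet = np.linalg.slogdet(info)    # log|X'WX|, the Jeffreys penalty
    return loglik + 0.5 * logdet

def fit_firth(X, y):
    """Maximize the penalized log-likelihood (= MAP under a Jeffreys prior)."""
    k = X.shape[1]
    res = minimize(lambda b: -firth_penalized_loglik(b, X, y),
                   x0=np.zeros(k), method="BFGS")
    return res.x, -res.fun                 # coefficients, maximized penalized log-lik
```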

StasK
  • Good question! Both are derived assuming maximum-likelihood fits, so you can't use the unpenalized likelihood; but I'm not sure that simply substituting it with the penalized likelihood is the answer. – Scortchi - Reinstate Monica Mar 01 '14 at 21:31
  • Konishi & Kitagawa (1996), "Generalized information criteria in model selection", *Biometrika*, **83**, 4 might be helpful. – Scortchi - Reinstate Monica Mar 01 '14 at 21:51
  • Just for reference, it's [here](http://biomet.oxfordjournals.org/content/83/4/875.abstract), but I don't have access -- is that the one where they proposed the trace of information matrix (or something like that) as the penalty term? I remember working with their idea like 15 years ago in mixture modeling. – StasK Mar 02 '14 at 03:04
  • It's twice the trace of the ratio of two things, both of which are going to be equal to the information matrix for maximum-likelihood fits and thus give $2p$ as the special case for AIC. (Or something like that - it's going to take me a while to read this.) – Scortchi - Reinstate Monica Mar 02 '14 at 13:33
  • I try to avoid variable selection in general, and in this setting variable selection is particularly complicated. – Frank Harrell Mar 04 '14 at 22:53
  • Thanks, @FrankHarrell. I am afraid this is unavoidable for me in this project. – StasK Mar 04 '14 at 23:38
  • Would you mind explaining why it is unavoidable, since variable selection does not help with the "too many variables, too little sample size" problem? – Frank Harrell Mar 05 '14 at 12:29
  • Thanks @FrankHarrell -- referee requests, as usual. The kitchen-sink regression does not show anything at all, and everybody (both the other authors on the paper and the referees) wants to see a tight model that would explain *something*. – StasK Mar 05 '14 at 16:38
  • That is as bad as it gets. – Frank Harrell Mar 05 '14 at 23:01
  • Have you considered treating this as a Bayesian inference problem? Firth logistic regression is equivalent to MAP with a Jeffreys prior. You could use the full Laplace approximation to evaluate marginal likelihoods, which is like an adjusted BIC (similar to AICc). – probabilityislogic Mar 09 '14 at 01:38
  • Maybe a dumb question, but if you know what variable is causing perfect separation, why use any other variable at all? Isn't it like having a perfect fit? – user603 Apr 02 '15 at 15:39
  • @user, because such variables usually predict only a handful of cases, and that is irreproducible: the true probability for that cell may be close to, say, 90%, but with only two cases in it you will get two ones 81% of the time. – StasK Apr 04 '15 at 12:53
  • Link to download the K&K (1996) paper, found on Google Scholar: http://bemlar.ism.ac.jp/zhuang/Refs/Refs/kitagawa1996biometrika.pdf – Alecos Papadopoulos May 30 '15 at 00:40
  • What did you end up doing for this problem? I would think yes, substitute with the penalised likelihood in AIC since it's part of parameter selection. I can't imagine you could leave it out. I don't know if there are better ways to incorporate it though. – Margalit Jul 12 '17 at 22:39
  • @Margalit: I don't think this answers the "too many variables, not enough samples" issue. I think you are suggesting ways to try to avoid overfitting. – Michael R. Chernick Jul 12 '17 at 23:17

1 Answer


If you want to justify the use of BIC: you can replace the maximum likelihood with the maximum a posteriori (MAP) estimate, and the resulting 'BIC'-type criterion remains asymptotically valid (as the sample size $n \to \infty$). As mentioned by @probabilityislogic, Firth's logistic regression is equivalent to using a Jeffreys prior (so what you obtain from your regression fit is the MAP).
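
As a sketch (reusing the hypothetical `fit_firth` helper from the question above; the AIC-style plug-in is shown only for comparison and is less clearly justified, as the comments note):

```python
import numpy as np

# X: (n, k) design matrix including an intercept column; y: (n,) 0/1 outcomes
beta_hat, pen_loglik = fit_firth(X, y)       # MAP fit under the Jeffreys prior

n, k = X.shape
bic_map = -2.0 * pen_loglik + k * np.log(n)  # 'BIC'-type criterion evaluated at the MAP
aic_map = -2.0 * pen_loglik + 2.0 * k        # analogous 'AIC'-type plug-in (less clearly justified)
```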

The BIC is a pseudo-Bayesian criterion which is (roughly) derived from a Laplace approximation (a Taylor series expansion of the log-integrand) of the marginal likelihood $$p_y(y) = \int L(\theta; y)\pi(\theta)\,\mathrm{d}\theta$$ around the maximum likelihood estimate $\hat{\theta}$. It thus ignores the prior, but the effect of the latter vanishes as the information concentrates in the likelihood.
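
Sketching that argument with $k$ parameters (the standard Laplace-approximation steps, not quoted from a specific reference): $$\log p_y(y) \approx \ell(\hat{\theta}) + \log \pi(\hat{\theta}) + \frac{k}{2}\log(2\pi) - \frac{1}{2}\log\left|-\nabla^{2}\ell(\hat{\theta})\right|.$$ Because $\left|-\nabla^{2}\ell(\hat{\theta})\right|$ grows like $n^{k}$, keeping only the terms that grow with $n$ gives $-2\log p_y(y) \approx -2\,\ell(\hat{\theta}) + k\log n$, i.e. the BIC; the prior enters only through the $O(1)$ term $\log\pi(\hat{\theta})$, which is why it drops out.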

As a side remark, Firth's correction also removes the first-order ($O(n^{-1})$) bias of the maximum likelihood estimator in exponential families.
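
(For reference, the penalized objective can be written as $$\ell^{*}(\theta) = \ell(\theta) + \tfrac{1}{2}\log\bigl|I(\theta)\bigr|,$$ which is the log-likelihood plus the log of the Jeffreys prior density up to an additive constant; its maximizer solves Firth's modified score equations, whose solution has the $O(n^{-1})$ bias term removed.)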

lbelzile