
In a small data set ($n \sim 100$) that I am working with, several variables give me perfect prediction/separation. I thus use Firth logistic regression to deal with the issue.

If I select the best model by AIC or BIC, should I include the Firth penalty term in the likelihood when computing these information criteria?
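
For concreteness, Firth's method maximizes the Jeffreys-penalized log-likelihood $\ell(\beta) + \tfrac{1}{2}\log\left|X^\top W X\right|$, where $W = \operatorname{diag}\{\pi_i(1-\pi_i)\}$. Below is a minimal hand-rolled sketch of that objective in Python (illustrative only; the helper names and the use of `scipy.optimize` are my own choices, not any particular package's API):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def firth_penalized_loglik(beta, X, y):
    """Firth-penalized log-likelihood: l(beta) + 0.5 * log|X'WX|."""
    eta = X @ beta
    # numerically stable Bernoulli log-likelihood: y*eta - log(1 + exp(eta))
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    p = expit(eta)
    W = p * (1.0 - p)                      # diagonal of the logistic weight matrix
    info = X.T @ (W[:, None] * X)          # Fisher information X'WX
    _, logdet = np.linalg.slogdet(info)    # log|X'WX|, the Jeffreys penalty
    return loglik + 0.5 * logdet

def fit_firth(X, y):
    """Maximize the penalized log-likelihood (= MAP under a Jeffreys prior)."""
    k = X.shape[1]
    res = minimize(lambda b: -firth_penalized_loglik(b, X, y),
                   x0=np.zeros(k), method="BFGS")
    return res.x, -res.fun                 # coefficients, maximized penalized log-lik
```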

StasK
  • Good question! Both are derived assuming maximum-likelihood fits, so you can't use the unpenalized likelihood; but I'm not sure that simply substituting it with the penalized likelihood is the answer. – Scortchi - Reinstate Monica Mar 01 '14 at 21:31
  • Konishi & Kitagawa (1996), "Generalized information criteria in model selection", *Biometrika*, **83**, 4 might be helpful. – Scortchi - Reinstate Monica Mar 01 '14 at 21:51
  • Just for reference, it's [here](http://biomet.oxfordjournals.org/content/83/4/875.abstract), but I don't have access -- is that the one where they proposed the trace of information matrix (or something like that) as the penalty term? I remember working with their idea like 15 years ago in mixture modeling. – StasK Mar 02 '14 at 03:04
  • It's twice the trace of the ratio of two things, both of which are going to be equal to the information matrix for maximum-likelihood fits and thus give $2p$ as the special case for AIC. (Or something like that - it's going to take me a while to read this.) – Scortchi - Reinstate Monica Mar 02 '14 at 13:33
  • I try to avoid variable selection in general, and in this setting variable selection is particularly complicated. – Frank Harrell Mar 04 '14 at 22:53
  • Thanks, @FrankHarrell. I am afraid this is unavoidable for me in this project. – StasK Mar 04 '14 at 23:38
  • Would you mind explaining why it is unavoidable, since variable selection does not help with the "too many variables, too little sample size" problem? – Frank Harrell Mar 05 '14 at 12:29
  • Thanks @FrankHarrell -- referee requests, as usual. The kitchen-sink regression does not show anything at all, and everybody (both the other authors on the paper and the referees) wants to see a tight model that would explain *something*. – StasK Mar 05 '14 at 16:38
  • That is as bad as it gets. – Frank Harrell Mar 05 '14 at 23:01
  • Have you considered treating this as a Bayesian inference problem? Firth logistic regression is equivalent to MAP with a Jeffreys prior. You could use the full Laplace approximation to evaluate marginal likelihoods, which is like an adjusted BIC (similar to AICc). – probabilityislogic Mar 09 '14 at 01:38
  • Maybe a dumb question, but if you know what variable is causing perfect separation, why use any other variable at all? Isn't it like having a perfect fit? – user603 Apr 02 '15 at 15:39
  • @user, because such variables usually predict only a handful of cases, and that is irreproducible: the true probability for that cell may be close to, say, 90%, but with only two cases in it you will get two ones 81% of the time. – StasK Apr 04 '15 at 12:53
  • Link to download the K&K (1996) paper, found on Google Scholar: http://bemlar.ism.ac.jp/zhuang/Refs/Refs/kitagawa1996biometrika.pdf – Alecos Papadopoulos May 30 '15 at 00:40
  • What did you end up doing for this problem? I would think yes, substitute with the penalised likelihood in AIC since it's part of parameter selection. I can't imagine you could leave it out. I don't know if there are better ways to incorporate it though. – Margalit Jul 12 '17 at 22:39
  • @Margalit: I don't think this answers the "too many variables, not enough samples" issue. I think you are suggesting ways to try to avoid overfitting. – Michael R. Chernick Jul 12 '17 at 23:17

1 Answer


If you want to justify the use of BIC: you can replace the maximum likelihood with the maximum a posteriori (MAP) estimate, and the resulting 'BIC'-type criterion remains asymptotically valid (as the sample size $n \to \infty$). As mentioned by @probabilityislogic, Firth's logistic regression is equivalent to using a Jeffreys prior (so what you obtain from your regression fit is the MAP).
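
As a sketch (reusing the hypothetical `fit_firth` helper from the question above; the AIC-style plug-in is shown only for comparison and is less clearly justified, as the comments note):

```python
import numpy as np

# X: (n, k) design matrix including an intercept column; y: (n,) 0/1 outcomes
beta_hat, pen_loglik = fit_firth(X, y)       # MAP fit under the Jeffreys prior

n, k = X.shape
bic_map = -2.0 * pen_loglik + k * np.log(n)  # 'BIC'-type criterion evaluated at the MAP
aic_map = -2.0 * pen_loglik + 2.0 * k        # analogous 'AIC'-type plug-in (less clearly justified)
```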

The BIC is a pseudo-Bayesian criterion which is (roughly) derived from a Laplace approximation (a Taylor series expansion of the log-integrand) of the marginal likelihood $$p_y(y) = \int L(\theta; y)\pi(\theta)\,\mathrm{d}\theta$$ around the maximum likelihood estimate $\hat{\theta}$. It thus ignores the prior, but the effect of the latter vanishes as the information concentrates in the likelihood.
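
Sketching that argument with $k$ parameters (the standard Laplace-approximation steps, not quoted from a specific reference): $$\log p_y(y) \approx \ell(\hat{\theta}) + \log \pi(\hat{\theta}) + \frac{k}{2}\log(2\pi) - \frac{1}{2}\log\left|-\nabla^{2}\ell(\hat{\theta})\right|.$$ Because $\left|-\nabla^{2}\ell(\hat{\theta})\right|$ grows like $n^{k}$, keeping only the terms that grow with $n$ gives $-2\log p_y(y) \approx -2\,\ell(\hat{\theta}) + k\log n$, i.e. the BIC; the prior enters only through the $O(1)$ term $\log\pi(\hat{\theta})$, which is why it drops out.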

As a side remark, Firth's correction also removes the first-order ($O(n^{-1})$) bias of the maximum likelihood estimator in exponential families.
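
(For reference, the penalized objective can be written as $$\ell^{*}(\theta) = \ell(\theta) + \tfrac{1}{2}\log\bigl|I(\theta)\bigr|,$$ which is the log-likelihood plus the log of the Jeffreys prior density up to an additive constant; its maximizer solves Firth's modified score equations, whose solution has the $O(n^{-1})$ bias term removed.)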

lbelzile