
Why is the probit model not as popular as logistic regression for binary classification in the machine learning community? It is hardly mentioned, if at all, in serious textbooks on the topic.

abkg
  • I suspect that there may be three possible reasons: (a) using log-odds is easier to motivate and explain; (b) the calculations with log-odds are simpler; and (c) there is often not much practical difference in the final conclusions (even though the logistic distribution has higher kurtosis than the normal and coefficients may be effectively rescaled) – Henry Feb 06 '22 at 01:11
    The main reason probit regression is used in *statistics* (except for a few traditional uses) is to do some more complicated modelling on a Normal latent variable -- eg panel data models, or multivariate outcomes, or measurement error in predictors. There hasn't been much call for these in ML, so no reason to pay the cost of evaluating the probit link function. – Thomas Lumley Feb 06 '22 at 01:58
    @ThomasLumley That sounds like the beginning of a really interesting answer! – Dave Feb 14 '22 at 19:42
    See [this old question about probit/logit](https://stats.stackexchange.com/questions/20523/difference-between-logit-and-probit-models) with many answers. – kjetil b halvorsen Feb 15 '22 at 02:24

1 Answer


The probit link arose, in ancient times, from the idea of a latent continuous variable with a Normal distribution. This was natural in toxicology, for example, where you might think about the reason some flies died and others survived in terms of differences in individual sensitivity drawn from some distribution.

In maths: $$Y^*=\alpha+\beta X+\epsilon$$ with $\epsilon\sim N(0,1)$, followed by $Y=\mathbb{1}\{Y^*>0\}$, gives $$\Phi^{-1}\left(P(Y=1\mid X=x)\right)=\alpha+\beta x$$
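As a sanity check, here is a minimal simulation sketch in Python (the values of $\alpha$ and $\beta$ are made up for illustration) confirming that thresholding the latent Normal variable at zero reproduces $\Phi(\alpha+\beta x)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, beta = -0.5, 1.2  # illustrative values, not from the answer
x = 0.8

# Latent variable Y* = alpha + beta*x + eps with eps ~ N(0,1),
# thresholded at zero to give the binary outcome Y.
eps = rng.standard_normal(1_000_000)
y = (alpha + beta * x + eps > 0).astype(int)

print(y.mean())                    # empirical P(Y = 1 | X = x)
print(norm.cdf(alpha + beta * x))  # Phi(alpha + beta*x); should match closely
```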

Back then, looking up a probit wasn't that much slower than computing a logit (on a mechanical calculator or slide rule), so there wasn't much computational difference, and a Normal latent variable seemed natural.

The logit link was known to not be all that different from the probit, apart from a scale factor in the coefficients of something like $\pi/\sqrt{3}$, so there wasn't much need to have both models lying around. Some fields used probits; others used logits (eg, epidemiology, because of the nice properties of the odds ratio under case-control sampling and the arguably simpler coefficient interpretation).
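A quick numerical check of that claim, in Python: rescale the probit's linear predictor by the standard logistic SD, $\pi/\sqrt{3}$, and the two curves agree to within about 0.02 everywhere:

```python
import numpy as np
from scipy.stats import norm, logistic

# Compare Phi(eta) with a logistic CDF whose argument is rescaled by
# pi/sqrt(3), the standard deviation of the standard logistic distribution.
eta = np.linspace(-4, 4, 81)
probit = norm.cdf(eta)
logit_rescaled = logistic.cdf(eta * np.pi / np.sqrt(3))

print(np.max(np.abs(probit - logit_rescaled)))  # roughly 0.02 at worst
```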

When we got to generalised linear models and computers, the probit was a bit inconvenient: logs and exponentials are going to be readily available in your favourite programming language, but the Normal quantile function and CDF may not be. There are also speed issues: log and exp became available in hardware floating-point units, and later on GPUs, while the Normal CDF and quantile never got the same treatment.
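To make the availability point concrete, a minimal Python illustration: the inverse logit needs only exp, while the inverse probit needs the error function, which arrived in standard libraries much later (C99 for libm, Python 3.2 for math.erf):

```python
import math

def inv_logit(eta):
    # Inverse logit: only exp is required, cheap and universally available.
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    # Inverse probit Phi(eta), written via the error function; erf was
    # standardized only in C99 and added to Python's math module in 3.2.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

print(inv_logit(1.0), inv_probit(1.0))  # ~0.731 and ~0.841
```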

Because the logit and probit are not very different, it's hard to find applications where it makes a lot of practical difference which one you use. The main sanctuary for the endangered probit model was in settings where the Normal latent variable makes calculations easier.

For example, Charles McCulloch fitted random-effects probit models by writing the model in terms of latent variables: $$Y^*=\alpha+\beta X+u+\epsilon$$ with $\epsilon\sim N(0,1)$ and $u\sim N(0,\tau^2)$, followed by $Y=\mathbb{1}\{Y^*>0\}$, which gives $$\Phi^{-1}\left(P(Y=1\mid X=x, U=u)\right)=\alpha+\beta x+u$$ There's a clever EM algorithm for fitting this model, treating $u$ as missing data, where the M-step looks like a linear mixed model and the E-step samples the latent variables from their distribution conditional on the observed variables.
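As a rough illustration of that scheme, here is a stochastic-EM toy in Python, with one Gibbs sweep standing in for the E-step, on simulated data; this is a sketch of the general idea, not McCulloch's exact algorithm, and all numbers are made up:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)

# Simulate grouped data from the model with (made-up) truth
# alpha = -0.5, beta = 1.0, tau^2 = 1.0.
n_groups, n_per = 200, 10
group = np.repeat(np.arange(n_groups), n_per)
x = rng.standard_normal(n_groups * n_per)
u_true = rng.standard_normal(n_groups)
y = (-0.5 + 1.0 * x + u_true[group] + rng.standard_normal(x.size) > 0).astype(int)

alpha, beta, tau2 = 0.0, 0.0, 1.0
u = np.zeros(n_groups)
X = np.column_stack([np.ones_like(x), x])

for _ in range(50):
    # E-step (one Gibbs sweep as a cheap stand-in for the expectation):
    # Y* | Y, u is Normal around the linear predictor, truncated at zero.
    m = alpha + beta * x + u[group]
    lo = np.where(y == 1, -m, -np.inf)  # standardized truncation bounds
    hi = np.where(y == 1, np.inf, -m)
    ystar = truncnorm.rvs(lo, hi, loc=m, scale=1.0, random_state=rng)
    # u | Y*, params is Normal by the usual Normal-Normal conjugacy.
    resid = ystar - alpha - beta * x
    prec = n_per + 1.0 / tau2
    u = (np.bincount(group, weights=resid) / prec
         + rng.standard_normal(n_groups) / np.sqrt(prec))

    # M-step: a linear model for (alpha, beta); a variance for tau^2.
    (alpha, beta), *_ = np.linalg.lstsq(X, ystar - u[group], rcond=None)
    tau2 = np.mean(u ** 2)

print(alpha, beta, tau2)  # should drift toward the simulated truth
```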

It's easier to do this sort of thing with a probit link because the latent Normal distribution of $\epsilon$ has simple convolutions and conditional distributions when combined with other Normal latent variables. And since the model won't be very different from a rescaling of a logit model, it still makes sense to use the probit model in settings where you'd otherwise want a logit model.
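For example, marginalising over the random intercept is a one-line convolution: since $u+\epsilon\sim N(0,1+\tau^2)$, $$P(Y=1\mid X=x)=\Phi\!\left(\frac{\alpha+\beta x}{\sqrt{1+\tau^2}}\right),$$ whereas no comparably clean closed form exists when $\epsilon$ has a logistic distribution.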

Machine learning doesn't seem to have a lot of models where it's necessary to do this sort of clever maths with latent variables, so there's not much to weigh against the computational and interpretational advantages of the logit link.

Thomas Lumley