21

I'm curious why we treat fitting GLMs as though it were some special optimization problem. Is it? It seems to me that it's just maximum likelihood: we write down the likelihood and then ... we maximize it! So why do we use Fisher scoring instead of any of the myriad optimization schemes that have been developed in the applied math literature?

kjetil b halvorsen
Andrew Robinson
  • As far as I understand, it's got to do with the fact that the algorithm based on Fisher scoring (which uses the expected Hessian) does not need starting estimates of the coefficient vector, unlike regular Newton-Raphson (which uses the observed Hessian). This makes Fisher scoring much easier to use. But some implementations use hybrid algorithms, starting with IRLS and then switching to Newton-Raphson. See section 3.4 in the book by Hardin & Hilbe, http://gen.lib.rus.ec/search.php?req=Generalized%20Linear%20Models%20and%20Extensions&lg_topic=libgen&open=0&view=simple&res=25&phrase=1&column=def – Tom Wenseleers Jan 21 '20 at 16:03
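
In symbols (a sketch, using the standard notation $\ell(\beta)$ for the log-likelihood), the distinction drawn in the comment above is between the Newton-Raphson update, which uses the observed Hessian,

$$\beta^{(t+1)} = \beta^{(t)} - \left[\frac{\partial^2 \ell}{\partial\beta\,\partial\beta^\top}\bigg|_{\beta^{(t)}}\right]^{-1} \nabla\ell\big(\beta^{(t)}\big),$$

and the Fisher scoring update, which replaces the negative Hessian by its expectation, the Fisher information $\mathcal{I}(\beta) = \mathbb{E}\left[-\,\partial^2\ell/\partial\beta\,\partial\beta^\top\right]$:

$$\beta^{(t+1)} = \beta^{(t)} + \mathcal{I}\big(\beta^{(t)}\big)^{-1}\,\nabla\ell\big(\beta^{(t)}\big).$$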

2 Answers

14

Fisher's scoring is just a version of Newton's method that happens to be identified with GLMs; there's nothing particularly special about it, other than the fact that the Fisher information matrix happens to be rather easy to find for random variables in the exponential family. It also ties into a lot of other math-stat material that tends to come up around the same time, and gives a nice geometric intuition about what exactly Fisher information means.
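
To make "rather easy to find" concrete: for a GLM with design matrix $X$, linear predictor $\eta = X\beta$, and mean $\mu_i = g^{-1}(\eta_i)$, the expected information takes a weighted least-squares form (a sketch in standard GLM notation),

$$\mathcal{I}(\beta) = X^\top W X, \qquad W = \operatorname{diag}\!\left(\frac{(\partial\mu_i/\partial\eta_i)^2}{\operatorname{Var}(y_i)}\right),$$

so each scoring step is just a weighted least-squares solve on the working response $z_i = \eta_i + (y_i - \mu_i)\,\partial\eta_i/\partial\mu_i$. This is why Fisher scoring for GLMs is the same algorithm as IRLS.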

There's absolutely no reason I can think of not to use some other optimizer if you prefer, other than that you might have to code it by hand rather than use a pre-existing package. I suspect that any strong emphasis on Fisher scoring is a combination of (in order of decreasing weight) pedagogy, ease-of-derivation, historical bias, and "not-invented-here" syndrome.
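
For instance, nothing stops you from fitting a logistic regression by handing the negative log-likelihood to a generic quasi-Newton routine. A minimal sketch, assuming NumPy and SciPy are available (the data and function names here are purely illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy logistic-regression data.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def neg_loglik(beta):
    """Negative Bernoulli log-likelihood with logit link."""
    eta = X @ beta
    # sum of log(1 + exp(eta)) - y*eta, computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def neg_score(beta):
    """Gradient of the negative log-likelihood: X'(mu - y)."""
    mu = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ (mu - y)

# Any general-purpose optimizer will do; BFGS is a common default.
fit = minimize(neg_loglik, x0=np.zeros(p), jac=neg_score, method="BFGS")
print(fit.x)  # should agree with an IRLS/Fisher-scoring fit to tolerance
```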

Rich
  • I don't think this is quite correct - the IRLS algorithm uses the expected Hessian, whereas Newton-Raphson uses the observed Hessian - see http://gen.lib.rus.ec/search.php?req=Generalized%20Linear%20Models%20and%20Extensions&lg_topic=libgen&open=0&view=simple&res=25&phrase=1&column=def for a detailed comparison of the two algorithms... – Tom Wenseleers Aug 27 '19 at 20:58
  • @TomWenseleers Could you perhaps elaborate in an answer? Does this mean the algorithmic complexity of beta regression is not the reason it is treated as a separate issue from GLMs? – Frans Rodenburg Jan 21 '20 at 02:29
  • @Frans Rodenburg Not so well into beta regression, but I believe that the standard IRLS method only works for single-parameter distributions from the exponential family, whereas the beta distribution is a two-parameter exponential-family distribution... See https://stats.stackexchange.com/questions/304538/why-beta-dirichlet-regression-are-not-considered-generalized-linear-models Cox proportional hazards and negative binomial models each have an additional parameter too, though, and they can be fit using a modified IRLS algo, so I'm not sure really... – Tom Wenseleers Jan 21 '20 at 11:17
  • Another advantage of using Fisher scoring / IRLS with the expected Hessian is that the algorithm is much easier to initialize - see section 3.4 in Hardin & Hilbe's book. This contrasts with Newton-Raphson, where you need an initial guess of the coefficient vector, which is a little difficult... Sometimes people use hybrid algorithms, starting with IRLS with Fisher scoring and then after some iterations switching to regular Newton-Raphson... – Tom Wenseleers Jan 21 '20 at 11:21
  • Another good point about Fisher scoring is that the expected Fisher information is always positive (semi-)definite, whereas the observed information (minus the second derivative of the log-likelihood) need not be. For typical GLMs this isn't a big issue, but for parametric survival models there is a real problem that the observed information need not be positive semidefinite. – Thomas Lumley Jun 18 '20 at 09:21
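
To spell out the positive semi-definiteness point in the last comment (a sketch, in the GLM notation above): the expected information is $\mathcal{I}(\beta) = X^\top W X$ with $W$ diagonal and non-negative, so for any vector $v$

$$v^\top X^\top W X\, v = \big\|W^{1/2} X v\big\|^2 \ge 0,$$

whereas the observed information $-\,\partial^2\ell/\partial\beta\,\partial\beta^\top$ contains data-dependent terms that can make it indefinite away from the maximum.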
9

It's historical and pragmatic: Nelder and Wedderburn reverse-engineered GLMs as the set of models where you can find the MLE using Fisher scoring (i.e. Iteratively Reweighted Least Squares). The algorithm came before the models, at least in the general case.

It's also worth remembering that IWLS was what they had available back in the early 70s, so GLMs were an important class of models to know about. The fact that you can maximize GLM likelihoods reliably using Newton-type algorithms (they typically have unique MLEs) also meant that programs like GLIM could be used by those without skills in numerical optimization.

guest
  • I don't think this is quite correct - the IRLS algorithm uses the expected Hessian, whereas Newton-Raphson uses the observed Hessian - see http://gen.lib.rus.ec/search.php?req=Generalized%20Linear%20Models%20and%20Extensions&lg_topic=libgen&open=0&view=simple&res=25&phrase=1&column=def for a detailed comparison of the two algorithms... – Tom Wenseleers Aug 27 '19 at 20:59
  • @TomWenseleers I've been wondering about it as well, would be cool to have something on this – Firebug Dec 09 '20 at 19:45
  • @Firebug You can take a look in Hardin & Hilbe's book "Generalized Linear Models and Extensions", https://siteget.net/o.php?b=4&pv=4&mobile=&u=http%3A%2F%2Flibrary.lol%2Fmain%2F1D9541860ED294EE1CD06F70266971F6 (Sections 3.4, 3.6 & 3.7, Listings 3.1 & 3.2, and Section 5.6, Listing 5.4 compare the IRLS and Newton-Raphson algorithms). The IRLS algo has the advantage of being much easier to initialise. – Tom Wenseleers Dec 09 '20 at 21:30
  • @Firebug That's because the IRLS algo can be initialized with an initial guess of the fitted values (which is easy), whereas Newton-Raphson has to be initialized with initial guesses of the coefficients to be estimated (which is very hard). Sometimes people use combined algorithms, where they start with IRLS, and then once they have good coefficient estimates switch to Newton-Raphson... – Tom Wenseleers Dec 09 '20 at 21:32
  • @Firebug For a simple bare-bones IRLS implementation see http://bwlewis.github.io/GLM/ (a minimal sketch in the same spirit appears below). – Tom Wenseleers Dec 09 '20 at 22:29
  • @TomWenseleers Have you seen this question? I tried to summarize some of what I understand of IWLS, but you can probably add some substantial (and probably more correct) information to it: https://stats.stackexchange.com/q/495033/60613 – Firebug Mar 26 '21 at 18:57
  • @Firebug Ha no didn't see that one, but your answer looks fine! – Tom Wenseleers Mar 27 '21 at 07:59
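
In the spirit of the bare-bones implementation linked in the comments, here is a minimal IRLS sketch for logistic regression, initialized from the fitted values rather than from a coefficient vector (NumPy assumed; all names are illustrative):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-10):
    """Bare-bones Fisher scoring / IRLS for a Bernoulli GLM with logit link.

    Initialization uses a crude guess of the *fitted values* (the observed
    y shrunk toward 1/2), so no starting coefficient vector is needed --
    the point made in the comments above.
    """
    mu = (y + 0.5) / 2.0            # initial fitted probabilities in (0, 1)
    eta = np.log(mu / (1 - mu))     # corresponding linear predictor
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        W = mu * (1 - mu)           # iterative weights (canonical link)
        z = eta + (y - mu) / W      # working response
        # Weighted least-squares solve: (X'WX) beta = X'Wz
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
    return beta
```

On toy data like that in the earlier sketch, irls_logistic(X, y) should agree with the BFGS fit up to optimizer tolerance.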