22

One can perform a logit regression in R with code like the following:

> library(MASS)
> data(menarche)
> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
+                                              family=binomial(logit), data=menarche)
> coefficients(glm.out)
(Intercept)         Age 
 -21.226395    1.631968

It looks like the optimization algorithm has converged; the summary output reports the number of Fisher Scoring iterations:

Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), 
    data = menarche)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0363  -0.9953  -0.4900   0.7780   1.3675  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4

I am curious: which optimization algorithm is it? Is it the Newton-Raphson algorithm (second-order gradient descent)? Can I set some parameters to use the Cauchy algorithm (first-order gradient descent) instead?

gung - Reinstate Monica
Marcin Kosiński
  • Do you mind citing where a Newton-Raphson algorithm is referred to as "gradient descent level II"? – Cliff AB Oct 24 '15 at 18:00
  • Fisher scoring itself is related to, but different from, Newton-Raphson, in effect replacing the Hessian with its expected value under the model. – Glen_b Oct 25 '15 at 00:10
  • @CliffAB sorry. I meant that `Newton's method` is a second-order gradient descent method. – Marcin Kosiński Oct 25 '15 at 15:21
  • @hxd1011 re: your bounty rationale. Which optimization algorithm(s) is (are) used by GLM in R is a different question than which optimization algorithms could or should be used to fit GLMs, and their relative merits. – Mark L. Stone Jul 08 '16 at 17:31
  • @hxd1011, you should not edit to change someone else's question. It is one thing to edit when you think you know what they mean but their question is unclear (perhaps English isn't their native language, eg), but you should not make their question *different* (eg, more general) than they had wanted. Instead, ask a new question with what you want. I am rolling the edit back. – gung - Reinstate Monica Jul 08 '16 at 21:07
  • @gung understood. Thanks for explaining, and please roll back. – Haitao Du Jul 08 '16 at 21:09
  • @Greenparker, please do not approve edits that change someone's question. It is one thing for an edit to clarify what the OP means if their question is unclear (perhaps English isn't their native language, eg), but edits should not make their question different (eg, more general) than the OP had wanted. – gung - Reinstate Monica Jul 08 '16 at 21:11
  • @EngrStudent, please do not approve edits that change someone's question. It is one thing for an edit to clarify what the OP means if their question is unclear (perhaps English isn't their native language, eg), but edits should not make their question different (eg, more general) than the OP had wanted. – gung - Reinstate Monica Jul 08 '16 at 21:11
  • @MarcinKosiński Newton's method *is* Newton-Raphson; Raphson merely built on Newton's ideas for a more general case. – AdamO Jul 08 '16 at 22:27
  • For a minimal IRLS implementation to fit GLMs, see https://stats.stackexchange.com/questions/252652/why-not-fitting-glms-with-least-squares/412582#412582, and for more extensive coverage check http://bwlewis.github.io/GLM/ – Tom Wenseleers Jun 14 '19 at 23:31
  • https://stats.stackexchange.com/questions/412580/why-is-r2-not-reported-for-glms-based-on-last-iteration-of-irls-weighted-least-s is maybe also relevant – Tom Wenseleers Jun 14 '19 at 23:31

3 Answers

20

You will be interested to know that the documentation for `glm`, accessed via `?glm`, provides many useful insights: under `method` we find that iteratively reweighted least squares is the default method for `glm.fit`, which is the workhorse function for `glm`. Additionally, the documentation mentions that user-defined fitting functions may be supplied here instead of the default.
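
For instance, a quick way to confirm this at the R prompt (a small sketch, using nothing beyond base R) is to inspect the `method` argument of `glm` and the signature of `glm.fit`:

formals(glm)$method   # "glm.fit", i.e. iteratively reweighted least squares
args(glm.fit)         # shows the arguments the IRLS workhorse accepts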

Sycorax
  • You can also just type the function name `glm` or `fit.glm` at the `R` prompt to study the source code. – Matthew Drury Oct 24 '15 at 19:03
  • @MatthewDrury I think you mean the workhorse `glm.fit` which will not be entirely reproducible since it relies on C code `C_Cdqrls`. – AdamO Jul 08 '16 at 17:39
  • Yah, you're right, I mix up the order in R a lot. What do you mean by not reproducible, though? – Matthew Drury Jul 08 '16 at 17:45
20

The method used is mentioned in the output itself: it is Fisher Scoring. This is equivalent to Newton-Raphson in most cases; the exception is situations where you are using non-natural parameterizations, such as relative risk regression, where the expected and observed information differ. (With a canonical link like the logit, the two coincide, so the algorithms take the same steps.) In general, Newton-Raphson and Fisher Scoring give nearly identical results.

Another formulation of Fisher Scoring is iteratively reweighted least squares (IRLS). To give some intuition: for models with non-uniform errors, inverse-variance weighted least squares is the "optimal" estimator according to the Gauss-Markov theorem. With GLMs, there is a known mean-variance relationship; in logistic regression, for example, the mean is $p$ and the variance is $p(1-p)$. So an algorithm is constructed by estimating the mean with a naive model, creating weights from the predicted mean, re-estimating the mean with finer precision, and iterating until convergence. This, it turns out, is Fisher Scoring. It also gives some nice intuition for the EM algorithm, which is a more general framework for estimating complicated likelihoods.
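
In symbols, one standard way to write the IRLS update (a generic GLM sketch with link $g$, variance function $V$, and working response $z$, not anything specific to R's implementation) is

$$\beta^{(t+1)} = (X^\top W X)^{-1} X^\top W z, \qquad z = \eta + (y - \mu)\, g'(\mu), \qquad W = \operatorname{diag}\!\left(\frac{1}{g'(\mu)^2\, V(\mu)}\right),$$

where $\eta = X\beta^{(t)}$, $\mu = g^{-1}(\eta)$, and for logistic regression $V(\mu) = \mu(1-\mu)$.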

The default general-purpose optimizer in R uses numerical methods to estimate a second moment, basically based on a linearization (be wary of the curse of dimensionality). So if you were interested in comparing efficiency and bias, you could implement a naive logistic maximum-likelihood routine with something like

set.seed(1234)
x <- rnorm(1000)
y <- rbinom(1000, 1, exp(-2.3 + 0.1*x)/(1+exp(-2.3 + 0.1*x)))
f <- function(b) {
  # negative log-likelihood of the logistic model with coefficients b
  p <- exp(b[1] + b[2]*x)/(1+exp(b[1] + b[2]*x))
  -sum(dbinom(y, 1, p, log=TRUE))
}
ans <- nlm(f, p=0:1, hessian=TRUE)

gives me

> ans$estimate
[1] -2.2261225  0.1651472
> coef(glm(y~x, family=binomial))
(Intercept)           x 
 -2.2261215   0.1651474 
AdamO
  • what's the difference compared to gradient descent / Newton's method / BFGS? I think you explained it, but I don't quite follow. – Haitao Du Jul 08 '16 at 17:40
  • @hxd1011 see "Numerical Methods for Unconstrained Optimization and Nonlinear Equations" Dennis, J. E. and Schnabel, R. B. – AdamO Jul 08 '16 at 17:43
  • @hxd1011 as far as I can tell, Newton-Raphson does not require or estimate a Hessian in the steps. The Newton-type method in `nlm` estimates the gradient numerically and then applies Newton-Raphson. In BFGS I think the gradient is required, as with Newton-Raphson, but successive steps are evaluated using a second-order approximation, which requires an estimate of the Hessian. BFGS is good for highly nonlinear optimizations. But for GLMs, they are usually very well behaved. – AdamO Jul 08 '16 at 17:58
  • The existing answer mentioned "iteratively reweighted least squares", how does that enter the picture, compared to the algorithms you mentioned here? – amoeba Jul 08 '16 at 23:58
  • @AdamO could you address amoeba's comments? Thanks – Haitao Du Jul 11 '16 at 15:07
  • @hxd1011 I added a section having seen this comment only recently. – AdamO Jul 11 '16 at 15:11
  • @amoeba great point! – AdamO Jul 11 '16 at 15:11
  • @AdamO Thanks. Unfortunately, with my limited knowledge, "using non-natural parameterizations" and "Gauss-Markov theorem" are still not clear to me. But I will try to read more. I guess maybe it is very hard to explain or demo such terms? – Haitao Du Jul 11 '16 at 15:18
  • @hxd1011 they are theoretical results/definitions, but relevant to the question. Most GLMs with which you're familiar are "natural" parametrizations. They are maximum likelihood for known probability models, e.g. logistic regression or Poisson regression. Non-natural parametrizations are relative risk regression or additive risk models. – AdamO Jul 11 '16 at 15:21
  • @AdamO Thanks, this is very informative. I learned logistic regression from an optimization perspective (I have a related question [here](http://stats.stackexchange.com/questions/222585/what-are-the-impacts-of-choosing-different-loss-functions-in-classification-to-a)) but not from a statistics perspective. What I learned is just minimizing the logistic loss, which is convex, using some general method such as gradient descent or Newton's method. – Haitao Du Jul 11 '16 at 15:31
  • So I was wondering - can the Fisher scoring / IRLS algorithm to solve GLMs in fact be considered an EM algorithm? – Tom Wenseleers Aug 27 '19 at 20:56
  • @TomWenseleers yep! – AdamO Aug 28 '19 at 03:53
4

For the record, a simple pure-R implementation of R's glm algorithm, based on Fisher scoring (iteratively reweighted least squares) as explained in the other answer, is given by:

glm_irls = function(X, y, weights=rep(1,nrow(X)), family=poisson(log), maxit=25, tol=1e-16) {
    if (!is(family, "family")) family = family()
    variance = family$variance
    linkinv = family$linkinv
    mu.eta = family$mu.eta
    etastart = NULL

    nobs = nrow(X)    # needed by the initialize expression below
    nvars = ncol(X)   # needed by the initialize expression below
    eval(family$initialize) # initializes n and fitted values mustart
    eta = family$linkfun(mustart) # we then initialize eta with this
    dev.resids = family$dev.resids
    dev = sum(dev.resids(y, linkinv(eta), weights))
    devold = 0
    beta_old = rep(1, nvars)

    for(j in 1:maxit)
    {
      mu = linkinv(eta) 
      varg = variance(mu)
      gprime = mu.eta(eta)
      z = eta + (y - mu) / gprime # potentially -offset if you would have an offset argument as well
      W = weights * as.vector(gprime^2 / varg)
      beta = solve(crossprod(X,W*X), crossprod(X,W*z), tol=2*.Machine$double.eps)
      eta = X %*% beta # potentially +offset if you would have an offset argument as well
      dev = sum(dev.resids(y, mu, weights))
      if (abs(dev - devold) / (0.1 + abs(dev)) < tol) break
      devold = dev
      beta_old = beta
    }
    list(coefficients=t(beta), iterations=j)
}

Example:

## Dobson (1990) Page 93: Randomized Controlled Trial :
y <- counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
X <- model.matrix(counts ~ outcome + treatment)

coef(glm.fit(x=X, y=y, family = poisson(log))) 
  (Intercept)      outcome2      outcome3    treatment2    treatment3 
 3.044522e+00 -4.542553e-01 -2.929871e-01 -7.635479e-16 -9.532452e-16

coef(glm_irls(X=X, y=y, family=poisson(log)))
     (Intercept)   outcome2   outcome3    treatment2   treatment3
[1,]    3.044522 -0.4542553 -0.2929871 -3.151689e-16 -8.24099e-16

A good discussion of GLM fitting algorithms, including a comparison with Newton-Raphson (which uses the observed Hessian, as opposed to the expected Hessian used in the IRLS algorithm) and with hybrid algorithms (which start with IRLS, since it is easier to initialize, and then finish with further optimization using Newton-Raphson), can be found in the book "Generalized Linear Models and Extensions" by James W. Hardin & Joseph M. Hilbe.
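
To make the hybrid idea concrete, here is a minimal sketch (a toy illustration reusing the Dobson example and glm_irls above, not Hardin & Hilbe's exact algorithm): warm-start a Newton-type optimizer (`nlm`) on the Poisson negative log-likelihood with the IRLS coefficients.

## Toy hybrid scheme: IRLS solution used as a warm start for a Newton-type optimizer
negll <- function(b) {
  eta <- as.vector(X %*% b)                      # linear predictor
  -sum(dpois(y, lambda = exp(eta), log = TRUE))  # Poisson negative log-likelihood (log link)
}
start <- as.vector(glm_irls(X = X, y = y, family = poisson(log))$coefficients)
fit <- nlm(negll, p = start, hessian = TRUE)
fit$estimate  # essentially reproduces the IRLS / glm.fit coefficients above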

Tom Wenseleers