
My question is: why don't we use least squares to fit generalized linear model (GLM) parameters, and instead always use maximum likelihood?

Zach007
  • To a large extent this really boils down to "why do people use maximum likelihood to estimate parameters" to which a number of other questions may be relevant. Some of the answers at [MLE - why is it used ...](http://stats.stackexchange.com/questions/183006/maximum-likelihood-estimation-why-it-is-used-despite-being-biased-in-many-cas) may be of some value to you (e.g. Richard's and John's answers explain some of the attractions of estimation by maximizing likelihood), though a more specific answer for the case of GLMs would be useful. – Glen_b Dec 21 '16 at 03:52
  • Two points worth noting: 1) In the common case of Gaussian errors, [least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) **is** the MLE. 2) In MLE for GLM, [(iterative) least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) **is** commonly used "under the hood". – GeoMatt22 Dec 21 '16 at 05:07
  • Thanks for your answers, but @GeoMatt22 why don't we use least squares instead of iterative least squares? – Zach007 Dec 21 '16 at 05:15
  • One reason is that often it simply will not work; tell me how to use least squares to fit a logistic regression! – kjetil b halvorsen Sep 29 '17 at 20:43
  • @kjetilbhalvorsen See below - GLMs are usually fit using iteratively reweighted least squares... – Tom Wenseleers Jun 11 '19 at 21:59
  • @Tom but that is a numerical method for maximum likelihood, see my answer https://stats.stackexchange.com/questions/236676/can-you-give-a-simple-intuitive-explanation-of-irls-method-to-find-the-mle-of-a/237384#237384. I took the question as asking about using LS directly, as an alternative to maximum likelihood. It's not clear how that could be done. – kjetil b halvorsen Jun 11 '19 at 22:48
  • @kjetilbhalvorsen Ha, OK, no worries, I see where you were coming from! Yes, in one step one cannot fit ML models using LS, of course. At best it would be a single-step Newton-Raphson / Fisher scoring approximation to the true ML solution (which in many cases can still be quite good if you use a sensible initialisation for the weights, e.g. 1/(y+1) for Poisson)... – Tom Wenseleers Jun 11 '19 at 23:19
  • this SE Q&A might be of interest too https://stats.stackexchange.com/questions/326350 – jld Jun 12 '19 at 00:16
  • From a practical standpoint, I have usually found that least squares is not only OK but also much faster, and the solution is not meaningfully different from what a maximum likelihood solution would give in most cases - as long as the variables are transformed to a form that gives your desired effect, such as non-negative or range-constrained weights. There is nothing hard and fast about a maximum likelihood estimator. The truly sound solution would be a minimum-variance unbiased estimator; maximum likelihood is just an approximation, just as least squares would be another approximation. – uday Jun 12 '19 at 12:43
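To make GeoMatt22's first point above concrete, here is a minimal sketch with hypothetical simulated data: with Gaussian errors, least squares and maximum likelihood coincide, so lm and a Gaussian-family glm return identical coefficient estimates.

# Hypothetical simulated data illustrating that, with Gaussian errors, ordinary
# least squares IS the maximum likelihood estimator: lm() and a Gaussian-family
# glm() (fit by maximum likelihood / IWLS) give identical coefficients.
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
coef(lm(y ~ x))                      # least squares
coef(glm(y ~ x, family = gaussian))  # maximum likelihood for the Gaussian GLM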

2 Answers


GLMs ARE usually fit using iteratively reweighted least squares (IRLS); see here and the references listed there, and this post! This method maximizes the likelihood via Fisher scoring, which is a variant of Newton-Raphson. Using least squares in a single step, one could only approximate the true ML solution; that would amount to taking a single step of the Fisher scoring algorithm. In practice, the solution obtained that way can still be quite good, especially if you use sensible initialisations (e.g. using 1/(y+1) as approximate 1/variance weights in the weighted least squares regression if you are dealing with Poisson noise, where mean = variance). See this post for an example.

A minimal implementation would be:

glm.irls = function(X, y, family=binomial, maxit=25, tol=1e-08, beta.start=rep(0,ncol(X))) {
    if (is.function(family)) family <- family()   # allow family=binomial as well as family=binomial()
    beta = beta.start
    for(j in 1:maxit)
    {
      eta    = X %*% beta                                 # linear predictor
      g      = family$linkinv(eta)                        # fitted means mu = g^-1(eta)
      gprime = family$mu.eta(eta)                         # d mu / d eta
      z      = eta + (y - g) / gprime                     # working (linearised) response
      W      = as.vector(gprime^2 / family$variance(g))   # working weights, proportional to 1/Var(z)
      betaold   = beta
      beta      = as.matrix(coef(lm.wfit(x=X, y=z, w=W)),ncol=1) # regular weighted LS fit = solve(crossprod(X,W*X), crossprod(X,W*z))
      if(sqrt(crossprod(beta-betaold)) < tol) break       # stop when the coefficients no longer change
    }
    list(coefficients=beta, iterations=j, z=z, weights=W, X=X, y=y,
         wlmfit=lm(z~1+X[,-1], weights=W))                # final weighted LS fit, returned for convenience
}

If you use a distribution with an identity link, you can see that in the algorithm above z = y, and each iteration just comes down to a weighted least-squares regression with 1/variance weights. For a Poisson model, for example, one would then use initial weights 1/(y + small epsilon) and iterate based on the predicted yhat, at which point the weights become 1/yhat. With Gaussian errors (i.e. regular OLS regression) the weights are all equal to 1, which means you get your solution in one iteration, as there are no weights to optimize.
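As an illustration of the Poisson case (a sketch with hypothetical simulated data): a single weighted least-squares fit with approximate 1/variance weights 1/(y+1) already gets close to the full IRLS / maximum-likelihood fit of an identity-link Poisson GLM.

# Hypothetical simulated data: identity-link Poisson GLM vs a single weighted
# least-squares step using 1/(y+1) as approximate 1/variance weights.
set.seed(1)
x <- runif(200, 1, 10)
y <- rpois(200, lambda = 2 + 3 * x)                      # mean is linear in x, so identity link is sensible
fit_ml  <- glm(y ~ x, family = poisson(link = "identity"), start = c(1, 1))  # full IRLS / ML fit
fit_wls <- lm(y ~ x, weights = 1 / (y + 1))              # one weighted LS step
cbind(ML = coef(fit_ml), one_step_WLS = coef(fit_wls))   # close, though not identical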

Example for logistic regression:

data("Contraception",package="mlmRev")
R_GLM = glm(use ~ age + I(age^2) + urban + livch,
            family=binomial, x=T, data=Contraception)
IRLS_GLM = glm.irls(X=R_GLM$x, y=R_GLM$y, family=binomial)
print(data.frame(R_GLM=coef(R_GLM), IRLS_GLM=coef(IRLS_GLM))) # coefficients match with glm output
                   R_GLM     IRLS_GLM
(Intercept) -0.949952124 -0.949952124
age          0.004583726  0.004583726
I(age^2)    -0.004286455 -0.004286455
urbanY       0.768097459  0.768097459
livch1       0.783112821  0.783112821
livch2       0.854904050  0.854904050
livch3+      0.806025052  0.806025052
Tom Wenseleers

I can think of a couple reasons: one theoretical, the other practical.

The theoretical reason is that we want to use maximum likelihood estimation (MLE). Not only does this give us a principled way to pose the model-fitting procedure as an optimization problem; there are also theoretical results that hold only when we use MLE, such as Wilks' theorem, which gives rise to the likelihood-ratio test and the analysis of deviance, or the Wald test, which gives us a way to test the significance of parameters in a generalized linear model. All of these tools go away if we don't use maximum likelihood estimation.
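As a sketch of what those tools look like in practice (hypothetical simulated data; standard R functions): the analysis of deviance / likelihood-ratio test and the Wald tests come directly out of the ML fit.

# Hypothetical simulated data: inference tools that rest on the ML fit of a GLM.
set.seed(7)
x1 <- rnorm(300); x2 <- rnorm(300)
y  <- rbinom(300, 1, plogis(-0.5 + 1.2 * x1))   # x2 has no true effect
fit0 <- glm(y ~ x1,      family = binomial)
fit1 <- glm(y ~ x1 + x2, family = binomial)
anova(fit0, fit1, test = "Chisq")    # likelihood-ratio test / analysis of deviance
summary(fit1)$coefficients           # Wald z-tests for the individual coefficients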

However, there is also a very practical answer: the squared-error loss function is not always convex for GLMs. For example, it is easy to show that this is the case for logistic regression. You can even see the loss function start to curve downward near 0 and 1:

[Figure: squared-error loss for logistic regression, curving downward (non-convex) near predictions of 0 and 1]

So even if we wanted to use MSE as an atheoretical loss function, this non-convexity could cause problems during fitting!
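A quick numerical check of that non-convexity (a sketch; the grid and the response value y = 1 are arbitrary choices): the squared-error loss (sigmoid(eta) - 1)^2, viewed as a function of the linear predictor eta, has a second derivative that changes sign, whereas the log-loss used by maximum likelihood stays convex.

# Sketch: squared-error loss for a logistic regression observation with y = 1,
# as a function of the linear predictor eta. The numerical second derivative
# changes sign, so the loss is non-convex in eta (the log-loss -log(plogis(eta))
# is convex by contrast).
eta <- seq(-6, 6, by = 0.01)                 # grid of linear-predictor values
sq  <- (plogis(eta) - 1)^2                   # squared-error loss for y = 1
d2  <- diff(sq, differences = 2) / 0.01^2    # numerical second derivative
range(d2)                                    # spans negative and positive values => non-convex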

olooney