
I am reading this post and am still confused about the different ways of fitting exponential data. Specifically, why am I getting different results with the following code? Could anyone help me write down the objective functions for the different models?

For lm, it is $||X\beta-\log(y)||_2^2$, but what about the glm cases?

The reason I ask for the objective functions is that much of the literature focuses on the algorithmic details of "iteratively reweighted least squares" but places little emphasis on the high-level objective.

last_14 = data.frame(rbind(
c(3460,  14,    0),
c(3558,  17,    1),
c(3802,  21,    2),
c(3988,  22,    3),
c(4262,  28,    4),
c(4615,  36,    5),
c(4720,  40,    6),
c(5404,  47,    7),
c(5819,  54,    8),
c(6440,  63,    9),
c(7126,  85,   10),
c(7905, 108,   11),
c(8733, 118,   12),
c(9867, 200,   13)))
names(last_14) = c('World', 'US', 'days')

fit_lm = lm(log(World) ~ days, last_14)
fit_glm = glm(formula = World ~ days,  data=last_14, family=gaussian(link='log'))
fit_glm2 = glm(formula = World ~ days,  data=last_14, family=poisson())

1 Answer


Summary

Linear model with least squares (Gaussian distributed observations)

fit_lm = lm(log(World) ~ days, last_14)

$$\sum_{\forall i} (\log(y_i) - X_i \beta)^2$$

Non-linear model with least squares (Gaussian distributed observations)

using GLM model (with Gaussian distribution family)

fit_glm = glm(formula = World ~ days,  data=last_14, 
family=gaussian(link='log'))

or using nonlinear least squares (NLS)

fit_nls = nls(World ~ exp(a+b*days), start = list(a = 8, b = 0.1), data=last_14)

$$\sum_{\forall i} (y_i - e^{X_i \beta})^2$$

Poisson regression (Poisson distributed observations)

using GLM model (with Poisson distribution family)

fit_glm2 = glm(formula = World ~ days,  data=last_14, family=poisson())

$$\sum_{\forall i} (e^{X_i \beta} -(X_i \beta)y_i)$$
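For reference (a short derivation added here for clarity), this objective is the Poisson negative log-likelihood up to a term that does not depend on $\beta$:

$$-\log L(\beta) = \sum_{\forall i} \left( e^{X_i \beta} - y_i X_i \beta + \log(y_i!) \right)$$

where the $\log(y_i!)$ term can be dropped from the objective because it does not involve $\beta$.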

GLM

The relationship for GLM can be written down as

$$Y_i = f( X_i \beta) + \epsilon_i$$

Sometimes people instead use the link function $f^{-1}$ to linearize the equation

$$f^{-1}(Y_i) = f^{-1}\left( f(X_i \beta) + \epsilon_i \right) \neq X_i \beta + \epsilon$$

But that is not the same. See the last inequality and how $\epsilon$ is placed differently (an example with $f(x)=\exp(x)$: $\log(\exp(1)+1) \neq 1+1$).
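As a quick numeric check of that example (a throwaway snippet, not part of the model fits):

log(exp(1) + 1)   # 1.313262
1 + 1             # 2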


The difference between glm with link function and linearized least squares

The difference is that the error terms are incorporated differently. We can write it down more explicitly for a logarithm/exponential function.

Let the linearized relationship lm(log(World) ~ days) be

$$\log(y_i) = a + b x_i + \epsilon_i$$

Then the non-linearized relationship is:

$$y_i = e^{a + b x_i + \epsilon_i}$$

and this is not the same as glm(World ~ days, family=gaussian(link='log')), which is

$$y_i = e^{a + b x_i} + \epsilon_i$$

The error term $\epsilon_i$ occurs differently in the formula.
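A minimal sketch to see this (reusing fit_lm and fit_glm from the question) is to compare the two fitted curves, which differ because the error enters differently:

exp(predict(fit_lm, last_14))                  # back-transformed fit of the linearized model
predict(fit_glm, last_14, type = "response")   # fitted means of the log-link glm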


The difference between different families

In the case of the Gaussian/Normal family the following two are the same:

$$Y\vert X \sim \mathcal{N}(\mu = h(X), \sigma^2 )$$

or

$$Y = h(X) + \epsilon \quad \text{where} \quad \epsilon \sim N(0,\sigma^2)$$

This separation into a linear sum of a deterministic component $h(X)$ plus an error/noise term $\epsilon$ will not work the same way for other families. For instance, for the Poisson distribution the noise term is larger for a larger mean.
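For example (an illustrative snippet, unrelated to the data above):

var(rpois(1e5, lambda = 5))     # roughly 5
var(rpois(1e5, lambda = 500))   # roughly 500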


Poisson distribution with log link

The log likelihood for a single observation $z$ is (up to an additive constant that does not depend on $\beta$)

$$L = z X\beta - e^{X\beta}$$

and

$$\frac{\partial L}{\partial \beta_i} = \left( z - e^{X\beta} \right) x_i$$
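Summing these derivatives over all observations $y_i$ and setting the sum to zero gives the score equations that the iterative scheme below solves:

$$\sum_{\forall i} \left( y_i - e^{X_i \beta} \right) X_i = 0$$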

In the GLM framework the optimum of this likelihood function is found by iteratively reweighted least squares, at each iteration solving

$$L_{iteration} = 0.5\, w(Y^\prime - X\beta)^2$$

with derivative

$$\frac{ \partial L_{iteration}}{\partial \beta_i} = w (Y^\prime - X\beta) x_i$$

and the transformation between the two would be (check https://www.jstor.org/stable/2344614 for the details):

$$Y^\prime = X\beta + \frac{z - e^{X\beta}}{e^{X\beta}}$$

and

$$w = e^{X\beta}$$

where we do not know $e^{X\beta}$, but the current estimate $e^{X\hat\beta}$ can be used and the result improved iteratively.

Intuitively

You could loosely view a GLM as approximating the more general exponential-family model by Gaussian noise, with $\theta = X\beta$:

$$Y \approx f(\theta) + \epsilon \quad \text{where} \quad \epsilon \sim N(0,w\sigma^2) $$

where

  • the weight $w$ relates to the non-homogeneity of the variance (e.g. in the case of the Poisson distribution, $\sigma^2 = \mu$)

and in linearized form

$$f^{-1}(Y) \approx \theta + \epsilon + \frac{Y-f(\theta + \epsilon)}{\partial f(\theta) / \partial \theta } \quad \text{where} \quad \epsilon \sim N(0,w\sigma^2) $$

where

  • the term $\frac{Y-f(\theta + \epsilon)}{\partial f(\theta) / \partial \theta }$ relates to the non-linearity in the effect of errors on the response when a link function is applied to it (i.e. the error distribution is modelled for $Y$ and not for $f^{-1}(Y)$, and that needs to be corrected; this is an additional correction, aside from the weights, which only correct for the non-homogeneity in the variance of $Y\vert X$ and not of $f^{-1}(Y) \vert X$)

Computational demonstration

days <- last_14$days
US <- last_14$US

### iterating (manual IRLS)
Y <- last_14$US
X <- cbind(rep(1,14),last_14$days)
coef <- c(2,0.3)                 # begin solution
yp <- exp(X %*% coef)
for (i in 1:100) {
  w <- as.numeric(yp)            # weights         
  Yprime <- log(yp) + (Y-yp)/yp  # y-values
  coef <- solve(crossprod(X,w*X), crossprod(X,w*Yprime))
  yp <- exp(X %*% coef)          # new solution
}

### glm function
modglm <- glm(US ~ days,  
              family = poisson(link = "log"), 
              control = list(epsilon = 10^-20, maxit = 100))


### direct optimization of likelihood
Loption = "Poisson"
L <- function(x) {
  a <- x[1]
  b <- x[2]
  Xb <- a+b*days
  if (Loption == "Poisson") {
    return(-sum(Y*Xb-exp(Xb)))
  } 
  if (Loption == "Gaussian loglink") {
    return(sum((Y-exp(Xb))^2))
  } 
  if (Loption == "linearized model") {
    return(sum((log(Y)-Xb)^2))
  } 
} 

start <- c(a=2,b=0.3)
modoptim <- optim(par = start,fn = L)

These give (nearly) the same results:

> # glm model
> modglm$coefficients
(Intercept)        days 
  2.4750654   0.2030466 

> # optimizing likelihood function
> modoptim$par
        a         b 
2.4745912 0.2031048 

> # manual computation
> coef
         [,1]
[1,] 2.4750654
[2,] 0.2030466
>
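The optim result is slightly off from the other two because of its default stopping rule; tightening the relative tolerance (as also discussed in the comments below) brings it in line with the glm and manual results. A sketch:

optim(par = start, fn = L, control = list(reltol = 1e-30))$par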

Computations for other cases

Below are the other cases. Note that the GLM fit with the Gaussian family can alternatively be done with nls.

> ###for the other cases
> 
> Loption = "Gaussian loglink"
> optim(par = start,fn = L)$par
        a         b 
2.1735638 0.2315177 
> glm(formula = US ~ days,  data=last_14, family=gaussian(link='log'))

Call:  glm(formula = US ~ days, family = gaussian(link = "log"), data = last_14)

Coefficients:
(Intercept)         days  
     2.1736       0.2315  

Degrees of Freedom: 13 Total (i.e. Null);  12 Residual
Null Deviance:      35020 
Residual Deviance: 1375     AIC: 110
> nls(US ~ exp(a+b*days), start = list(a=2,b=0.2))
Nonlinear regression model
  model: US ~ exp(a + b * days)
   data: parent.frame()
     a      b 
2.1736 0.2315 
 residual sum-of-squares: 1375

Number of iterations to convergence: 7 
Achieved convergence tolerance: 3.19e-06
> 
> 
> Loption = "linearized model"
> optim(par = start,fn = L)$par
        a         b 
2.5917459 0.1879523 
> lm(log(US) ~ days)

Call:
lm(formula = log(US) ~ days)

Coefficients:
(Intercept)         days  
     2.5918       0.1879  
  • $Y_i = f( X_i \beta) + \epsilon_i$ confuses me. What would $\epsilon_i$ be for a logistic regression? – Dave May 13 '20 at 16:24
  • We often take $\epsilon \sim N(0,\sigma^2)$ but you could use any other distribution (in the case of GLM it is limited to a distribution from a particular exponential family). Also the function $f$ can be anything. For logistic regression $f$ is the exponential function. – Sextus Empiricus May 13 '20 at 16:38
  • 1
    More precise would be to write that $Y\vert X \sim D(\mu = f(X\beta))$ and Y can be seen as following a distribution $D$ for which the mean parameter $E(Y\vert X)$ is equal to $f(X\beta)$. The description of a linear sum of a deterministic part and a random part makes indeed more sense for least squares (normal distribution). – Sextus Empiricus May 13 '20 at 16:43
  • optimizing likelihood function on POISSON one `modoptim$par` is not exactly equal to manual calculation or glm results. Could help me to understand why? – Haitao Du May 21 '20 at 08:16
  • @HaitaoDu I am not sure, but I guess that the (small) discrepancy is because the algorithm used in `optim` stops early, before reaching the exact solution (optim iteratively improves the estimate, obtaining an estimate with a higher likelihood each step and getting closer to the exact solution each step, and stops when the improvement steps are smaller than a certain factor times the machine precision). You can change the parameters to make optim return a more accurate result. – Sextus Empiricus May 21 '20 at 09:03
  • Thanks for your comments i think `control = list(reltol=1e-30)` fixed it. – Haitao Du May 21 '20 at 09:32
  • Example: for Nelder-Mead you can change the `reltol` parameter like this: `modoptim – Sextus Empiricus May 21 '20 at 09:51
  • This is by far the clearest explanation of the GLM which I have ever read. – hasManyStupidQuestions Jul 09 '20 at 20:20