I am very confused about how the `weights` argument works in `glm` with `family = "binomial"`. In my understanding, the likelihood of a GLM with `family = "binomial"` is specified as follows:
$$
f(y) =
{n\choose{ny}} p^{ny} (1-p)^{n(1-y)} = \exp \left(n \left[ y \log \frac{p}{1-p} - \left(-\log (1-p)\right) \right] + \log {n \choose ny}\right)
$$
where $y$ is the "proportion of observed success" and $n$ is the known number of trials.
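As a quick sanity check of this expression (my own snippet, with arbitrary values of $n$, $ny$ and $p$), it agrees numerically with `dbinom` evaluated at the number of successes:

n  <- 7                       ## number of trials
ny <- 3                       ## number of observed successes
y  <- ny / n                  ## proportion of successes
p  <- 0.4
lhs <- dbinom(ny, size = n, prob = p, log = TRUE)
rhs <- n * (y * log(p / (1 - p)) - (-log(1 - p))) + lchoose(n, ny)
all.equal(lhs, rhs)           ## TRUE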
In my understanding, the probability of success $p$ is parametrized by some linear coefficients $\beta$ as $p = p(\beta)$, and `glm` with `family = "binomial"` searches for:
$$
\textrm{arg}\max_{\beta} \sum_i \log f(y_i).
$$
Since $\log {n_i \choose n_i y_i}$ does not depend on $\beta$, this optimization problem simplifies to:
$$
\begin{aligned}
\textrm{arg}\max_{\beta} \sum_i \log f(y_i)
&= \textrm{arg}\max_{\beta} \sum_i \left\{ n_i \left[ y_i \log \frac{p(\beta)}{1-p(\beta)} - \left(-\log (1-p(\beta))\right) \right] + \log {n_i \choose n_i y_i} \right\} \\
&= \textrm{arg}\max_{\beta} \sum_i n_i \left[ y_i \log \frac{p(\beta)}{1-p(\beta)} - \left(-\log (1-p(\beta))\right) \right].
\end{aligned}
$$
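To convince myself that dropping the $\log {n_i \choose n_i y_i}$ term is harmless, here is a small check of my own (for an intercept-only model, where $\log\frac{p}{1-p}=\beta_0$, with arbitrary made-up values of $n_i$ and $y_i$): the full and simplified objectives differ only by a constant that does not involve $\beta_0$.

n <- c(2, 5, 3)                ## numbers of trials (arbitrary)
s <- c(1, 3, 2)                ## numbers of successes (arbitrary)
y <- s / n                     ## proportions of successes
obj_full  <- function(b0) sum(n * (y * b0 - log1p(exp(b0))) + lchoose(n, s))
obj_short <- function(b0) sum(n * (y * b0 - log1p(exp(b0))))
obj_full(0.3)  - obj_short(0.3)    ## same constant ...
obj_full(-1.7) - obj_short(-1.7)   ## ... whatever b0 is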
Therefore, if we set $n_i^* = c \, n_i$ for all $i=1,\dots,N$ and some constant $c>0$, it must also be true that:
$$
\textrm{arg}\max_{\beta} \sum_i \log f(y_i)
=
\textrm{arg}\max_{\beta} \sum_i n^*_i \left[ y_i \log \frac{p(\beta)}{1-p(\beta)} - \left(-\log (1-p(\beta))\right) \right].
$$
From this, I concluded that scaling the number of trials $n_i$ by a constant does NOT affect the maximum likelihood estimate of $\beta$, given the proportions of success $y_i$.
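Indeed, if I maximize the simplified weighted log-likelihood directly (my own sketch for an intercept-only model, outside of `glm`, using `optimize`, and using the same data as in my example below), scaling the weights by 1000 leaves the maximizer unchanged:

y <- c(1, 0, 0, 0)        ## proportions of successes
n <- 1:4                  ## numbers of trials
loglik <- function(b0, n) sum(n * (y * b0 - log1p(exp(b0))))
optimize(loglik, c(-20, 20), n = n,        maximum = TRUE)$maximum  ## about -2.197
optimize(loglik, c(-20, 20), n = n * 1000, maximum = TRUE)$maximum  ## essentially the same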
The help file of `glm` says: "For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes".
Therefore I expected that scaling the weights would not affect the estimated $\beta$ when the response is the proportion of successes. However, the following two calls return different coefficient estimates:
Y <- c(1,0,0,0) ## proportion of observed success
w <- 1:length(Y) ## weight= the number of trials
glm(Y~1,weights=w,family=binomial)
This yields:
Call: glm(formula = Y ~ 1, family =
"binomial", weights = w)
Coefficients:
(Intercept)
-2.197
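This first result is what I would expect: the intercept equals the logit of the weighted proportion of successes (a quick check of mine, using `Y` and `w` from above):

qlogis(sum(w * Y) / sum(w))   ## log(0.1 / 0.9) = -2.197225, matches the intercept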
However, if I multiply all the weights by 1000, the estimated coefficient is completely different:
glm(Y~1,weights=w*1000,family=binomial)
Call: glm(formula = Y ~ 1, family = binomial,
weights = w * 1000)
Coefficients:
(Intercept)
-3.153e+15
I have seen many other examples like this, even with more moderate scaling of the weights. What is going on here?