
There is an example of how to run a GLM for proportion data in Stata here.

The dependent variable (meals) is the proportion of students receiving free or reduced-price meals at school. The Stata model looks like this:

glm meals yr_rnd parented api99, link(logit) family(binomial) robust nolog

I'm interested in learning how to replicate these results in R (ideally using the same robust approach). Let's imagine that I have data on the number of students receiving free meals (Successes) and the rest of the students (Failures). I'm guessing the model in R could look something like this:

fitglm <- glm(cbind(Successes,Failures) ~ yr_rnd + parented + api99, family=binomial)

Also, it was pointed out to me elsewhere (Penguin_Knight) that the message "meals has non-integer values" could be a problem. I'm clueless regarding this message...
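For what it's worth, here is a sketch of an equivalent specification (assuming my counts sit in a hypothetical data frame dat with the Successes and Failures columns mentioned above): the proportion of successes is the response and the number of trials is passed via weights. The non-integer message usually appears when a raw proportion is modelled without supplying such counts.

# Sketch: equivalent to the cbind() form above, assuming a hypothetical
# data frame `dat` with Successes and Failures counts per school
dat$total <- dat$Successes + dat$Failures
fitglm2 <- glm(Successes/total ~ yr_rnd + parented + api99,
               family = binomial, weights = total, data = dat)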

KT12
Charlie Glez
  • In Stata `vce(robust)` rather than `robust` is what is now documented, but `robust` should still work. – Nick Cox Mar 14 '14 at 11:43

2 Answers


Using the R package sandwich, you can replicate the results as follows (I assume that you've already downloaded the dataset):

#-----------------------------------------------------------------------------
# Load the required packages
#-----------------------------------------------------------------------------

require(foreign)
require(sandwich)

#-----------------------------------------------------------------------------
# Load the data
#-----------------------------------------------------------------------------

dat <- read.dta("MyPath/proportion.dta")

#-----------------------------------------------------------------------------
# Inspect dataset
#-----------------------------------------------------------------------------

str(dat)

#-----------------------------------------------------------------------------
# Fit the glm
#-----------------------------------------------------------------------------

fitglm <- glm(meals ~ yr_rnd + parented + api99, family = binomial(logit), data = dat)

#-----------------------------------------------------------------------------
# Output of the model
#-----------------------------------------------------------------------------

summary(fitglm)

#-----------------------------------------------------------------------------
# Calculate robust standard errors
#-----------------------------------------------------------------------------

cov.m1 <- vcovHC(fitglm, type = "HC0")

std.err <- sqrt(diag(cov.m1))

q.val <- qnorm(0.975)

r.est <- cbind(
  Estimate = coef(fitglm)
  , "Robust SE" = std.err
  , z = (coef(fitglm)/std.err)
  , "Pr(>|z|) "= 2 * pnorm(abs(coef(fitglm)/std.err), lower.tail = FALSE)
  , LL = coef(fitglm) - q.val  * std.err
  , UL = coef(fitglm) + q.val  * std.err
)

r.est

The model output using robust standard errors is:

                Estimate   Robust SE         z     Pr(>|z|)            LL           UL
(Intercept)  6.801682703 0.072368970  93.98618  0.000000e+00  6.659842129  6.943523277
yr_rndYes    0.048252657 0.032167588   1.50004  1.336041e-01 -0.014794657  0.111299970
parented    -0.766259824 0.039066917 -19.61403  1.173462e-85 -0.842829574 -0.689690073
api99       -0.007304603 0.000215534 -33.89072 9.127821e-252 -0.007727042 -0.006882164

The estimates and standard errors are fairly similar to those calculated using Stata. I don't know why the intercept is different, though. The Stata output is:

------------------------------------------------------------------------------
             |               Robust
       meals |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |   .0482527   .0321714     1.50   0.134    -.0148021    .1113074
    parented |  -.7662598   .0390715   -19.61   0.000    -.8428386   -.6896811
       api99 |  -.0073046   .0002156   -33.89   0.000    -.0077271   -.0068821
       _cons |    6.75343   .0896767    75.31   0.000     6.577667    6.929193
------------------------------------------------------------------------------

There are several methods available for the function vcovHC. Consult the help file of vcovHC for the details.
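For instance, a quick way to compare a few of the available estimator types on the fitted model (a minimal sketch reusing fitglm from above):

#-----------------------------------------------------------------------------
# Compare robust standard errors across several vcovHC estimator types
#-----------------------------------------------------------------------------

sapply(c("HC0", "HC1", "HC2", "HC3"), function(type) {
  sqrt(diag(vcovHC(fitglm, type = type)))
})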

Note that if you use the option family = quasibinomial(logit), there will be no warning message (see here).
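For example (a sketch along the same lines as the model above):

#-----------------------------------------------------------------------------
# Quasi-binomial fit: no warning about non-integer successes; a dispersion
# parameter is estimated from the data instead of being fixed at 1
#-----------------------------------------------------------------------------

fitquasi <- glm(meals ~ yr_rnd + parented + api99, family = quasibinomial(logit), data = dat)

summary(fitquasi)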

COOLSerdash
  • +1. What would happen if you use `glm()` with `family=quasibinomial`? Isn't it supposed to estimate robust standard errors by itself, or at least do something conceptually similar by computing standard errors accounting for over-dispersion? – amoeba Sep 05 '16 at 19:35
  • @amoeba: while `family = quasibinomial` leads to *a* robust estimate of the variance ($\alpha \hat p (1- \hat p)$), it is not the same robust estimator as the sandwich estimator (in which for each observation, the variance is estimated as $(y_i - \hat y_i)^2$ rather than by some assumed functional form). – Cliff AB Sep 05 '16 at 22:39
  • @CliffAB Thanks a lot! This agrees with what I've been reading during the last hour, e.g. [in this book](https://books.google.pt/books?id=4M8dCgAAQBAJ&lpg=PA63&ots=bHYmWd9wH8&dq=quasibinomial%20glm&pg=PA61#v=onepage&q=quasibinomial%20glm&f=false) or [in this SO post](http://stackoverflow.com/questions/19893133/fractional-logit-model-in-r) and linked SO posts. (Do you have better references?) It is a pity we do not seem to have a good CV thread that would accurately explain different approaches. – amoeba Sep 05 '16 at 22:43
  • @amoeba: I'm afraid I don't have any good references off hand; I just learned about it in a class which did not use a book. I recall my professor mentioning that it was an idea that actually began in the field of economics before being readily accepted by statisticians. But that's all I've got. – Cliff AB Sep 05 '16 at 23:34
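To see the difference described in these comments in practice, one can put the two sets of standard errors side by side (a sketch reusing fitglm and dat from the answer above):

# Quasi-binomial (dispersion-scaled, model-based) vs. sandwich (HC0) SEs
fitquasi <- glm(meals ~ yr_rnd + parented + api99,
                family = quasibinomial(logit), data = dat)
cbind(quasibinomial = sqrt(diag(vcov(fitquasi))),
      sandwich      = sqrt(diag(vcovHC(fitglm, type = "HC0"))))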

You can replicate the UCLA FAQ on proportions (with a proportion as the dependent variable) as follows:

require(foreign);require(lmtest);require(sandwich)
meals <- read.dta("http://www.ats.ucla.edu/stat/stata/faq/proportion.dta")
fitperc <- glm(meals ~ yr_rnd + parented + api99, family = binomial, data = meals, na.action = na.exclude)
## Warning message:
## In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

I don't know if the warning above is an issue here or not. For some reason the intercepts don't match in R and Stata, but since we usually don't interpret the intercept in logit/probit models anyway, it shouldn't matter much.

summary(fitperc)
## 
## Call:
## glm(formula = meals ~ yr_rnd + parented + api99, family = binomial, 
##     data = meals, na.action = na.exclude)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.77722  -0.18995  -0.01649   0.18692   1.60959  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.801683   0.231914  29.329   <2e-16 ***
## yr_rndYes    0.048253   0.104210   0.463    0.643    
## parented    -0.766260   0.090733  -8.445   <2e-16 ***
## api99       -0.007305   0.000506 -14.435   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1953.94  on 4256  degrees of freedom
## Residual deviance:  395.81  on 4253  degrees of freedom
##   (164 observations deleted due to missingness)
## AIC: 2936.7
## 
## Number of Fisher Scoring iterations: 5

In R the small-sample corrections used are different from those in Stata, but the robust SEs are fairly similar:

coeftest(fitperc, function(x) vcovHC(x, type = "HC1"))
## 
## z test of coefficients:
## 
##                Estimate  Std. Error  z value Pr(>|z|)    
## (Intercept)  6.80168270  0.07240299  93.9420   <2e-16 ***
## yr_rndYes    0.04825266  0.03218271   1.4993   0.1338    
## parented    -0.76625982  0.03908528 -19.6048   <2e-16 ***
## api99       -0.00730460  0.00021564 -33.8748   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

To use the exact same small-sample correction as Stata, you need to follow this post:

sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(fitperc, vcov = sandwich1)
## 
## z test of coefficients:
## 
##                Estimate  Std. Error  z value Pr(>|z|)    
## (Intercept)  6.80168270  0.07237747  93.9751   <2e-16 ***
## yr_rndYes    0.04825266  0.03217137   1.4999   0.1336    
## parented    -0.76625982  0.03907151 -19.6117   <2e-16 ***
## api99       -0.00730460  0.00021556 -33.8867   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The log-likelihood and the confidence intervals (the latter differ slightly from Stata's because `confint()` computes profile-likelihood intervals rather than Wald intervals):

logLik(fitperc)
## 'log Lik.' -1464.363 (df=4)
confint(fitperc)
## Waiting for profiling to be done...
##                    2.5 %       97.5 %
## (Intercept)  6.352788748  7.262067304
## yr_rndYes   -0.155529338  0.253123151
## parented    -0.944775733 -0.588903012
## api99       -0.008303668 -0.006319185
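If you want Wald-type intervals based on the robust variance instead (which is closer to what Stata reports), here is a sketch reusing the sandwich1() function defined above:

# Wald-type 95% CIs: estimate +/- z * robust SE, with the Stata-style correction
rob.se <- sqrt(diag(sandwich1(fitperc)))
cbind(LL = coef(fitperc) - qnorm(0.975) * rob.se,
      UL = coef(fitperc) + qnorm(0.975) * rob.se)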

To obtain the predictions:

meals_pred <- data.frame(api99=rep(c(500,600,700), 2), 
           yr_rnd=rep(c("No", "Yes"), times=1, each=3), 
           parented=rep(2.5, 6))
cbind(meals_pred, pred=predict(fitperc, meals_pred, "response"))
##   api99 yr_rnd parented      pred
## 1   500     No      2.5 0.7744710
## 2   600     No      2.5 0.6232278
## 3   700     No      2.5 0.4434458
## 4   500    Yes      2.5 0.7827873
## 5   600    Yes      2.5 0.6344891
## 6   700    Yes      2.5 0.4553849
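If you also want uncertainty around these predictions, `predict()` can return standard errors on the link scale, which can then be back-transformed (a sketch):

# Predictions with approximate 95% CIs, computed on the link (logit) scale
# and back-transformed to the probability scale with plogis()
pr <- predict(fitperc, meals_pred, type = "link", se.fit = TRUE)
cbind(meals_pred,
      pred = plogis(pr$fit),
      ll   = plogis(pr$fit - qnorm(0.975) * pr$se.fit),
      ul   = plogis(pr$fit + qnorm(0.975) * pr$se.fit))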

See this question for a related discussion:

landroni