
I'm trying to run a zero-inflated regression for a continuous response variable in R. I'm aware of a gamlss implementation, but I'd really like to try out this algorithm by Dale McLerran that is conceptually a bit more straightforward. Unfortunately, the code is in SAS and I'm not sure how to rewrite it for something like nlme.
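For reference, my understanding of the gamlss route is the sketch below (ZAGA being the zero-adjusted Gamma family, with nu modelling Pr(y = 0)), though it's McLerran's approach that I'd like to reproduce:

# My understanding of the gamlss version, for comparison;
# ZAGA = zero-adjusted Gamma, where nu models Pr(y == 0)
library(gamlss)
fit <- gamlss(y ~ x1, nu.formula = ~ x1, family = ZAGA, data = mydata)
summary(fit)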

McLerran's SAS code is as follows:

proc nlmixed data=mydata;
  parms b0_f=0 b1_f=0
        b0_h=0 b1_h=0
        log_theta=0;

  /* logistic model for Pr(y = 0) */
  eta_f = b0_f + b1_f*x1;
  p_yEQ0 = 1 / (1 + exp(-eta_f));

  /* Gamma model with log link for E(y | y > 0) */
  eta_h = b0_h + b1_h*x1;
  mu    = exp(eta_h);
  theta = exp(log_theta);  /* shape */
  r = mu/theta;            /* scale */

  /* zero-inflated Gamma log-likelihood */
  if y=0 then
     ll = log(p_yEQ0);
  else
     ll = log(1 - p_yEQ0)
          - lgamma(theta) + (theta-1)*log(y) - theta*log(r) - y/r;

  model y ~ general(ll);
  predict (1 - p_yEQ0)*mu out=expect_zig;
  predict r out=shape;
  estimate "scale" theta;
run;

From: http://listserv.uga.edu/cgi-bin/wa?A2=ind0805A&L=sas-l&P=R20779

ADD:

Note: There are no random effects present here, only fixed effects.

The advantage of this fit is that, even though the coefficients are the same as if you separately fit a logistic regression to Pr(y=0) and a Gamma regression with log link to E(y | y>0), you can estimate the combined function E(y), which includes the zeroes. One can predict this value in SAS (with a CI) using the line predict (1 - p_yEQ0)*mu.
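That is, by the law of total expectation, E(y) = Pr(y>0) * E(y | y>0) = (1 - p_yEQ0) * mu, which is exactly the quantity in that predict statement.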

Further, one can write custom contrast statements to test the significance of predictor variables on E(y). For example, here is another version of the SAS code I have used:

proc nlmixed data=TestZIG;
  parms b0_f=0 b1_f=0 b2_f=0 b3_f=0
        b0_h=0 b1_h=0 b2_h=0 b3_h=0
        log_theta=0;

  /* dummy-code the gifts factor */
  if gifts = 1 then x1=1; else x1=0;
  if gifts = 2 then x2=1; else x2=0;
  if gifts = 3 then x3=1; else x3=0;

  eta_f = b0_f + b1_f*x1 + b2_f*x2 + b3_f*x3;
  p_yEQ0 = 1 / (1 + exp(-eta_f));

  eta_h = b0_h + b1_h*x1 + b2_h*x2 + b3_h*x3;
  mu    = exp(eta_h);
  theta = exp(log_theta);
  r = mu/theta;

  if amount=0 then
     ll = log(p_yEQ0);
  else
     ll = log(1 - p_yEQ0)
          - lgamma(theta) + (theta-1)*log(amount) - theta*log(r) - amount/r;

  model amount ~ general(ll);
  predict (1 - p_yEQ0)*mu out=expect_zig;
  estimate "scale" theta;
run;

Then to estimate "gift1" versus "gift2" (b1 versus b2), which is just E(y | gift=1) - E(y | gift=2) written out on the model scale, we can use this estimate statement:

estimate "gift1 versus gift 2" 
 (1-(1 / (1 + exp(-b0_f -b1_f))))*(exp(b0_h + b1_h)) - (1-(1 / (1 + exp(-b0_f -b2_f))))*(exp(b0_h + b2_h)) ; 

Can R do this?

  • user779747 did note in his cross-posting to R-help that this had been posted here first. I haven't seen a specific request to post such a notice on SO, but some (most?) of us cross-helpeRs expect it because that is the stated expectation in the R Mailing Lists. – DWin Sep 14 '11 at 16:27

1 Answer


Having spent some time on this code, it appears to me that it basically:

1) Does a logistic regression with right-hand side b0_f + b1_f*x1 and y > 0 as the target variable,

2) For those observations for which y > 0, performs a regression with right-hand side b0_h + b1_h*x1, a Gamma likelihood, and link=log,

3) Also estimates the shape parameter of the Gamma distribution.

It maximizes the likelihood jointly, which is nice because you only have to make one function call. However, the likelihood separates into the two pieces anyway, so the joint fit doesn't give you improved parameter estimates.
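That said, if you do want the single joint maximization as in NLMIXED (for instance, to get standard errors for functions of all the parameters at once), one option is to code the log-likelihood from the SAS program directly and hand it to optim. Here is a minimal sketch of that idea; zig.negll and the parameter ordering are my own naming, not part of the original algorithm:

# Joint negative log-likelihood of the zero-inflated Gamma model,
# mirroring the ll in the NLMIXED code (single predictor x)
zig.negll <- function(par, y, x) {
  p0    <- plogis(par[1] + par[2]*x)   # Pr(y == 0), logistic part
  mu    <- exp(par[3] + par[4]*x)      # E(y | y > 0), log link
  theta <- exp(par[5])                 # Gamma shape
  ll    <- numeric(length(y))
  z     <- y == 0
  ll[z]  <- log(p0[z])
  ll[!z] <- log(1 - p0[!z]) +
            dgamma(y[!z], shape = theta, scale = (mu/theta)[!z], log = TRUE)
  -sum(ll)
}

fit <- optim(rep(0, 5), zig.negll, y = y, x = x,
             method = "BFGS", hessian = TRUE)
fit$par                          # b0_f, b1_f, b0_h, b1_h, log_theta
sqrt(diag(solve(fit$hessian)))   # asymptotic standard errors

Since the likelihood separates, the point estimates should agree with the two glm fits below (up to the sign convention on the logistic part: the SAS code models Pr(y = 0), the glm below models Pr(y > 0)).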

Here is some R code which makes use of the glm function to save programming effort. This may not be what you'd like, as it obscures the algorithm itself. The code certainly isn't as clean as it could / should be, either.

McLerran <- function(y, x)
{
  z <- y > 0            # indicator for the logistic part: Pr(y > 0)
  y.gt.0 <- y[y > 0]    # positive responses only, for the Gamma part
  x.gt.0 <- x[y > 0]

  # logistic regression for Pr(y > 0), Gamma regression for E(y | y > 0)
  m1 <- glm(z ~ x, family = binomial)
  m2 <- glm(y.gt.0 ~ x.gt.0, family = Gamma(link = log))

  list("p.ygt0" = m1, "ygt0" = m2)
}

# Sample data
x <- runif(100)
y <- rgamma(100, 3, 1)      # Not a function of x (coef. of x = 0)
b <- rbinom(100, 1, 0.5*x)  # p(y==0) is a function of x
y[b==1] <- 0

foo <- McLerran(y,x)
summary(foo$ygt0)

Call:
glm(formula = y.gt.0 ~ x.gt.0, family = Gamma(link = log))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.08888  -0.44446  -0.06589   0.28111   1.31066  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2033     0.1377   8.737 1.44e-12 ***
x.gt.0       -0.2440     0.2352  -1.037    0.303    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Gamma family taken to be 0.3448334)

    Null deviance: 26.675  on 66  degrees of freedom
Residual deviance: 26.280  on 65  degrees of freedom
AIC: 256.42

Number of Fisher Scoring iterations: 6

The shape parameter for the Gamma distribution is equal to 1 / the dispersion parameter for the Gamma family. Coefficients and other quantities which you might like to access programmatically are available from the individual elements of the returned list:

> coefficients(foo$p.ygt0)
(Intercept)           x 
   2.140239   -2.393388 
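Similarly, the implied Gamma shape can be pulled out of the fit; a one-liner, using the dispersion component that summary.glm reports:

# Implied Gamma shape (theta) = 1 / estimated dispersion
theta.hat <- 1 / summary(foo$ygt0)$dispersion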

Prediction can be done using the output of the routine. Here's some more R code that shows how to generate expected values and some other information:

# Predict the expected value E(y) = Pr(y > 0) * E(y | y > 0)
predict.McLerran <- function(model, x.new)
{
  x <- as.data.frame(x.new)
  colnames(x) <- "x"
  x$x.gt.0 <- x$x       # the Gamma part was fit against x.gt.0

  pred.p.ygt0 <- predict(model$p.ygt0, newdata=x, type="response", se.fit=TRUE)
  pred.ygt0   <- predict(model$ygt0, newdata=x, type="response", se.fit=TRUE)

  p0 <- 1 - pred.p.ygt0$fit   # Pr(y == 0)
  mu <- pred.ygt0$fit         # E(y | y > 0)
  ev <- (1 - p0) * mu         # E(y)

  se.p0 <- pred.p.ygt0$se.fit
  se.mu <- pred.ygt0$se.fit

  # SE of a product of two (assumed independent) estimates:
  # Var(XY) = E(X)^2 Var(Y) + E(Y)^2 Var(X) + Var(X) Var(Y)
  se.fit <- sqrt(((1 - p0)*se.mu)^2 + (mu*se.p0)^2 + (se.p0*se.mu)^2)

  list("fit"=ev, "p0"=p0, "se.fit"=se.fit,
       "pred.p.ygt0"=pred.p.ygt0, "pred.ygt0"=pred.ygt0)
}

And a sample run:

> x.new <- seq(0.05,0.95,length=5)
> 
> foo.pred <- predict.McLerran(foo, x.new)
> foo.pred$fit
       1        2        3        4        5 
2.408946 2.333231 2.201889 2.009979 1.763201 
> foo.pred$se.fit
        1         2         3         4         5 
0.3409576 0.2378386 0.1753987 0.2022401 0.2785045 
> foo.pred$p0
        1         2         3         4         5 
0.1205351 0.1733806 0.2429933 0.3294175 0.4291541 

Now for coefficient extraction and the contrasts:

coef.McLerran <- function(model)
{
  # Stack the coefficients of the two parts: logistic first, then Gamma
  temp1 <- coefficients(model$p.ygt0)
  temp2 <- coefficients(model$ygt0)
  names(temp1) <- NULL
  names(temp2) <- NULL
  retval <- c(temp1, temp2)
  names(retval) <- c("b0.f", "b1.f", "b0.h", "b1.h")
  retval
}

contrast.McLerran <- function(b0_f, b1_f, b2_f, b0_h, b1_h, b2_h)
{
  # Direct transcription of the SAS estimate statement:
  # E(y | gift = 1) - E(y | gift = 2). Note the SAS b*_f parameterize
  # Pr(y = 0), whereas the glm fit above models Pr(y > 0), so glm
  # coefficients enter with the opposite sign convention.
  (1 - (1/(1 + exp(-b0_f - b1_f)))) * exp(b0_h + b1_h) -
    (1 - (1/(1 + exp(-b0_f - b2_f)))) * exp(b0_h + b2_h)
}


> coef.McLerran(foo)
      b0.f       b1.f       b0.h       b1.h 
 2.0819321 -1.8911883  1.0009568  0.1334845 
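As a quick illustration on this toy one-predictor fit (my own addition, not from the SAS code; and remember the glm part models Pr(y > 0) directly, so no 1 - (...) flip is needed), the analogous "x = 1 versus x = 0" contrast on E(y) is:

cf <- coef.McLerran(foo)
# E(y | x = 1) - E(y | x = 0), combining both parts of the model
ey.x1 <- plogis(cf["b0.f"] + cf["b1.f"]) * exp(cf["b0.h"] + cf["b1.h"])
ey.x0 <- plogis(cf["b0.f"]) * exp(cf["b0.h"])
unname(ey.x1 - ey.x0)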
  • You are correct with regard to what is happening with the "parts" (i.e. logit regression for Pr(y>0) and gamma regression for E(y|y>0)), but it is the combined estimate (and its standard errors, CI) that is of main interest, i.e. E(y). Predictions of this quantity are made in the SAS code by (1 - p_yEQ0)*mu. This formulation allows you to conduct contrasts on the coefficients of this combined value. – B_Miner Jan 03 '12 at 02:32
  • @B_Miner - I've added some code + examples that partially address the prediction issue, thanks for pointing that out. – jbowman Jan 05 '12 at 17:30
  • Is this not just separate estimates though? In SAS, NLMIXED gives the ability to estimate the point estimate of E(y) as well as a CI (using the delta method, I believe). Also, you can write user-defined contrasts of the parameters, as I showed above, to test linear hypotheses. There must be an R alternative? – B_Miner Jan 05 '12 at 22:14
  • Well, yes and no. To use the example, the returned `foo.pred$fit` gives the point estimate of E(y), but the component `foo.pred$pred.ygt0$fit` would give you E(y|y>0). I added in the standard error calculation for y, BTW, returned as se.fit. The coefficients can be obtained from the components by coefficients(`foo$ygt0`) and coefficients(`foo$p.ygt0`); I'll write an extraction routine and a contrast routine in a little while. – jbowman Jan 05 '12 at 22:49
  • Can you please describe where this comes from: se.fit – B_Miner Jan 06 '12 at 18:30
  • It is the formula for the std. deviation of a product of two random variables when the two variables are independent. In this case I make use of the fact that var(1-p0) = var(p0). The formula for the variance is (E(x)SD(y))^2 + (E(y)SD(x))^2 + (SD(x)SD(y))^2. – jbowman Jan 06 '12 at 18:46
  • Thanks a lot for such a helpful answer! I haven't been here for a while and ended up solving the problem by moving to non-parametric methods altogether. Nonetheless, I'm now studying your code with great interest and will very likely use it in the future. All the best, Mike – a11msp Mar 25 '13 at 19:35
  • PS. I can see that while PROC NLMIXED maximizes the likelihood jointly, in the R implementation you propose the two models are fit completely separately. You say, however, that "the likelihood separates anyway, so you don't get improved parameter estimates as a result". Would you mind explaining this point a bit more? I suppose the joint likelihood in the SAS code is computed as "log(1 - p_yEQ0) - lgamma(theta) + (theta-1)*log(amount) - theta*log(r) - amount/r". What would be the equivalent way to get a single likelihood estimate for this model in R? Many thanks! – a11msp Mar 25 '13 at 19:43