Mann-Whitney with many zeros

Question

I'm running a basic Mann-Whithney U test in R between two groups; each group represents the abundance values of a particular bacteria within an animal. Therefore a 0 means that bacteria is not present within the animal:

a <- c(0,0,12,0,0,76,0,0,81,0,0,0,0,0)
b <- c(427,928,0,127,0,0,189,0,0,0,0,0,312,583,0)
wilcox.test(a,b,exact=FALSE)

The returned p-value is 0.1362, however I would have expected it to be <0.05 since the two groups are quite different (at least IMHO). I take it that the abundance of zeros is causing this.

Is there another test to use in order to check whether these two group are different? It looks like a zero-inflated negative binomial regression was suggested here, however I'm not familiar with that test or whether it applies to my data set.

Can I simply omit the zeros? wilcox.test(a[a!=0],b[b!=0],exact=FALSE) yields a p-value of 0.02, but I'm not sure if this is a good approach.

UPDATE

Given tristan's update, I've looked into zero-inflated count data regression. I'm not sure if I'm running things properly (since I'm new to the models) but I'll post code and results:

library(pscl)
library(lmtest)
df<-data.frame('Abundance'=append(a,b))
df$Group <- 'groupA'
df[(length(a) + 1):(length(a) + length(b)),2] <- 'groupB'
summary(m1 <- zeroinfl(Abundance ~ Group | Group, data = df))
summary(mnull <- update(m1, . ~ 1))
lrtest(m1, mnull)

Returns:

Call:
zeroinfl(formula = Abundance ~ Group | Group, data = df)

Pearson residuals:
      Min        1Q    Median        3Q       Max 
-0.864258 -0.864258 -0.728962 -0.002854  4.105212 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.53223    0.07647   46.19   <2e-16 ***
GroupgroupB  2.52612    0.07898   31.98   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5878     0.5578   1.054    0.292
GroupgroupB  -0.3001     0.7764  -0.387    0.699
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 10 
Log-likelihood:  -655 on 4 Df
> summary(mnull <- update(m1, . ~ 1))

Call:
zeroinfl(formula = Abundance ~ 1, data = df)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.8018 -0.8018 -0.8018 -0.1681  6.8098 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.51672    0.01911   288.6   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.4353     0.3870   1.125    0.261
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 8 
Log-likelihood: -1705 on 2 Df
> lrtest(m1, mnull)
Likelihood ratio test

Model 1: Abundance ~ Group | Group
Model 2: Abundance ~ 1
  #Df   LogLik Df  Chisq Pr(>Chisq)    
1   4  -654.97                         
2   2 -1705.50 -2 2101.1  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

To answer it is important to know your data generating process. Are these zeroes to be taken seriously? You also don't have that many data points so it's not surprising if you struggle to reach statistical significance... — tristan, Oct 09 '15 at 14:49
edited question explaining the data. 0 means the bacteria was not present in the animal. — Constantino, Oct 09 '15 at 15:05
Okay, thanks for explaining. A zero-inflated model sounds appropriate in this case, but it's not immediately clear whether a zero-inflated poisson or zero-inflated negative binomial model would be more appropriate. — tristan, Oct 09 '15 at 15:19
Just wanted to add that the problem with the Mann-Whitney U test for your data is that it compares the ranks of the points in each group and not the actual value. Hence it does not take into account that group B's animals with the bacteria have bacteria in way higher abundance than those in group A. The test only accounts that it's "higher" but not "how much higher", so it doesn't see as much difference between the two groups as you do. — Sheep, Oct 09 '15 at 17:49
That's a good point @Sheep - is there a test better suited that takes into account the actual value? — Constantino, Oct 09 '15 at 17:59
I don't know of a simple test, but Tristan's idea sounds promising. — Sheep, Oct 09 '15 at 18:31

score 2 · Accepted Answer · answered Oct 09 '15 at 15:40

I know you are working in R, but using Stata:

. zip abundance, inflate(group)

Fitting constant-only model:

Iteration 0:   log likelihood = -2482.9754  
Iteration 1:   log likelihood = -1171.7522  
Iteration 2:   log likelihood =   -1166.32  
Iteration 3:   log likelihood = -1166.3055  
Iteration 4:   log likelihood = -1166.3055  

Fitting full model:

Iteration 0:   log likelihood = -1166.3055  
Iteration 1:   log likelihood = -1166.3055  

Zero-inflated Poisson regression                Number of obs     =         29
                                                Nonzero obs       =          9
                                                Zero obs          =         20

Inflation model = logit                         LR chi2(0)        =       0.00
Log likelihood  = -1166.305                     Prob > chi2       =          .

------------------------------------------------------------------------------
   abundance |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
abundance    |
       _cons |   5.716662   .0191215   298.97   0.000     5.679185     5.75414
-------------+----------------------------------------------------------------
inflate      |
       group |  -.8938181   .8378665    -1.07   0.286    -2.536006    .7483701
       _cons |   1.299283    .651339     1.99   0.046     .0226822    2.575884
------------------------------------------------------------------------------

. estimates store restricted

. zip abundance group, inflate(group)

Fitting constant-only model:

Iteration 0:   log likelihood = -2482.9754  
Iteration 1:   log likelihood = -1171.7522  
Iteration 2:   log likelihood =   -1166.32  
Iteration 3:   log likelihood = -1166.3055  
Iteration 4:   log likelihood = -1166.3055  

Fitting full model:

Iteration 0:   log likelihood = -1166.3055  
Iteration 1:   log likelihood = -658.16134  
Iteration 2:   log likelihood = -575.99026  
Iteration 3:   log likelihood = -574.37859  
Iteration 4:   log likelihood =  -574.3781  
Iteration 5:   log likelihood =  -574.3781  

Zero-inflated Poisson regression                Number of obs     =         29
                                                Nonzero obs       =          9
                                                Zero obs          =         20

Inflation model = logit                         LR chi2(1)        =    1183.85
Log likelihood  = -574.3781                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
   abundance |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
abundance    |
       group |   2.027058   .0794158    25.52   0.000     1.871406     2.18271
       _cons |   4.031286   .0769231    52.41   0.000      3.88052    4.182053
-------------+----------------------------------------------------------------
inflate      |
       group |  -.8938179   .8378665    -1.07   0.286    -2.536006    .7483702
       _cons |   1.299283   .6513389     1.99   0.046     .0226821    2.575884
------------------------------------------------------------------------------

. estimates store full

. lrtest full restricted, stats

Likelihood-ratio test                                 LR chi2(1)  =   1183.85
(Assumption: restricted nested in full)               Prob > chi2 =    0.0000

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
-------------+---------------------------------------------------------------
  restricted |         29 -1166.305  -1166.305       3    2338.611   2342.713
        full |         29 -1166.305  -574.3781       4    1156.756   1162.225
-----------------------------------------------------------------------------
               Note: N=Obs used in calculating BIC; see [R] BIC note.

This demonstrates that using the zero-inflated poisson model with group (A vs B) predicting the inflation, it is significantly better for model fitting to include group to predict abundance (as indicated by likelihood ratio test and by highly significant coefficient in regression).

Can one then associate a p-value for the group comparison A vs B? — Constantino, Oct 12 '15 at 15:09
That is the p-value for group under abundance in the full model :) — tristan, Oct 12 '15 at 15:42

Mann-Whitney with many zeros

1 Answers1