5

I have data from a field survey. The objective of the study is to relate number of seedling (respond variable, count data), landform (exploratory variable, categorical variable with 3 levels) and percent canopy coverage (exploratory variable, quantitative). In each habitat, I have data from five 25x25 meter plots. Within each plot I used three 2x2 meter subplots nested within the bigger plot, and number of seedlings were count from these subplots. Total number of observations is 60; 20 plots x 3 subplots. Only one kind of landform contained seedlings. Other two landforms contained no seedlings, so they are empty plots. the data looks like this:

data.frame':    60 obs. of  5 variables:
 $ plot         : Factor w/ 20 levels "HD2","LC1","LC2",..: 16 16 16 17 17 17 12 12 12 9 ...
 $ subplot      : Factor w/ 60 levels "HD2.1","HD2.2",..: 46 47 48 49 50 51 34 35 36 25 ...
 $ av.canopy    : num  92.2 92.2 92.2 92.3 92.3 ...
 $ landform     : Factor w/ 3 levels "abandoned","ridge",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ seedling.2016: int  6 7 5 2 5 4 4 6 4 0 ...

The problem is when I compared number of seedlings between landforms using this code:

seedling <- glmer(seedling.2016 ~ landform +(1|plot), family = poisson)

The result does not make sense for me-there were no any significant different between landforms event there is only one landform (ridge) that has seedlings, while other had no seedlings at al. It is also note that SEs are enormous. The result looks like this:

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: poisson  ( log )
Formula: seedling.2016 ~ landform + (1 | plot)
   Data: streblus.subplots

     AIC      BIC   logLik deviance df.resid 
   294.9    303.3   -143.5    286.9       56 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.3619 -0.0001 -0.0001  0.0000  8.7305 

Random effects:
 Groups Name        Variance Std.Dev.
 plot   (Intercept) 2.637    1.624   
Number of obs: 60, groups:  plot, 20

Fixed effects:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -20.412   1461.267  -0.014    0.989
landformridge   22.250   1461.265   0.015    0.988
landformtemp     1.066    390.540   0.003    0.998

When I changed link function to square root as this code:

Seedling2 <- glmer(seedling.2016 ~ landform +  (1|plot), family = poisson(link = sqrt))
Fixed effects:
#Estimate      Std. Error z value Pr(>|z|)    
#(Intercept)   -1.220e-05  5.296e-01   0.000        1    
#landformridge  3.250e+00  7.429e-01   4.376 1.21e-05 ***
# landformtemp   1.018e-05  7.795e-01   0.000        1  

Now number of seedlings in ridge is significantly higher than the other, and it makes more sense to me.

My question is: Is it valid in term of statistics to use square root link with Poisson distribution, there are any better methods available with better ground of statistics?

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
Ponlawat
  • 51
  • 4

2 Answers2

4

It looks very much like you have a case of complete separation:

  • there is only one landform (ridge) that has seedlings, while other had no seedlings at al

  • large estimates ($|\hat \beta|>10$), and ridiculously large standard error estimates.

Basically what's happening is that the baseline level ("abandoned") has an expected number of counts equal to zero for all plots, so the intercept $\beta_0$ - which is the expected log(counts) for the baseline level - should be estimated as $-\infty$ ... which messes up the Wald estimation of the uncertainty (the approximate, fast method that summary() uses).

You can read more about complete separation elsewhere; it is more typically discussed in the context of logistic regression (in part because logistic regression is more widely used than count regression ...)

Solutions:

  • your square-root-link solution is reasonable (in this case the intercept is expected $\sqrt{\textrm{counts}}$ in the baseline level, which is zero rather than $-\infty$); it will change the assumed distribution of the random effects slightly (i.e., Normal on the square-root rather than on the log scale), but that wouldn't worry me too much. If you had continuous covariates or interactions in the model, it would also change the interpretation of the fixed effects.
  • you could use some kind of penalization (most conveniently in a Bayesian framework), as described in my answer to the linked question (and here, search for "complete separation") to keep the parameters reasonable.
Ben Bolker
  • 34,308
  • 2
  • 93
  • 126
  • 1
    Ben, this does not affect your excellent answer, but should the final model (whatever it is) include a random effect for subplot as well? – Isabella Ghement Oct 26 '18 at 15:32
  • 2
    probably. There are 60 observations and 60 levels of the subplot factor, so this will be an observation-level random effect, which will essentially (try to) account for overdispersion. (Alternatively one could fit a negative binomial model.) – Ben Bolker Oct 26 '18 at 20:48
  • Thank you very much @BenBolker I have checked the link for common separation. I am just not quit sure where the 9 in the "fixef.prior = normal(cov = diag(9,4)))" in the example come from. – Ponlawat Oct 30 '18 at 10:16
  • 1
    it's the variance of the prior on the fixed effects (i.e. the standard deviation is 3, which is moderately large for a parameter on the logit scale) – Ben Bolker Oct 30 '18 at 11:48
4

Indeed this seems to be a separation issue. To account for these cases, in my GLMMadaptive package you can include a penalty for the fixed-effects coefficients in the form of a Students-t density (i.e., for large enough df equivalent to ridge regression). For a worked example, have a look at the last section of this vignette.

Dimitris Rizopoulos
  • 17,519
  • 2
  • 16
  • 37
  • I have tried the package using family = poisson(), penalized = TRUE, and it is error. But when I tried family = poisson(link=sqrt), it works. – Ponlawat Oct 30 '18 at 10:24
  • It would help if I could see the error or better have data that reproduce it. – Dimitris Rizopoulos Oct 31 '18 at 08:57
  • https://github.com/MrPonlawat/Streblus/blob/master/stre.question.csv plase find the link for the data. Thank you very much for your time and effort. – Ponlawat Nov 01 '18 at 10:27
  • It works for me when you set the initial values to 0, i.e., `mixed_model(seedling.2016 ~ landform, random = ~ 1 | plot, data = stre_question, family = poisson(), initial_values = list(betas = c(0, 0, 0)), penalized = TRUE)` – Dimitris Rizopoulos Nov 06 '18 at 08:03
  • Thank you very much. It works now. I just wondering about what the zeros in "list(betas = c(0, 0, 0))" means. I read the vignette but still do not understand. – Ponlawat Nov 13 '18 at 12:50
  • This set the initial values for the fixed effects at zero, which is here required because the default is to get the initial values from a Poisson log linear model ignoring the correlations, but this also suffers from the separation effect. – Dimitris Rizopoulos Nov 14 '18 at 06:22