Both variables of my GLMM output are significant. Don't know how to interpret it?

Question

This is more of an interpretation question than anything. I have run a GLMM with two fixed factors (both of which have two levels) and two random factors. The outputs from the model are as such:

Fixed effects:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)            2.46847    0.31386   7.865 3.69e-15 ***
data_f$Treatment2      1.41217    0.20681   6.829 8.58e-12 ***
data_f$site1          -0.09861    0.33342  -0.296    0.767

What I don't understand is how both the intercept and Treatment2 can be significant. I am comparing count data, and I think I am testing whether there is a significant difference between the two treatment types. If that is the case, how can there be significant activity at both?

As suggested I have added the boxplot:

logged boxplot:

full function:

data_f$Count ~ data_f$treatment.type + data_f$site + (1 | data_f$count_location)

It may help you to look at the descriptive statistics (and add the resulting graph here in your question). Say you use the command `boxplot(data_f$y~data_f$Treatment2+data_f$site1,ylim=c(0,4))` (where you have to replace the 'y' by the name of your dependent variable) then I expect: 1) a boxplot with four (2x2) levels, 2) where the difference in the means for different site_1 will be small (and negative), 3) the difference in the means for different treatment2 will be large (significant), and 4) **all of the four groups will be different from 0** (*also* significant). — Sextus Empiricus, Oct 18 '17 at 12:32
You are surprised about the significant intercept. So my question to your question is... do you not understand why all of the four groups are different from zero, or do you not understand the interpretation of the (significant) model intercept parameter? (This is an interesting question for different cases, since a significant intercept does not necessarily reflect all groups being different from zero; for instance, compare `summary(lm(c(1,1,0,0)~c(0,0,1,1.1)))` and `summary(lm(c(1,1,0,0)~c(1,1.1,0,0)))`.) — Sextus Empiricus, Oct 18 '17 at 12:40
Hi @MartijnWeterings thank you very much for your comments. I have added the boxplot as suggested. I had not thought to look at the descriptives. Am I right in thinking from the graph and what you said that the model outputs are saying that there are significantly more counts at HMB (site2) compared with RF(site1) and that there are significantly more counts at hedges (treatment2) than fields (treatment1) not that fields have a positive significant effect on counts? — DFinch, Oct 18 '17 at 13:22
I will post you an answer. Can you in the meantime make the boxplot with a logarithmic scale for a better view. Also, the answer to the question is about (simple) linear models in general... yet it would be nice if you place the full function call to the glmm model (if it is not too complicated), such that it will be clear to viewers that you have additional random effects and a Poisson regression. — Sextus Empiricus, Oct 18 '17 at 13:40

score 1 · Accepted Answer · answered Oct 18 '17 at 16:07

It is very useful that you have plotted your data, since the interpretation of your model can vary based on how you set up the formula.

Your model uses the formula:

$$log(y) = \beta_1 + \beta_2 \text{ treatment} + \beta_3 \text{ site}$$

(with $y$ standing for the expected count), which effectively becomes a set of equations, one for each combination of levels:

$$log(y) = \left\{ \begin{array}{@{}ll@{}} \beta_1, & \text{if 'treatment = field' and 'site = HMB'}\\ \beta_1+\beta_2, & \text{if 'treatment = Hedge' and 'site = HMB'}\\ \beta_1+\beta_3, & \text{if 'treatment = field' and 'site = RF'}\\ \beta_1+\beta_2+\beta_3, & \text{if 'treatment = Hedge' and 'site = RF'}\\ \end{array}\right. $$

where I assume that the levels that come second in your boxplot are coded as level 1, and are the ones used in those if-statements to differentiate from the intercept $\beta_1$.
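To read these four cases on the count scale, you can exponentiate the sums of the reported estimates. The following is only a sketch: it assumes the log link of a Poisson model, the level coding guessed above (field and HMB as reference levels), and random effects set to zero. The variable names are made up for illustration.

```r
# Fixed-effect estimates copied from the model output in the question
b1 <- 2.46847   # intercept: assumed to be treatment = field, site = HMB
b2 <- 1.41217   # Treatment2: assumed to be hedge relative to field
b3 <- -0.09861  # site1: assumed to be RF relative to HMB

# Linear predictor (log scale) for each of the four groups
log_mu <- c(field_HMB = b1,
            hedge_HMB = b1 + b2,
            field_RF  = b1 + b3,
            hedge_RF  = b1 + b2 + b3)

# Back-transform to expected counts (at a random effect of zero)
mu <- exp(log_mu)
print(round(mu, 2))

# On the count scale the treatment effect is a ratio,
# and without a cross term it is the same at both sites
print(exp(b2))
print(mu[["hedge_RF"]] / mu[["field_RF"]])
```

Seen this way, the significant intercept only says that the expected count in the reference group differs from exp(0) = 1. It is not a statement about an effect of the reference treatment.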

This coding scheme can be changed in all kinds of ways, and the choice can strongly change the output. See for instance the switch of labels in the example below:

> summary( lm( c(1,1.1,0,0) ~ 1 + c(0,0,1,1)))$coefficients
               Estimate     Std. Error  t value     Pr(>|t|)
(Intercept)    1.05         0.03535534  29.69848    0.001131862  **
c(0, 0, 1, 1) -1.05         0.05000000 -21.00000    0.002259890  **
> summary( lm( c(1,1.1,0,0) ~ 1 + c(1,1,0,0)))$coefficients
               Estimate     Std. Error  t value     Pr(>|t|)
(Intercept)   -2.220446e-16 0.03535534 -6.28037e-15 1.00000000
c(1, 1, 0, 0)  1.050000e+00 0.05000000  2.10000e+01 0.00225989   **

In your case the image below explains two effects in the results:

Because you are not using a cross term, the difference between treatment groups Field and Hedge is estimated to be the same for both site groups HMB and RF (and vice versa). You can see this in the graph, where the blue dotted lines have the same angle. Yet we see that the variation in effect a is larger in one group of effect b compared to the other group of effect b (you can replace the labels a and b by treatment and site in any order). This means that the effect size is underestimated for one group and overestimated for the other. This partly explains why the means do not match in the image; the other part of the explanation is that the bars in the boxplot are medians, not means, and the data are skewed.
The intercept is a relative term and depends on where you place the origin. Analogous to a typical linear curve fit, you can place this origin anywhere you want. See the image below, which places the origin in the lower left corner, but you could choose any other:

What matters here is that you look at the image and form a sensible idea about the relationship (or do so in advance, if theory allows it). For instance, a sensible choice would be to place the origin in between the sites and at the point of no treatment; in that case $\beta_2$ is the effect size of the treatment and $\beta_3$ the contrast between the sites.
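As a rough sketch of such a choice, the site can be coded as -1/+1 and the treatment as 0/1. The data below are hypothetical toy counts, not yours, and a plain Poisson `glm` without random effects is used to keep the example self-contained. All column names are assumptions for illustration.

```r
# Hypothetical toy counts for a 2x2 design (not the asker's data)
toy <- data.frame(
  count     = c(3, 5, 4, 20, 25, 18, 2, 4, 3, 15, 22, 17),
  treatment = rep(rep(c("field", "hedge"), each = 3), times = 2),
  site      = rep(c("HMB", "RF"), each = 6)
)

# Treatment: 0 = field (no treatment), 1 = hedge
toy$treat01 <- as.numeric(toy$treatment == "hedge")
# Site: -1 = HMB, +1 = RF, so that 0 lies halfway between the two sites
toy$site_pm <- ifelse(toy$site == "RF", 1, -1)
# Ordinary dummy coding for comparison: 0 = HMB, 1 = RF
toy$site01 <- as.numeric(toy$site == "RF")

fit_dummy    <- glm(count ~ treat01 + site01,  family = poisson, data = toy)
fit_centered <- glm(count ~ treat01 + site_pm, family = poisson, data = toy)

print(coef(summary(fit_dummy)))
print(coef(summary(fit_centered)))
```

In this sketch the treatment coefficient is the same in both fits. The site coefficient in the centered fit is half the dummy-coded one, because it measures the distance from the midpoint to either site, and the intercept moves to the midpoint between the sites for the field group. The same numeric columns should be usable in the fixed part of a mixed model formula.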

It is only for particular cases (when the intercept is an important term) that you may wish to think more deeply about the position of the intercept/origin.

Personally, if I want a quick and simple result and am not too bothered with these nuances (the intercept and so on), I use a graphical interpretation, with the ANOVA (or another statistical test) serving just as a numerical measure of what the eyes already see.

See also the next piece of code for a demonstration of the arbitrariness of the origin/intercept:

> set.seed(1)
> x1 <- c(1,1,1,1,0,0,0,0)
> x2 <- c(1,1,0,0,0,0,1,1)
> y <- x1+0.5*x2+c(0.6,0.5,0,0,0,0,0,0)+rnorm(8,0,0.5)

> summary(lm(y ~ 1+ factor(x1,levels=c(0,1)) + factor(x2,levels=c(0,1))))$coefficients
                                 Estimate Std. Error    t value   Pr(>|t|)
(Intercept)                   -0.07779159  0.2703511 -0.2877428 0.78508880
factor(x1, levels = c(0, 1))1  1.22275607  0.3121746  3.9168984 0.01121690 *
factor(x2, levels = c(0, 1))1  0.83928146  0.3121746  2.6885004 0.04337644 *

> summary(lm(y ~ 1+ factor(x1,levels=c(0,1)) + factor(x2,levels=c(1,0))))$coefficients
                                Estimate Std. Error   t value   Pr(>|t|)
(Intercept)                    0.7614899  0.2703511  2.816670 0.03725437 *
factor(x1, levels = c(0, 1))1  1.2227561  0.3121746  3.916898 0.01121690 *
factor(x2, levels = c(1, 0))0 -0.8392815  0.3121746 -2.688500 0.04337644 *

> summary(lm(y ~ 1+ factor(x1,levels=c(1,0)) + factor(x2,levels=c(0,1))))$coefficients
                                Estimate Std. Error   t value    Pr(>|t|)
(Intercept)                    1.1449645  0.2703511  4.235102 0.008208024 **
factor(x1, levels = c(1, 0))0 -1.2227561  0.3121746 -3.916898 0.011216902 *
factor(x2, levels = c(0, 1))1  0.8392815  0.3121746  2.688500 0.043376437 *

> summary(lm(y ~ 1+ factor(x1,levels=c(1,0)) + factor(x2,levels=c(1,0))))$coefficients
                                Estimate Std. Error   t value     Pr(>|t|)
(Intercept)                    1.9842459  0.2703511  7.339515 0.0007366259 ***
factor(x1, levels = c(1, 0))0 -1.2227561  0.3121746 -3.916898 0.0112169024 * 
factor(x2, levels = c(1, 0))0 -0.8392815  0.3121746 -2.688500 0.0433764368 *

note: in the case of an additional cross term the position of the origin not only influences the intercept term, but also the effect sizes.
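A small sketch of this note, reusing the simulated data from the demonstration above. The factor names `a`, `b_ref0` and `b_ref1` are made up for illustration.

```r
# Same simulated data as in the demonstration above
set.seed(1)
x1 <- c(1, 1, 1, 1, 0, 0, 0, 0)
x2 <- c(1, 1, 0, 0, 0, 0, 1, 1)
y  <- x1 + 0.5 * x2 + c(0.6, 0.5, 0, 0, 0, 0, 0, 0) + rnorm(8, 0, 0.5)

a <- factor(x1, levels = c(0, 1))
# Two codings of x2 that differ only in the reference level
b_ref0 <- factor(x2, levels = c(0, 1))
b_ref1 <- factor(x2, levels = c(1, 0))

# Models with a cross (interaction) term
fit_ref0 <- lm(y ~ a * b_ref0)
fit_ref1 <- lm(y ~ a * b_ref1)

# The coefficient "a1" is the x1 effect at the reference level of x2,
# so it changes when the reference level of x2 changes
print(coef(fit_ref0))
print(coef(fit_ref1))

# Without the cross term the x1 effect does not depend on the coding of x2
print(coef(lm(y ~ a + b_ref0))[["a1"]])
print(coef(lm(y ~ a + b_ref1))[["a1"]])
```

Because this toy design is balanced, the x1 effect from the model without the cross term is the average of the two reference-specific effects from the models with the cross term.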

another note: with a post-hoc test, in which you make pairwise comparisons of the predicted values for the groups (and don't bother anymore about the model parameters), you can avoid all this interpretation stuff
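A simplified sketch of what such pairwise comparisons could look like, again with the simulated data from above. Note the simplification: it compares the observed group values directly with base R's `pairwise.t.test`, rather than predictions from a fitted (mixed) model, so treat it as an illustration of the idea only.

```r
# Same simulated data as in the demonstration above
set.seed(1)
x1 <- c(1, 1, 1, 1, 0, 0, 0, 0)
x2 <- c(1, 1, 0, 0, 0, 0, 1, 1)
y  <- x1 + 0.5 * x2 + c(0.6, 0.5, 0, 0, 0, 0, 0, 0) + rnorm(8, 0, 0.5)

# One label per cell of the 2x2 design
group <- interaction(x1, x2, sep = ":")

# Observed mean per group
print(tapply(y, group, mean))

# All pairwise t-tests with a pooled standard deviation and
# no correction for multiple comparisons
pw <- pairwise.t.test(y, group, p.adjust.method = "none", pool.sd = TRUE)
print(pw$p.value)

# The same comparisons with a Holm correction, for a stricter view
pw_holm <- pairwise.t.test(y, group, p.adjust.method = "holm", pool.sd = TRUE)
print(pw_holm$p.value)
```

The output is a table of p-values per pair of groups, which can be read without thinking about which level was chosen as the reference.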

Wow, thank you. Your in-depth answer is greatly appreciated. I have a couple of follow-up questions. You mention that effect sizes are being underestimated for the one group and overestimated for the other group. Is this a major issue, and if so, is there a way to get around it? You also mentioned that you would use the graph to interpret the results and just use the stats as a numerical measure. Am I right in thinking that there is a significantly higher count on the fields overall? Out of interest, how might one 'demand the origin to be in between the sites and at point of no treatment' in my models? — DFinch, Oct 19 '17 at 08:43
I mentioned this thing about looking at the graph, but it is a tricky comment. For hypothesis testing it is important that you do not try a lot of things and that your statistical test is defined from the beginning. However, you are now in an exploratory phase of your research: you are currently puzzling out what your model actually is and how you could describe your data, and only next can you test whether your ideas are correct. So by default your results are poisoned by over-interpretation; they cannot be strong proof and are just an indication.... — Sextus Empiricus, Oct 19 '17 at 08:53
...You seem to have hedge>field and hmb>rf (with the combination hedge+, although hmb being even higher, cross term, so hmb>rf is stronger in the field group, or alternatively said hedge>field is stronger in the hmb group). (This is not necessarily a major issue if you do not include this combination term; it depends on what your story is.) Note: just comparing averages may not provide the entire picture. In the field treatment group you have lots of zero counts. So the shape of the distribution also tells you something (and again it depends on your story if and how you include this). — Sextus Empiricus, Oct 19 '17 at 08:57
In case of a 2x2 design: You could set the levels for the sites to (1,-1) to "center" the origin. And put the levels for the treatments to (0,1) with the zero for the non-treatment. Then you get: $$log(y) = \left\{ \begin{array}{@{}ll@{}} \beta_1-\beta_3, & \text{if 'treatment = field' and 'site = HMB'}\\ \beta_1-\beta_3+\beta_2, & \text{if 'treatment = Hedge' and 'site = HMB'}\\ \beta_1+\beta_3, & \text{if 'treatment = field' and 'site = RF'}\\ \beta_1+\beta_3+\beta_2, & \text{if 'treatment = Hedge' and 'site = RF'}\\ \end{array}\right. $$ — Sextus Empiricus, Oct 19 '17 at 09:01
I agree, and I had thought about using the current models from the start. I have six other species that I am looking at, and all of these showed clear outputs that I think I understand. I did get confused by the output for this species though, hence the questions. The trend with the other spp. is that hedge>field. Ecologically, I do expect more zeros in the field group. From a coding point of view, how would that be written? — DFinch, Oct 19 '17 at 09:13
Post-hoc testing can vary. I advise you to learn a bit about it before believing my advice. It depends a lot on what you want. You could just determine, with two t-tests, the difference between the treatments (once for each site), which may be the only relevant thing for you. Or you could use one of a range of different tests that make multiple comparisons (where it depends a bit on whether you want to have a clear view of significance, which can be corrected in many ways, or just get an idea of the differences). For a *simple exploratory* impression I prefer Fisher's LSD test. — Sextus Empiricus, Oct 19 '17 at 09:15
