Nested ANOVA, three way ANOVA, mixed model, or ANCOVA?

Question

I need to compare total concentrations of a chemical across different tissues of a plant species. I sampled multiple tissues from each individual, and sampled at multiple sites. I do not know how to handle the individual and site covariates. There is good reason to expect a site effect on chemical concentrations. I need to control for the fact that concentrations will be correlated across tissues within individuals, but am only interested in whether concentrations differ consistently across tissues. I.e. are leaf concentrations always the lowest, etc.

I have two questions -

Question 1 what it the most appropriate analysis to use?

I do not think I have enough data for a mixed effects model, and would like to avoid using one if possible.

So far, I've used: (1) a three way ANOVA with tissue, individual, and site as factors.

3wy_mod <- (lm(total_conc ~ tissue+indv+site, data=data))

However, I feel like that isn't correct. I think that the model structure should have tissue nested within individual nested within site. (2) to address the nesting issue, I've also used this model:

nest <- anova(lm(total_conc ~ tissue/indv/site, data=data))

The nested model seems correct. However, I'm not sure how to validate this model. This brings me to

Question 2

How do I assess model fit, check assumptions, etc with whichever model structure turns out to be appropriate?

Here is an attempt at a reproducible example:

data <- structure(list(indv = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 
7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 
11L, 11L, 11L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 
14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 
17L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 
21L, 21L, 21L, 21L, 22L, 22L, 22L, 22L, 23L, 23L, 23L, 23L, 24L, 
24L, 24L, 24L, 25L, 25L, 25L, 25L, 25L, 26L, 26L, 26L, 26L, 26L, 
27L, 27L, 27L, 27L, 28L, 28L, 28L, 28L), tissue = structure(c(3L, 
5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 
5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 3L, 1L, 5L, 4L, 3L, 1L, 
5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 
5L, 4L, 3L, 1L, 5L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 
1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 
1L, 4L, 3L, 1L, 5L, 2L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 4L, 3L, 
1L, 5L, 4L, 3L, 1L, 5L, 2L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L
), .Label = c("flower", "fruit", "leaf", "pollen", "stem"), class =          "factor"),
site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("a", 
"b", "c", "d"), class = "factor"), total_conc = c(8251.26, 
3845.59, 6681.23, 504.47, 21494.77, 13514.46, 14244.08, 1091, 
4156.25, 8301.19, 12968.21, 5536.14, 9034.05, 11401.89, 12010.11, 
2013.21, 14997.58, 12266.25, 23068.28, 4784.39, 9326.1, 5522.93, 
9814.8, 3467.77, 7107.22, 6129.18, 7330.3, 7242.1, 8668.38, 
5214.82, 1891.02, 5918.55, 11268.32, 3267.63, 3648.9, 13406.13, 
15847.31, 6775.97, 1781.21, 10943.45, 14477.16, 7890.34, 
1738.36, 14467.25, 11777.99, 5551.34, 1136, 5537.86, 18114.18, 
2895.41, 999.91, 3633.19, 7243.96, 4856.73, 8317.76, 15885.53, 
6811.57, 15818.11, 1316.81, 15347.71, 8068.19, 14688.62, 
2247.32, 17762.4, 6628.74, 16494.26, 11255.5, 19459.68, 5960.92, 
20607.73, 3226.06, 11572.8, 7924.07, 12091.07, 2934.25, 12734.17, 
5158.48, 16520.8, 1431.31, 14007.17, 4689.99, 23774.68, 5799.91, 
12903.34, 18982.81, 9125.87, 14869.07, 20022.23, 28254.47, 
9459.41, 1977.36, 22562.52, 22195.51, 12376.81, 2284.3, 1358.63, 
24866.6, 15789.27, 9544.07, 275.32, 17669.95, 18935.79, 9357.06, 
18818.68, 1676.3, 23224.24, 25432.66, 12328.59, 1367.51, 
24176.56, 26710.84, 21184.79, 1642.89)), .Names = c("indv", 
"tissue", "site", "total_conc"), class = "data.frame", row.names = c(NA, 
-113L))

3wy_mod <- anova(lm(total_conc ~ tissue+indv+site, data=data))

nest <- anova(lm(total_conc ~ tissue/indv/site, data=data))

A few questions before an answer: (1) Are individuals only at one site? Or can an individual come from multiple sites? (2) You mention not having enough data. How many sites are there? How many individuals do you have, per site (on average), and how many tissues do you have, per person (on average)? — Mark White, Nov 02 '17 at 16:25
One more: (3) What are your predictor variables of interest? And what are they characteristics of? Are they characteristics of the tissue, the individual, or the site? Or do you have predictors and covariates at all levels? — Mark White, Nov 02 '17 at 16:26
Can you provide a minimal dataset that shows the structure of your data? See here for how to do that: http://adv-r.had.co.nz/Reproducibility.html — Stefan, Nov 02 '17 at 16:39
@MarkWhite In answer to your questions: (1) Yes; (2) There are 4 sites, there are an average of 7 individuals per site, and there are a maximum of 5 tissue types sampled though we do not have samples of all tissues for all individuals; (3) The predictor variable of interest is tissue type. We are interested in ranking the tissue types with respect to concentrations of the chemical of interest (the response) — JKO, Nov 02 '17 at 19:24
@Stefan I've just uploaded what I hope is a reproducible example. — JKO, Nov 02 '17 at 19:40
As you do not have a balanced design (you don't have samples of all tissues for all individuals), the `anova(lm())` approach in R might not be appropriate. At best, the results from `anova()` may depend on the order of entering the variables into the model. A mixed-effects model might be necessary. — EdM, Nov 02 '17 at 20:57
@EdM - Point taken. However, am I correct in thinking that the lack of balance will only make the test more conservative? As it is, my effect size is pretty substantial and the p-values are very very low, so I'm not too concerned about using an overly conservative test. What are your thoughts on this? — JKO, Nov 02 '17 at 21:04
I agree with @EdM . I also don't see why you say you don't have enough data for a mixed model? Either `lmer(total_conc ~ tissue + (1|indv) + (1|site), data)`, or `total_conc ~ tissue + site (1|indv), data` works fine for me. — Stefan, Nov 02 '17 at 21:04
@Stefan - The sample data I included in the example are for one species. I am running separate analyses of the same kind for two other species and I have much less data for those (only 7 reps for one of those species). I would like to use the same analysis for all species. — JKO, Nov 02 '17 at 21:07
Again, it's hard to assess without the data for that specific case... You may have to build different models in order to accommodate "not enough data". Maybe the one-fits-all scenario doesn't work in this case?! What error do you get when you run the two models I provided on the different species? — Stefan, Nov 02 '17 at 21:20
@Stefan - I don't get an error when I run the model, but I get the following when I call summary() Error in calculation of the Satterthwaite's approximation. The output of lme4 package is returned summary from lme4 is returned some computational error has occurred in lmerTest *And I am only using individual as a random effect for this one, as we only had a single site for that species — JKO, Nov 02 '17 at 21:31
Maybe in that case you don't have enough data to compute the Satterthwaite's approximation for degrees of freedom (but I don't know that). Have a look at the `afex` package and the `mixed()` function. Then look at the `methods` argument where you can specify different methods for obtaining p-values. — Stefan, Nov 02 '17 at 21:52
@Stefan @EdM @JKO it doesn't seem like there would be enough data for a three-level multilevel model, imo. Just because it runs doesn't mean you are getting valid estimates, right? It isn't a perfect solution, but I would suggest doing a Bayesian multilevel model. Even weakly informative priors could nudge the estimation process along. If you don't know Stan, but know `lme4`, the `brms` package specifies models just like `lme4`, but has extra arguments for priors. You need a C++ compiler for it, but it's a fantastic package for multilevel models. — Mark White, Nov 02 '17 at 22:54

score 3 · Accepted Answer · answered Nov 04 '17 at 16:08

The crossed, non-nested model is incorrect, as you suspect, and seems to be over-specified for your posted data. It's important to get the nesting correct, which can be confusing. This page is a good guide. With multiple individuals per site and each individual restricted to one site, the individuals are nested within site. If you were to use your anova(lm()) approach you should be writing:

total_conc ~ tissue/site/indiv

which is expanded to:

total_conc ~ 1 + tissue + tissue:site + tissue:site:indiv

So this model examines the main effect of tissue plus its interactions with site and with site:indiv, without main effects for site or indiv. The result of your anova(lm()) on your posted data* would then be:

Analysis of Variance Table

Response: total_conc
                 Df     Sum Sq   Mean Sq  F value  Pr(>F)  
tissue            4 3023661191 755915298 1288.243 0.02089 *
tissue:site      12 1280847591 106737299  181.903 0.05788 .
tissue:site:indv 95 1325836886  13956178   23.784 0.16203  
Residuals         1     586780    586780

You only have 1 residual degree of freedom because most cases have only one measurement on each tissue from an individual. So the three-way interaction is fairly meaningless. Although results from the Type I sums of squares used by anova(), when applied to unbalanced data, can depend on the order of entry of variables into the model formula, in this case there really is only one order of entry that makes sense, so that's not such an issue.

These data are highly unbalanced, with observations on fruit restricted to 2 indiv at one particular site. If that's characteristic of your real data, then you probably should consider removing such tissues with low numbers of observations from your analyses. Would you (or a skeptical reviewer) really trust analyses based on 2 observations from a single site? Removing the fruit observations from these data, at least, would leave a reasonably balanced design that would probably be good enough for standard nested ANOVA.

Note that both the nested ANOVA approach and mixed models with intercept-only terms like (1|indiv) are making an implicit assumption that all effects of individuals and sites on total_conc are additive without regard to the tissue being evaluated. You must use your knowledge of the subject matter as to the validity of that assumption. With the raw means of total_conc ranging from 2804 (pollen) to 16844 (fruit) I would be a bit worried about that assumption. In principle the formalism of a mixed model could incorporate random effects that differ among tissue values, but you don't seem to have enough data for that. The suggestion from @MarkWhite to try a Bayesian multilevel model might help, but I have no experience with such models.

If there is some irreducible imbalance in the data then I think that many would argue for a mixed-model approach. My guess is that the errors that you discover with mixed models due to small numbers of observations would also be seen in the anova(lm()) approach if you dug more deeply. For example, that approach on the crossed model applied to your data provides what seems to be a perfectly reasonable ANOVA table.

> anova(lm(total_conc ~ tissue+site+indv, data=data))
Analysis of Variance Table

Response: total_conc
          Df     Sum Sq   Mean Sq F value    Pr(>F)    
tissue     4 3023661191 755915298  47.967 < 2.2e-16 ***
site       3  746832764 248944255  15.797 3.552e-08 ***
indv      24  583953341  24331389   1.544   0.07733 .  
Residuals 81 1276485152  15759076

Nevertheless, the underlying crossed model can't define 3 regression coefficients because of singularities:

> summary(lm(total_conc ~ tissue+site+indv, data=data))

Call:
lm(formula = total_conc ~ tissue + site + indv, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9199.3 -2226.1   -84.8  1890.3  9891.5 

Coefficients: (3 not defined because of singularities)

*Note that the indiv values in the posted data need to be converted to factors first.

Many thanks for your answer, it addressed my question exactly and thoroughly. I have indeed dropped fruit from all analyses. — JKO, Nov 04 '17 at 22:51

Nested ANOVA, three way ANOVA, mixed model, or ANCOVA?

1 Answers1