Is an ANOVA applicable for these data?

Question

I have a data set from 7 groups, with 20 fish in each group. A parameter is measured on 25 cells from each fish (so each observation in the data set is completely independent, right?). One of the groups serves as the control group, while the other 6 are treatment groups. So we have a total of 25*20*7 measurements. This is what the data look like (a boxplot of all 7 groups is attached):

samples subjects groups response
    1        1      1     4.85
    2        1      1     3.77
    ..
    25       1      1     4.71
    26       2      1     4.51
    ..
    500      20     1     4.21
    501      1      2     4.11
    ..
    3500     20     7     4.19

I wish to run an ANOVA, and the expectation is that a couple of groups should differ from the control group with regard to the parameter under observation. Here are a few questions:

Is the following R code appropriate? (It shows there is no significant difference between groups.)

n = 20
k = 25
g = 7
subjects = gl(n,   k, n*k*g)
groups   = gl(g, n*k, n*k*g)

study1 = data.frame(c(1:(n*k*g)), subjects, groups, r11)
colnames(study1) = c("samples", "subjects", "groups", "response")

fit = lm(response~groups + samples*subjects, data=study1) # or aov?
anova(fit)

Analysis of Variance Table

Response: response
                   Df Sum Sq Mean Sq  F value    Pr(>F)    
groups              6  846.8 141.134 122.2864 < 2.2e-16 ***
samples             1   13.1  13.055  11.3114 0.0007787 ***
subjects           19  119.5   6.289   5.4493 2.078e-13 ***
samples:subjects   19  149.6   7.872   6.8206 < 2.2e-16 ***
Residuals        3454 3986.4   1.154                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The attached qqplot shows normal residuals; however, the Shapiro-Wilk test always fails on all groups. (Is my sample size of 3500 too big and problematic?)

shapiro.test(study1$response[study1$groups==1])

data:  study1$response[study1$groups == 1]
W = 0.9818, p-value = 6.648e-06

And so does Levene's test for equality of variance:

leveneTest(lm(response ~ groups, data=study1))
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    6   19.37 < 2.2e-16 ***
      3493

Please guide me as to how I should proceed. Should I keep using ANOVA and disregard the fact that the normality and equality of variance assumptions are being violated? Should I remove the outliers from my data? Should I transform the data somehow to be 'more' normal? Should I switch to nonparametric or rank-based tests? The end goal is to identify groups that differ significantly from the control group.

score 5 · Accepted Answer · edited Apr 13 '17 at 12:44

The qq-plot does not show normal residuals. You can see that the points have a shallow, concave-up curve and the ends move outside of the 95% confidence band (especially at the high end). Your data are more (positively) skewed than we would expect to find in a sample from a true normal distribution. In addition, your data are not independent since you have multiple samples from each fish and you have not properly accounted for that in your model. Thus, your data do not meet any of the three standard assumptions for an ANOVA (1. independence, 2. homoscedasticity, 3. normality).

That said, deviations from normality and homoscedasticity are easier to find with higher N, even if they are too trivial to worry about (see: Is normality testing 'essentially useless'?). Although I cannot tell how far your data are from a true normal, the plots appear to show fairly minor deviations. The ANOVA is pretty robust to these violations, especially if they are small or if you have a lot of data. You probably don't have to worry too much about these problems.
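The point about sample size can be illustrated with a small simulation. This is only a sketch under assumed conditions: the gamma distribution, its parameters, and the seed are illustrative choices, not the fish data from the question. It draws a small and a large sample from the same mildly skewed population and runs the Shapiro-Wilk test on each.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# ASSUMPTION: a mildly right-skewed population (gamma, shape 20) stands in
# for data that are "close to, but not exactly" normal.
draw <- function(n) rgamma(n, shape = 20, rate = 4)

small <- draw(50)
large <- draw(5000)  # shapiro.test() accepts at most 5000 values

p.small <- shapiro.test(small)$p.value
p.large <- shapiro.test(large)$p.value

# Same population, same degree of non-normality, very different sample sizes
print(c(n50 = p.small, n5000 = p.large))
```

With the larger sample the test has far more power to detect the same modest skew, which is why a rejection by itself says little about whether the deviation matters in practice.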

On the other hand, violations of independence tend to be more serious. It appears to me that you attempted to deal with this by having a samples variable with a unique value for each fish*sample combination in your dataset. This is incorrect on several levels. First, you need to group the samples within each fish, not have a unique value for each sample. Second, the variable should be a factor, not a numeric variable. Third, fitting this as a fixed effect is possible, but a poor strategy that loses power and most likely does not reflect the question you want to ask.

Instead, you need to fit a mixed effects model. That is, you need to have a factor variable fish.ID with 7x20=140 levels, where all 25 samples from a given fish are assigned to the same level of fish.ID. Then you would fit a random intercept for each level of the fish.ID variable. Under the assumption that the data are ordered by samples within the fish, the code might be something like this:

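library(lme4)
set.seed(9308)

n = 20   # fish per group
k = 25   # cells measured per fish
g = 7    # groups (1 control + 6 treatments)
groups   = gl(g, n*k, n*k*g)
r11      = rnorm(g*n*k)                  # simulated stand-in for the real response
fish.ID  = factor(rep(1:140, each=25))   # one level per fish; assumes rows are ordered by fish
study1   = data.frame(r11, groups, fish.ID)
colnames(study1) = c("response", "groups", "fish.ID")

# random intercept for each fish
lmm.fit = lmer(response~groups+(1|fish.ID), data=study1)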
anova(lmm.fit)
# Analysis of Variance Table
#        Df Sum Sq Mean Sq F value
# groups  6 3.2844  0.5474  0.5383

Gung, What you suggested makes a lot of sense. May I seek more help in the direction of interpreting the results of lmm, more specifically obtaining a p-value and group identification for those that differ from control. `Fixed effects: Estimate Std. Error t value (Intercept) 5.0971 0.1436 35.50 groups2 0.7495 0.2030 3.69 groups3 -0.4539 0.2030 -2.24 groups4 -0.1903 0.2030 -0.94 groups5 0.3045 0.2030 1.50 groups6 -0.8071 0.2030 -3.98 groups7 -0.5140 0.2030 -2.53` — Apo, Aug 20 '15 at 20:25
@Apo, that is really a different Q--you might start a new thread. FWIW, this is a common Q. The default advice is a likelihood ratio test. That won't help in your situation. If your control group is set as the reference level / intercept, you could use the summary output (if it had p values) as long as you control for the multiple tests & their non-orthogonality (ie use Bonferroni). There is information about how to get p values [here](http://mindingthebrain.blogspot.com/2014/02/three-ways-to-get-parameter-specific-p.html). — gung - Reinstate Monica, Aug 20 '15 at 20:52
If I have 4 sets of measurements as described above, 2 from the fish head and 2 from the body and tail, would the following formula be correct: response~groups+body_part+(1|fish.ID) — Apo, Aug 23 '15 at 17:12
One last question before I close this thread: would a multivariate model like the following apply: {resp_head,resp_body,resp_tail}~groups+body_part+(1|fish.ID). How would it differ from the univariate one? — Apo, Aug 23 '15 at 18:05
@Apo, that doesn't make any sense. You have body parts (in different forms) on both sides of the equation. What you had above is better. — gung - Reinstate Monica, Aug 23 '15 at 18:11
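As a rough illustration of the Bonferroni suggestion in the comments, the t values posted there can be turned into adjusted p-values. This is a sketch only: it assumes group 1 is the control (the reference level) and uses a normal approximation to the t statistics, which is an assumption made here because the posted output carries no degrees of freedom. The methods in the linked post are likely preferable for a real analysis.

```r
# t values copied from the fixed-effects table posted in the comments
t.vals <- c(groups2 = 3.69, groups3 = -2.24, groups4 = -0.94,
            groups5 = 1.50, groups6 = -3.98, groups7 = -2.53)

# ASSUMPTION: treat each t value as approximately standard normal
p.raw <- 2 * pnorm(-abs(t.vals))

# Bonferroni correction for the 6 treatment-vs-control comparisons
p.bonf <- p.adjust(p.raw, method = "bonferroni")

print(round(cbind(p.raw, p.bonf), 4))
```

Each row compares one treatment group with the control, so the adjusted column is the one to read when deciding which groups differ.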
