How to perform ANOVA followed by a post-hoc for model comparison?

Question

My data is the measurement of 4 type of samples with three different methods where I want to see the interaction and main effect, compare them from each other. The data look like as follows:

df<- structure(list(color = structure(c(3L, 4L, 3L, 4L, 4L, 4L, 4L, 
    4L, 4L, 4L), .Label = c("B", "G", "R", "W"), class = "factor"), 
        type = 1:10, X1 = c(0.006605138, 0.001165448, 0.006975109, 
        0.002207839, 0.00187902, 0.002208638, 0.001199808, 0.001162252, 
        0.001338847, 0.001106317), X2 = c(0.006041392, 0.001639298, 
        0.006140877, 0.002958169, 0.002744017, 0.003107995, 0.001729594, 
        0.001582564, 0.001971713, 0.001693236), X3 = c(0.024180351, 
        0.002189061, 0.027377442, 0.002886651, 0.002816333, 0.003527908, 
        0.00231891, 0.001695633, 0.00212034, 0.001962923)), .Names = c("color", 
    "type", "X1", "X2", "X3"), row.names = c(NA, 10L), class = "data.frame")

The main questions I have are as follows:

1- Whether the first column X1 is significantly different from second (X2) or third (X3) and so on.

2- How can I see the interaction between each sample (they have 4 types: W, B, R and G).

3- Should I perform a one-way ANOVA or N-way ANOVA?

I have searched, but so far I was not successful in finding a suitable and representative way to do it in R or Matlab. I also want to plot the interaction as well.

Do you want to compare all columns (e.,g X1 vs X2, X2 vs X3) or only compare data to column 1 (e.g., X1 vs X2 X1 vs X3)? Or, to help you answer my question (and your question 3): What is your research question? — Richard Erickson, Sep 28 '15 at 13:44
@Richard Erickson I would consider both ways in case it is doable! — Learner, Sep 28 '15 at 13:46
Brian, in your longer example, why do you have `X1`, `X2`, and `X3` as structured factors? The should be numeric vectors like your short example. If you fix that, your longer example should run. — Richard Erickson, Sep 30 '15 at 23:21
@Brian, Out of curiosity, why didn't you award your bounty to my Answer? — Richard Erickson, Oct 06 '15 at 20:10
@Richard Erickson thanks for your message. actually you spent so much time and i appreciate all your time. however, it is not complete, I had three questions, the 3 one you replied but not complete, for example look at the p values, X2:R-X1:R and #X3:W-X1:W and #X3:W-X2:W. which one is really significant? how can one says statistically significant ? lets we have only one color, how can one says a sample is different or there is no differences?? summary(simpleAov) shows all the questions are significant , then what ??? what can be drive from it? — Learner, Oct 06 '15 at 21:21

score 2 · Accepted Answer · edited May 23 '17 at 12:39

To answer your 3rd question: I would use a n-way ANOVA because you have multiple factors you wish to compute. A general stats books such as Zar's Biostatistical Analysis or Sokal and Rohlf's Biometry will provide more information on different types of ANOVAs. I would also recommend you look through a book like this to make sure you understand the underlying statistics you are asking about.

To answer your coding questions: I did a quick search for "n-way anova r", which pointed me towards an R-bloggers post and a linked tutorial. (I included this detail so you can see how I solved your problem so that you may better help yourself in the future). I've summarized this tutorial for my answer and adapted it for your data example.

First, I organize your data using the melt function from the reshape2 package (this changes your results from "short form" to "long form").

library(reshape2)
df2 <- melt(df, id.vars = c("color", "type"))

Second, I ran an ANOVA using the aov function in R. The * symbol denotes an interaction and the formula may be read as value is predicted by variable, color, and the interaction between the two parameters:

simpleAov <- aov(data = df2, value ~ variable * color)
summary(simpleAov)
#                Df    Sum Sq   Mean Sq F value   Pr(>F)    
# variable        2 0.0001260 0.0000630   123.7 2.29e-13 ***
# color           1 0.0005636 0.0005636  1106.7  < 2e-16 ***
# variable:color  2 0.0003766 0.0001883   369.8  < 2e-16 ***
# Residuals      24 0.0000122 0.0000005                     
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The first statistical test asks if the amount of variance explained by the variable is significantly different than that explained by chance alone, which it is. The second test asks the same question for color. The third test examines if there is an interaction between variable and color. Your data has an interaction between the two. If no interaction was present, you could drop the interaction term from your model. Note that your example data only has 2 color, hence there is only 1 degree of freedom for color, even though you ask about 4 colors.

Third, I ran a post-hoc analysis to compare difference between groups. I used the Tukey's Honest Significant Difference test. Other tests exist, and you may wish compare the assumptions in a book such as Zar's (some are more conservative than other with respect to P-value corrections).

TukeyHSD(simpleAov)
#  Tukey multiple comparisons of means
#    95% family-wise confidence level

#Fit: aov(formula = value ~ variable * color, data = df2)

#$variable
#              diff           lwr         upr     p adj
#X2-X1 0.0003760439 -0.0004209547 0.001173043 0.4771586
#X3-X1 0.0045227136  0.0037257150 0.005319712 0.0000000
#X3-X2 0.0041466697  0.0033496711 0.004943668 0.0000000

#$color
#          diff         lwr         upr p adj
#W-R -0.0108362 -0.01150846 -0.01016393     0

#$`variable:color`
#                   diff           lwr          upr     p adj
#X2:R-X1:R -0.0006989890 -0.0029054888  0.001507511 0.9200311
#X3:R-X1:R  0.0189887730  0.0167822732  0.021195273 0.0000000
#X1:W-X1:R -0.0052566024 -0.0070009937 -0.003512211 0.0000000
#X2:W-X1:R -0.0046118002 -0.0063561915 -0.002867409 0.0000003
#X3:W-X1:R -0.0043504036 -0.0060947949 -0.002606012 0.0000008
#X3:R-X2:R  0.0196877620  0.0174812622  0.021894262 0.0000000
#X1:W-X2:R -0.0045576134 -0.0063020047 -0.002813222 0.0000004
#X2:W-X2:R -0.0039128112 -0.0056572025 -0.002168420 0.0000049
#X3:W-X2:R -0.0036514146 -0.0053958059 -0.001907023 0.0000147
#X1:W-X3:R -0.0242453754 -0.0259897667 -0.022500984 0.0000000
#X2:W-X3:R -0.0236005732 -0.0253449645 -0.021856182 0.0000000
#X3:W-X3:R -0.0233391766 -0.0250835679 -0.021594785 0.0000000
#X2:W-X1:W  0.0006448021 -0.0004584478  0.001748052 0.4800838
#X3:W-X1:W  0.0009061988 -0.0001970512  0.002009449 0.1521552
#X3:W-X2:W  0.0002613966 -0.0008418533  0.001364647 0.9757967

Looking through these results, you may see which groups differ.

Last, to plot your results, I use a boxplot (Note that I cleaned up your data because your example data does not have _B_ or _G_):

levels(df2$color) <- c("R", "W", "R", "W") ## remove missing colors, probably not needed with your actual data
boxplot(data = df2, value ~ variable * color)

Edit to explain a boxplot.

The above figure is a boxplot. Another CV question asked about them as well. Basically, a boxplot is a plot of the distributions. The center line is the median of the data. The top "half" of the box contains the 3rd quartile, the bottom half contains the 2nd quartile. The "stems" contain the 1st and 4th quartiles. Any outliers (if present) are depicted as dots.

The x labels are the different treatments with interactions (e.g., X-1.W depicts treatments X-1 and W). The y-axis is the units of your observation, which you do not indicate in your question. To add labels, use the standard R plotting options (e.g., boxplot(..., xlab = "x lab", ylab = "ylab")).

thanks for your message, actually I tired to repeat your path. I found a problem . I got this error for the real data > levels(df2$color) boxplot(data = df2, value ~ variable * color) Error in x[floor(d)] + x[ceiling(d)] : non-numeric argument to binary operator — Learner, Sep 30 '15 at 16:25
when you do TukeyHSD(simpleAov), does it mean that the X1-X2 are statistically significantly different ??? — Learner, Sep 30 '15 at 16:28
I posted a bigger and representative data with 4 different group and error in results — Learner, Sep 30 '15 at 16:37
A couple of points about code. 1) The problem is that your your `data.frame` is characters, not numeric. 2) When you run your real data, do not run the `levels(df2$color) — Richard Erickson, Sep 30 '15 at 16:58
The results from `TukeyHSD(simpleAov)` indicate that we fail to reject the null hypothesis ( `X1-X2` are the same) because P = 0.4771586. However, the other two were different because P < 0.00001. As a sidenote, with null-hypothesis testing be careful with your wording about rejecting — Richard Erickson, Sep 30 '15 at 17:01
what does the last figure mean? what are the x axis and y axis? — Learner, Oct 01 '15 at 07:38

How to perform ANOVA followed by a post-hoc for model comparison?

1 Answers1