How to determine if there is a statistical difference in the means of multiple groups

Question

I am a statistics newbie who is not only trying to learn statistics, but also R.

I have a list of data which contains 7 variables. I would like to determine if there is a statistical difference in the mean of each group. For the sake of explanation (although these are not the actual data parameters), let's say I wanted to compare the weights of fish in 7 different ponds in 7 different states. Each pond has a different number of fish weights. Is there a way in R to automatically determine if there is a difference between each mean? If so, what is it, and what function would be most appropriate? Thanks in advance. Also, I apologize for my ignorance as it relates to this topic; however, I am trying to learn.

Sal Mangiafico · Answer 1 · 2019-10-13T21:21:40.673

This is meant to augment Brent Kerby's answer.

Unfortunately, some of the methods that are used in the native stats package in R aren't the most widely applicable. And unfortunately, these are the methods that you will see most commonly shown in examples.

For example, I don't recommend routinely using the anova function to report the anova table from a linear model. That function uses type-I (sequential) sums of squares, and that is probably not what you want. It doesn't matter in simple balanced designs, but as the models get more complicated and unbalanced, the results from type-I and type-II sums of squares will differ.
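To make the order dependence concrete, here is a minimal sketch using made-up, unbalanced two-factor data. The data frame d, its factors A and B, and the response y are hypothetical and are not from the question. With sequential (type-I) sums of squares, the sum of squares attributed to A changes depending on whether A is entered before or after B.

```r
# Hypothetical unbalanced data: a1 occurs mostly with b1, a2 mostly with b2
d <- data.frame(
  A = factor(c("a1","a1","a1","a1","a1","a2","a2","a2","a2","a2","a2","a2")),
  B = factor(c("b1","b1","b1","b1","b2","b1","b2","b2","b2","b2","b2","b2")),
  y = c(10, 12, 11, 13, 18, 15, 21, 22, 20, 23, 19, 24)
)

# anova() tests terms sequentially, so the order of the terms matters
tab_A_first <- anova(lm(y ~ A + B, data = d))
tab_B_first <- anova(lm(y ~ B + A, data = d))

ss_A_first  <- tab_A_first["A", "Sum Sq"]   # A ignoring B
ss_A_second <- tab_B_first["A", "Sum Sq"]   # A adjusted for B

c(A_entered_first = ss_A_first, A_entered_second = ss_A_second)
```

In a model with main effects only, the second value (A adjusted for B) is the one a type-II table would report for A.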

One simple solution here is to use the Anova function from the car package, which reports type-II sums of squares by default.

For an example, first, let's install packages and make up some data.

### Install packages

if(!require(car)){install.packages("car")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(lsmeans)){install.packages("lsmeans")}

### Create some toy data

pond   = c(rep("Walden", 4), rep("Koi", 5), rep("Ness", 4))
weight = c(27,25,18,34,77,87,75,80,81,12,15,14,20)
fish   = data.frame(pond, weight)

Model and ANOVA:

Model = lm(weight ~ pond, 
           data = fish)
library(car)
Anova(Model)

Similarly, I would avoid post-hoc functions like TukeyHSD to do the mean separation after anova. This function is applicable only for balanced or mildly unbalanced data, and has other limitations.

Luckily there are the packages lsmeans and multcomp, which are far more broadly applicable.

So, in your example,

library(multcompView)
library(lsmeans)

leastsquare = lsmeans(Model,
                      pairwise ~ pond,
                      adjust = "tukey")

leastsquare

cld(leastsquare,
    alpha  = 0.05,
    adjust = "tukey")
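The lsmeans package has since been superseded by the emmeans package. As a rough sketch, assuming emmeans is installed and reusing the toy data from above, the estimated marginal means and Tukey-adjusted pairwise comparisons might look like this. The object names marginal and pairs_tab are only illustrative.

```r
library(emmeans)   # assumed to be installed: install.packages("emmeans")

# Same toy data and model as above
pond   <- c(rep("Walden", 4), rep("Koi", 5), rep("Ness", 4))
weight <- c(27,25,18,34,77,87,75,80,81,12,15,14,20)
fish   <- data.frame(pond = factor(pond), weight)
Model  <- lm(weight ~ pond, data = fish)

# Estimated marginal means for each pond
marginal <- emmeans(Model, ~ pond)

# All pairwise comparisons with a Tukey adjustment, as a data frame
pairs_tab <- as.data.frame(pairs(marginal, adjust = "tukey"))
pairs_tab
```

With three ponds there are three pairwise comparisons, one per row of pairs_tab.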

A relatively complete example of a one-way anova in R is here: R Handbook: one-way anova. (Caveat: I am the author of this page).

I know these seem to make using R more complicated, but you are better off learning the more flexible functions from the outset.

Here are a couple of books that may help a beginner in the analysis of experiments using R. (Caveat: they contain little statistical theory, and some statements that some statisticians may disagree with.) The Handbook of Biological Statistics uses SAS, but has links to the same examples and analyses in R. The other is Summary and Analysis of Extension Program Evaluation in R. (Caveat: I am the author of this second book.)

score 0 · Answer 2 · answered Jul 29 '17 at 13:20

The classical approach here is called a one-way ANOVA. If you have a data frame fish containing two columns, namely a weight column giving the weight of each fish and a pond column identifying the pond the fish is from (as a factor), then you can execute the ANOVA in R as follows:

anova(lm(weight ~ pond, fish))

In the output, the value for Pr(>F) gives you the P-value for testing the null hypothesis that the mean fish weight is the same in all 7 ponds.
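If you want to use that P-value in later code rather than read it off the printed table, it can be pulled out by row and column name. The sketch below uses made-up data with three ponds instead of seven (the values are borrowed from the toy example in the other answer); tab, f_stat and p_value are just illustrative names.

```r
# Toy data: three ponds with unequal numbers of fish
pond   <- c(rep("Walden", 4), rep("Koi", 5), rep("Ness", 4))
weight <- c(27,25,18,34,77,87,75,80,81,12,15,14,20)
fish   <- data.frame(pond = factor(pond), weight)

# The ANOVA table is a data frame with one row per term plus Residuals
tab <- anova(lm(weight ~ pond, data = fish))

f_stat  <- tab["pond", "F value"]   # F statistic for the pond effect
p_value <- tab["pond", "Pr(>F)"]    # P-value for the null of equal means

p_value
```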

The ANOVA procedure makes certain assumptions about the data. If these assumptions are violated, for instance if the fish weights are non-normally distributed (especially for small sample sizes) or if the variance of fish weight is substantially different in different ponds, then an alternative approach may be needed. You can find some pointers in the answers here: Alternatives to one-way ANOVA for heteroskedastic data.
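As a starting point, here is a sketch of some commonly used checks and alternatives from base R, again on made-up data with three ponds. These are not the only options, and the choice among them is a judgment call. Welch's one-way test does not assume equal variances; the Kruskal-Wallis test is rank-based and does not assume normality (strictly speaking, it compares distributions rather than means).

```r
# Toy data: three ponds with unequal numbers of fish
pond   <- c(rep("Walden", 4), rep("Koi", 5), rep("Ness", 4))
weight <- c(27,25,18,34,77,87,75,80,81,12,15,14,20)
fish   <- data.frame(pond = factor(pond), weight)

fit <- lm(weight ~ pond, data = fish)

# Checks on the assumptions
norm_check <- shapiro.test(residuals(fit))                # normality of residuals
var_check  <- bartlett.test(weight ~ pond, data = fish)   # equal variances

# Alternatives to the classical one-way ANOVA
welch <- oneway.test(weight ~ pond, data = fish, var.equal = FALSE)
kw    <- kruskal.test(weight ~ pond, data = fish)

c(welch = welch$p.value, kruskal_wallis = kw$p.value)
```

With samples this small, the assumption tests have little power, so plots of the residuals are often more informative than the test results alone.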

You might also consider creating a chart showing a confidence interval for each of the seven means. In particular if the ANOVA leads you to reject the null hypothesis then this could help in interpreting the result.
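One way to build such a chart with base R is sketched below, again using made-up data with three ponds. It computes a 95% t-interval for each pond from that pond's own standard deviation (not a pooled one), which is an assumption of this sketch rather than the only reasonable choice.

```r
# Toy data: three ponds with unequal numbers of fish
pond   <- c(rep("Walden", 4), rep("Koi", 5), rep("Ness", 4))
weight <- c(27,25,18,34,77,87,75,80,81,12,15,14,20)
fish   <- data.frame(pond = factor(pond), weight)

# Per-pond sample size, mean, and standard deviation
n <- tapply(fish$weight, fish$pond, length)
m <- tapply(fish$weight, fish$pond, mean)
s <- tapply(fish$weight, fish$pond, sd)

# Half-width of a 95% t-interval for each mean
half <- qt(0.975, df = n - 1) * s / sqrt(n)

ci <- data.frame(pond  = names(m),
                 mean  = as.numeric(m),
                 lower = as.numeric(m - half),
                 upper = as.numeric(m + half))
ci

# Means as points, intervals as error bars
x <- seq_along(ci$mean)
plot(x, ci$mean, pch = 19, xaxt = "n",
     xlim = c(0.5, length(x) + 0.5),
     ylim = range(ci$lower, ci$upper),
     xlab = "Pond", ylab = "Mean weight")
axis(1, at = x, labels = ci$pond)
arrows(x, ci$lower, x, ci$upper, angle = 90, code = 3, length = 0.05)
```

Intervals that are far apart suggest which ponds differ, although overlapping intervals do not by themselves show that two means are equal.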
