2

I am currently working on cross-country responses to a survey question. More specifically, the question asks 'Are you satisfied with your job?' and lets respondents select a number from 0 (strongly dissatisfied) to 4 (strongly satisfied). The question was asked to people in Germany, the United Kingdom, Spain, and Italy. The sample size in every country is 2000. I am interested in determining whether the responses are statistically different between countries. What is the best way to do so?

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
Matrix2020
  • 71
  • 4
  • 1
    You could present your data as a contingency table, then maybe simply a chi-square test – kjetil b halvorsen Feb 24 '20 at 15:07
  • Thank you for the reply! Possibly I wasn't clear enough: what I want to test is whether the mean response for the survey in Italy is statistically different from the mean response in Spain, etc. This entails that my null is mean Italy = mean Spain = mean UK = mean Germany. So I am not really sure whether the chi-square test is the best test in this case... – Matrix2020 Feb 24 '20 at 15:15
  • So maybe a multinomial logistic model will do? – kjetil b halvorsen Feb 24 '20 at 15:17
  • I think that it's too complicated for what I am trying to do. In sum, I am just trying to do a t-test, but instead of comparing 2 means I will compare 4, and I don't know how! – Matrix2020 Feb 24 '20 at 15:41
  • Why do you want a t-test, when what you have is ordinal data? Try then ordinal regression, not multinomial, as I said above. A t-test is **not** what you need. Can you show us the data? – kjetil b halvorsen Feb 24 '20 at 15:44
  • 1
    The survey asks 'Are you satisfied with your job?' and the respondent can choose between 4 options: strongly dissatisfied, dissatisfied, satisfied, strongly satisfied. Each option has a numerical value, going from 0 (strongly dissatisfied) to 4 (strongly satisfied). In each country 2000 people responded to the survey. The mean response in Italy was 1.54; in Spain it was 2.01; in the UK it was 2.39; in Germany it was 3.51. What I want to know is if the average responses in the 4 countries are significantly different from each other. So the null would be mean IT = mean SP = mean UK = mean Germany. – Matrix2020 Feb 24 '20 at 16:25
  • Then you are treating your ordinal scale as numerical – kjetil b halvorsen Feb 24 '20 at 16:43
  • 1
    How would you answer the question, though? Would doing a one-way ANOVA answer it? – Matrix2020 Feb 24 '20 at 17:12
  • 1
    Treating these values as numerical does not really make sense. Just because you *can* calculate a mean and run a statistical test does not mean you *should*. Also, this is a very subjective scale: who's to say that a 2 in the UK equals a 2 in Germany? – CFD Feb 24 '20 at 18:09
  • It has been argued (in the statistical literature) that treating these as numbers and comparing their means makes perfect sense. (Look up Lord's paper on "football numbers.") This can be problematic with small datasets or when the answers are all skewed toward one extreme, but here it's at least plausible that a simple ANOVA will do the job. In that case, because the variance of each group cannot exceed $((5-0)/2)^2,$ one need only glance at the reported means to determine the differences are significant and to put them into a clear order from smallest to largest. – whuber Feb 24 '20 at 19:48

3 Answers

2

Let's apply common sense and a little statistical understanding to cut through the complications.

In a comment you write

The mean response in Italy was 1.54 ; in Spain it was 2.01; in the UK it was 2.39; in Germany it was 3.51.

As a mathematical proposition, the variance in a group of numbers bounded by $0$ and $5$ with a mean of $\mu$ cannot exceed $(\mu-0)(5-\mu),$ whence the sampling variance of the mean of $n=2000$ such numbers cannot exceed

$$\frac{\mu(5-\mu)}{2000}.$$

Applying this formula to the given mean responses gives the corresponding set of maximal standard errors of approximately

$$0.052, 0.055, 0.056, 0.051.$$
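These bounds are quick to verify numerically. A sketch in Python (the means are those quoted from the question's comments; 5 is the conservative upper end of the scale discussed in the comments below):

```python
import math

# Means quoted from the question's comments: Italy, Spain, UK, Germany
means = [1.54, 2.01, 2.39, 3.51]
n = 2000

# Variance of values bounded by 0 and 5 with mean mu is at most mu*(5 - mu),
# so the standard error of each sample mean is at most sqrt(mu*(5 - mu)/n)
max_se = [math.sqrt(mu * (5 - mu) / n) for mu in means]
print([round(se, 3) for se in max_se])  # [0.052, 0.055, 0.056, 0.051]
```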

The smallest difference between any two successive means is $2.39 - 2.01 = 0.38,$ whereas the sampling standard deviation for that difference would be at most $\sqrt{0.055^2 + 0.056^2} = 0.078,$ making that difference exceed

$$\frac{0.38}{0.078} = 4.8$$

standard errors. The Z-scores for the other two successive differences are larger still.

Assuming an approximately Normal distribution of differences, that would produce a p-value less than one in a million. Let's adjust that conservatively by multiplying it by the number of possible between-group comparisons we might make, which is six (a Bonferroni correction). The result is still tiny. That's strong enough to conclude, without any more ado, that

(1) there are significant differences in mean survey responses and (2) every pairwise difference is significant, too.
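The arithmetic behind this conclusion can be sketched in a few lines (Python here; the 0.055 and 0.056 are the maximal standard errors derived above, and `math.erfc` supplies the Normal tail probability):

```python
import math

# Closest pair of successive means (Spain vs the UK) and the
# maximal standard errors of those two means
diff = 2.39 - 2.01
se_diff = math.sqrt(0.055**2 + 0.056**2)  # about 0.078
z = diff / se_diff                        # about 4.8

# Two-sided Normal p-value 2*(1 - Phi(z)), then a conservative
# Bonferroni factor of 6 for all pairwise comparisons of 4 groups
p = math.erfc(z / math.sqrt(2))
print(round(z, 1), p * 6 < 1e-4)  # 4.8 True
```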

Comments and discussion

If you're not convinced, you may extend this analysis to construct a lower bound on the between-group sum of squares and thereby obtain a lower bound on the one-way ANOVA F-ratio statistic. It will have an extremely small p-value.
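For instance, a lower bound on that F-ratio follows from the reported means and the same variance bound (a sketch, assuming $n = 2000$ per group and the 0-to-5 range used above):

```python
means = [1.54, 2.01, 2.39, 3.51]
n, k = 2000, 4
grand = sum(means) / k

# Between-group sum of squares follows directly from the reported means
ss_between = n * sum((m - grand) ** 2 for m in means)

# Within-group sum of squares is at most n*mu*(5 - mu) per group,
# by the same variance bound applied to each sample
ss_within_max = sum(n * m * (5 - m) for m in means)

f_lower = (ss_between / (k - 1)) / (ss_within_max / (k * n - k))
print(f_lower > 200)  # True: far beyond any critical value on 3 and 7996 df
```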

We might worry a little about the implicit use of ANOVA underlying this reasoning. However, none of these four sample means is close enough to the extremes of $0$ and $5$ to be overly concerned about the effects of possible skewness in the group distributions, and the bounded responses guarantee that high kurtosis will not be an issue. The usual Normal-theory distribution calculations are going to be quite accurate.

There are legitimate questions surrounding such a result. Did the surveys really measure the same thing if they were presented in different languages? How does one interpret a difference of, say, $0.38:$ what does that say about the responses? But if we accept that each of the four surveys was accurately conducted on a truly random sample of a well-defined population, then it is incontrovertible that the differences observed are extremely unlikely to be due solely to the random selection of subjects. That's what "statistically significant" means.

whuber
  • 281,159
  • 54
  • 637
  • 1,101
  • Wouldn't the upper bound be $4$, according to the question by OP? – COOLSerdash Feb 24 '20 at 20:41
  • 1
    @COOL I had to read the information carefully. Despite giving a range of 0 to 4 in the question (which one might think is *five* options), a comment states there are only four options (and names them). I conservatively chose the largest number that could possibly apply, figuring the correct value for the range was likely either 4 or even 3 (for a 1 - 4 scale). – whuber Feb 24 '20 at 20:58
  • @whuber: how to verify the results using built in function: `prop.test()`? – Maximilian Mar 02 '20 at 17:25
  • @Max Because these data aren't proportions, it's difficult to see how `prop.test` could produce useful results even if it could be applied. – whuber Mar 02 '20 at 17:30
  • Yes, true. I was wondering because `prop.test` should have a z-test built in, I believe. I'm struggling with this question; would the same approach apply here? https://stats.stackexchange.com/questions/451123/test-the-heterogeneity-among-groups – Maximilian Mar 02 '20 at 17:32
  • Maybe rather a follow-up question: wouldn't it be more feasible to use the chi-squared distribution, by taking `sqrt()` of the `z-score` statistic and using `qnorm(1-0.05/2)`? – Maximilian Mar 02 '20 at 21:20
  • @Max I don't follow. In general a z-score can be negative as well as positive, so taking square roots looks like a non-starter. – whuber Mar 02 '20 at 21:27
  • I'm sorry, I just mixed up the z-score with the z statistic. It seems I have a lot of reviewing to do on the topic I'm trying to solve. – Maximilian Mar 02 '20 at 21:31
  • @Max A z statistic typically is a z-score, so the same issue pertains. – whuber Mar 02 '20 at 22:37
  • I actually meant to square the z-score, `(z-score)^2`, not take `sqrt()`; that would follow approximately a chi-squared distribution, and one could use this statistic to define the `p-value`, especially in smaller samples. – Maximilian Mar 03 '20 at 13:54
1

You could use an ordinal logistic regression; one example is here: Alternatives to one-way ANOVA for heteroskedastic data

I will make an example with some simulated data, using the package MASS in R. I will simulate data from the null.

N <- 2000
p <- c(0.1, 0.2, 0.3, 0.4) # We simulate from the NULL
set.seed(7*11*13)
country1 <- sample(1:4, N, TRUE, p)
country2 <- sample(1:4, N, TRUE, p)
country3 <- sample(1:4, N, TRUE, p)
country4 <- sample(1:4, N, TRUE, p)

# Amass the variables in format for regression:

Y <- c(country1, country2, country3, country4)
Country <- factor(rep(paste("Country", 1:4, sep=""), rep(N, 4)))

simdata <- data.frame(Y=as.ordered(Y), Country)

mod.polr <- MASS::polr(Y  ~ Country, data=simdata, Hess=TRUE)
mod.0 <- MASS::polr(Y  ~  1, data=simdata, Hess=TRUE)

summary(mod.polr)
Call:
MASS::polr(formula = Y ~ Country, data = simdata, Hess = TRUE)

Coefficients:
                   Value Std. Error t value
CountryCountry2 -0.03197    0.05767 -0.5543
CountryCountry3  0.03639    0.05790  0.6285
CountryCountry4 -0.03094    0.05765 -0.5367

Intercepts:
    Value    Std. Error t value 
1|2  -2.1959   0.0513   -42.8244
2|3  -0.8974   0.0431   -20.8418
3|4   0.4181   0.0421     9.9349

Residual Deviance: 20437.80 
AIC: 20449.80 

anova(mod.0, mod.polr)
Likelihood ratio tests of ordinal regression models

Response: Y
    Model Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
1       1      7997   20439.67                                
2 Country      7994   20437.80 1 vs 2     3 1.867304 0.6003995
kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
0

You might try the Kruskal-Wallis test. This is similar to an ANOVA but considers ranks instead of the actual values. If you find a significant difference, you can then look at pairwise Mann-Whitney U tests to determine which countries are significantly different from one another.
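A minimal sketch of that workflow, assuming SciPy and made-up responses (the actual survey data are not available, so the two samples and their probabilities below are purely illustrative, and only two of the four countries are shown):

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical responses on the 0-4 scale; probabilities are invented
italy   = rng.choice(5, size=2000, p=[0.25, 0.30, 0.25, 0.12, 0.08])
germany = rng.choice(5, size=2000, p=[0.05, 0.08, 0.20, 0.35, 0.32])

# Omnibus rank-based test (pass all four countries in practice)
H, p = kruskal(italy, germany)

# Pairwise follow-up; correct for multiplicity (e.g. Bonferroni)
# when testing all 6 country pairs
U, p_pair = mannwhitneyu(italy, germany)
print(p < 0.05, p_pair < 0.05)  # True True
```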

CFD
  • 406
  • 4
  • 8