
I know I can use an ANOVA (+ post-hoc tests) to assess the equality (or inequality) of three group means:

  • (i.e., H0: all 3 means are equal; HA: at least 1 of the 3 means differs).

However, how do I determine which of 2 group means is more similar to a third group's mean?

Obviously I could do this using arithmetic by comparing the means of each group

  • i.e., something like: |meanA - meanC| < |meanB - meanC|,

but I'm looking for a formal statistical test to answer this question.

Update:

A formal hypothesis might look something like:

  • H0: |meanA - meanC| = |meanB - meanC|

  • HA: |meanA - meanC| > |meanB - meanC|

(However, the hypothesis doesn't necessarily need to take that mathematical form.)

EXAMPLE:

Are the average heights of cherry trees or maple trees closer to the average height of apple trees?

– theforestecologist
  • ANOVA is not used to show *equality* but rather *inequality* of group means. As such, the question is not easy to follow. – Michael M Sep 26 '19 at 19:27
  • One suggestion I have is the Kullback–Leibler divergence, but it will depend immensely on what you consider similarity to mean. Please elaborate on what you want to be similar. Means? Means and variances? Medians and skewness? Entire distributions? – Dave Sep 26 '19 at 19:27
  • @Dave the implicit sense of "similar" in an ANOVA context *must* be in terms of differences of means. Like Michael M, though, I find it hard to follow this question because it is predicated on a misinterpretation of ANOVA and also because "which of two groups is similar to a third" is ambiguous: exactly what hypothesis is being proposed here? – whuber Sep 26 '19 at 19:49
  • @whuber I took the ANOVA comment as an analogy, not that means are the only way that groups 1 and 2 could differ from group 3. Anyway, yes, we definitely need additional clarification about what comparison is interesting. However, what would you think about comparing KL divergences if the question isn't just about similarity of means but of the distributions overall? – Dave Sep 26 '19 at 19:56
  • @Dave I have a hard time with KL divergences because--unless I haven't correctly apprehended your proposal--first you need to perform some kind of continuous density estimation based on the data and, without huge amounts of data, I believe the resulting estimates of KL divergences to be unstable and sensitive to the choice of density estimator. I am more inclined to explore with a client *which aspect(s) of the distributions are important for their study* so we can choose (or develop) procedures suitable for comparing (and discriminating) those particular features. – whuber Sep 26 '19 at 20:03
  • If you find a scale on which variability is about equal, your question may well answer itself if you look at the distributions (meaning, above all else, plot the data). I don't know how serious cherry, apple and maple tree heights are as an example, but it's the only one in sight. I would expect mild skewness at least of each distribution, and heteroscedasticity collectively, but would also expect some transformation between square root and logarithm to work quite well in making broad contrasts between distributions as clear as possible. – Nick Cox Sep 30 '19 at 14:11

3 Answers


If you randomly draw one individual from each group, you can calculate the value |A - C| - |B - C|. If you repeat this many times (sampling with replacement), you will generate a distribution of |A - C| - |B - C|.

This distribution captures the information you are interested in. If its mean value is < 0, A is closer on average to C; if its mean value is > 0, B is closer on average to C. You can also calculate confidence intervals and p-values. The p-value corresponds to the proportion of the distribution that falls on the opposite side of zero from the mean.
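
A minimal sketch of this resampling procedure in Python/NumPy, assuming `cherry`, `maple`, and `apple` are arrays of observed heights (the data below are simulated purely to make the snippet runnable):

```python
import numpy as np

rng = np.random.default_rng(42)

def resample_distance_gap(a, b, c, n_draws=10_000):
    """Draw one individual from each group (with replacement) and
    return the resampled distribution of |A - C| - |B - C|."""
    a_draw = rng.choice(a, size=n_draws, replace=True)
    b_draw = rng.choice(b, size=n_draws, replace=True)
    c_draw = rng.choice(c, size=n_draws, replace=True)
    return np.abs(a_draw - c_draw) - np.abs(b_draw - c_draw)

# Simulated tree heights (metres), for illustration only
cherry = rng.normal(9, 2, size=50)
maple = rng.normal(15, 3, size=50)
apple = rng.normal(4, 1, size=50)

gap = resample_distance_gap(cherry, maple, apple)
print("mean gap:", gap.mean())                        # < 0: cherry closer to apple
print("95% interval:", np.percentile(gap, [2.5, 97.5]))
print("fraction of draws > 0:", (gap > 0).mean())
```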

– mkt

Some ideas: I will write a formal model in ANOVA style using normal assumptions, but that part can surely be relaxed. So let $Y_{ij}$ be independent observations, $Y_{ij} \sim \mathcal{N}(\mu_j, \sigma^2)$ for $j = A, B, C$ and $i = 1, \dotsc, n_j$. I will measure "similarity" of the distributions by the absolute value of the difference of means, so define
$$ \delta_A = |\mu_A - \mu_C|, \qquad \delta_B = |\mu_B - \mu_C| $$
and then our focal parameter of interest as
$$ \Delta = \delta_A - \delta_B. $$

For a fast solution (which also avoids the normality assumption), I would use bootstrapping to construct a confidence interval for $\Delta$, but that assumes a reasonably large sample size. A more principled (and maybe better) solution would be to construct the profile likelihood function for $\Delta$ and get a confidence interval from that. See, for instance, Constructing confidence intervals based on profile likelihood.
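
As a concrete illustration of the fast bootstrap route, here is a minimal Python/NumPy sketch that resamples each group independently (case resampling) to get a percentile confidence interval for $\Delta$; the function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_delta_ci(a, b, c, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for Delta = |mean_A - mean_C| - |mean_B - mean_C|,
    resampling each group independently with replacement."""
    deltas = np.empty(n_boot)
    for k in range(n_boot):
        a_s = rng.choice(a, size=len(a), replace=True)
        b_s = rng.choice(b, size=len(b), replace=True)
        c_s = rng.choice(c, size=len(c), replace=True)
        deltas[k] = abs(a_s.mean() - c_s.mean()) - abs(b_s.mean() - c_s.mean())
    return np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

An interval lying entirely below zero supports group A's mean being closer to C's; one lying entirely above zero supports group B's.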

– kjetil b halvorsen

This is a nice question. If you are using STATA, things are simple. You first run your AN(C)OVA including your factor variable (of 3 groups, say A, B, C) and any covariates. Then, you calculate contrasts of margins (with say C as the reference group), i.e. A-C and B-C with respective p-values (unadjusted or adjusted for multiple testing). This is implemented with various commands: contrast; margins (factor v.), contrast; margins r.(factor v). Finally, there is the user-defined command 'mlincom' which allows for comparisons (i.e. contrasts) of previously calculated marginal contrasts, e.g. mlincom 2-1. As simple as that. Look up all commands in STATA help files. Also have a look at https://xiangao.netlify.app/2019/04/22/marginal-effects-in-margins/. I hope this was helpful.