Mann-Whitney or t-test to compare age and expenditure after clustering

Question

I have carried out cluster analysis and now want to compare means between variables in different clusters. The variables in question are age and expenditure in millions of dollars.

The age variable does not follow a normal distribution: as a result, I was considering a Mann-Whitney test. The expenditure in millions of dollars fails the assumption of equality of variances.

Having stated this, although all tests seem to suggest that age does not follow a normal distribution, I am not quite sure about the extent of the departure.

[Figure: histogram of age in cluster 1]

[Figure: histogram of age in cluster 2]

[Figure: box plot of age in cluster 1]

[Figure: box plot of age in cluster 2]

It has been suggested to use the Mann-Whitney test in this case, given that the assumption of normality is "not met".

1. Does Mann-Whitney work fine with continuous data? This link seems to suggest it does. Would SPSS automatically convert the values into ranks?
2. Zimmerman argues that the t-test should work fine because it is scarcely affected by non-normality of the population!
3. Sheskin (2007) suggests using a t-test anyway, but with a more conservative approach (e.g. critical values of t(0.01) instead of t(0.05)).

How can I resolve this problem?

Links to previous questions that might be interesting: http://stats.stackexchange.com/questions/38967/how-robust-is-the-independent-samples-t-test-when-the-distributions-of-the-sampl http://stats.stackexchange.com/questions/2541/is-there-a-reference-that-suggest-using-30-as-a-large-enough-sample-size http://stats.stackexchange.com/questions/53053/mann-whitney-or-two-tailed-t-test http://stats.stackexchange.com/questions/15664/how-to-test-for-differences-between-two-group-means-when-the-data-is-not-normall — Gala, Jul 11 '13 at 11:16
Zimmerman is quite mistaken except possibly in the special case where $\sigma$ is a known constant. — Frank Harrell, Jul 11 '13 at 12:36
This is just the introduction, certainly not something Zimmerman “argues”… — Gala, Jul 11 '13 at 13:55
@IdiotAbroad That's not the way you should read scientific papers, you should quote it for its main result/argument. — Gala, Jul 11 '13 at 15:10
I have read the whole paper, what he mentioned there is crucial, and is well referenced. — Cesare Camestre, Jul 11 '13 at 15:58

Gala · Accepted Answer · 2013-07-11T11:51:12.463

7

There are many questions on this already; just have a look using the search function. Some details of your question, however, seem to warrant specific remarks:

1. The Mann-Whitney U test works fine with continuous data; I would even say it works best with them because you avoid ties.
2. The t-test has indeed been found robust to some violations of its assumptions but not to all of them, especially if they happen concurrently. Larger sample sizes help to relax these constraints. You can find a lot of information on this elsewhere on this site.
3. Point 3 is surprising. For one, the whole point of a test is to offer some guarantees regarding the error level, provided the assumptions are met. If you can't achieve that, picking an arbitrary “conservative” level just muddles the situation further. Better to give up the test entirely. Furthermore, one common problem with the t-test and non-normal data is lack of power. A lower threshold just makes this problem worse. All this would seem to make the result very difficult to interpret one way or the other.
4. I would generally be skeptical of tests between groups that are not defined a priori, certainly if the variables you are comparing were also used for the cluster analysis. All this sounds a bit too exploratory for tests to be meaningful. You might just as well plot the data and comment on what you see, understanding that you are just providing a tentative interpretation.
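
The concern in the last point can be illustrated with a small simulation. This is a minimal sketch, assuming a single homogeneous normal variable (the `height` values are simulated, not the asker's data) that is split by `kmeans` and then tested on the very variable used for the split:

```r
set.seed(1)
# One homogeneous population: there are no real subgroups here
height <- rnorm(600, mean = 170, sd = 10)

# Let k-means split the sample into two "clusters" on that same variable
cl <- kmeans(height, centers = 2, nstart = 10)$cluster

# Test the clustering variable between the clusters it produced
res <- t.test(height ~ factor(cl))
res$p.value  # extremely small, although the population is homogeneous
```

The test "finds" a difference because the groups were constructed to differ, which is why such a p-value says little about real structure.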

Practical recommendations in light of your comments:

1. Mann-Whitney is perfectly fine, but do realize it is not a test of the difference in means. It might or might not be a problem for you, but the most important point is that you cannot just think of this problem as “normal data => t-test, non-normal => Mann-Whitney U”. There is a lot more going on (check the links I added as a comment to the question for more on that).
2. The t-test might be fine. I already wrote that a hard-and-fast threshold would be very questionable, and it's still impossible to give advice based only on the notion that the data are “non-normal”. Whether it matters or not depends on the specific ways in which they are non-normal.
3. 300 observations is already quite comfortable. Do run both tests, possibly some other alternatives as well (permutation test, bootstrap test of the median or another robust estimator of location if that makes sense…). Also inspect the distribution and the residuals. You might very well find that all this points to broadly similar conclusions, and you would not need to worry about this further.
4. You said that the two variables are not the “main” predictors in the cluster analysis, but are they in the analysis at all? I would still not be fully convinced of the value of the whole approach, but you should at least keep them entirely separate, I think.
5. Don't overestimate tests. Since you are happy using an exploratory method like cluster analysis, do also plot the data and interpret that in any case.
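
Running both tests side by side could look like the sketch below. The data are simulated stand-ins (the names `age` and `cluster`, the group means, and the spreads are assumptions for illustration), with about 300 observations per cluster as mentioned in the comments:

```r
set.seed(42)
# Simulated stand-in for the two clusters (300 observations each)
age <- c(round(rnorm(300, mean = 40, sd = 12)),
         round(rnorm(300, mean = 44, sd = 8)))
cluster <- factor(rep(c("1", "2"), each = 300))

welch <- t.test(age ~ cluster)       # Welch t-test: no equal-variance assumption
mw    <- wilcox.test(age ~ cluster)  # Mann-Whitney U (normal approximation here)

c(welch = welch$p.value, mann_whitney = mw$p.value)

# Numerical summary per group; boxplot(age ~ cluster) shows the same graphically
tapply(age, cluster, summary)
```

R's `t.test` uses the Welch version by default, which may also be relevant for the expenditure variable, where equality of variances was reported to fail.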

edited Jul 11 '13 at 11:51

answered Jul 11 '13 at 10:49

Gala

This answer and mine were being written simultaneously. They look entirely consistent to me. – Nick Cox Jul 11 '13 at 10:51
All three of us were writing at the same time! And all three answers are consistent and somewhat complimentary. – Peter Flom Jul 11 '13 at 10:55
I did conduct a search, but specific answers to my questions were not provided. Point 2, re t-test assumptions: I am always confused by what is meant by larger samples; each sample here has around 300 items in it. Point 3 was a quote from the Handbook of Parametric and Nonparametric Statistical Procedures by Sheskin. – Cesare Camestre Jul 11 '13 at 10:58
@NickCox Yes, indeed, but I forgot to mention the fact that Mann-Whitney U compares different hypotheses than the t-test and is not a drop-in “non-parametric” replacement for it as it is sometimes presented, an important point as well. – Gala Jul 11 '13 at 10:59
@Peter Flom. Glad you agree, but you mean complementary... Insert emoticon if desired. – Nick Cox Jul 11 '13 at 11:02
@IdiotAbroad What I mean is that the larger it is, the “nicer” the sampling distribution of the mean even if the distribution of the data is non-normal. It's probably impossible to provide a hard-and-fast threshold which is why you will find a lot of these confusing noncommittal recommendations. – Gala Jul 11 '13 at 11:06
I do understand your points here. Let me clarify that these are not the main predictors in the cluster analysis. A priori I would expect differences in the means of these two sub-samples (based on literature). Now, given that age does not follow a normal distribution, what is the suggestion here? – Cesare Camestre Jul 11 '13 at 11:12
The practical recommendations were what I was after. As to your comment that the t-test might still be fine, I posted a q-norm plot of age, if that helps in any way to give some more insight. – Cesare Camestre Jul 11 '13 at 11:35
Not sure what to make of this last plot. This variable seems in fact discrete but not so bad, considering. Your last edit suggests that by “non-normal”, you mean you rejected normality in some test; I don't think [it matters in the least](http://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless). In any case, what I would look at are density plots, boxplots or stripcharts *of each group/cluster*, looking for differences in the shape or variance of the distribution. – Gala Jul 11 '13 at 11:46
Gael, I updated the posts and posted some of the plots you suggested. Of concern is probably the box plot of age in cluster 1. – Cesare Camestre Jul 11 '13 at 12:12
@IdiotAbroad I implicitly suggested *you* look at them… I can't just make the decision for you, without knowing about the project, over some Internet Q&A site. – Gala Jul 11 '13 at 12:15
It's not a matter of making a decision. I just want to gather views as to whether I should use the Mann-Whitney or the t-test. – Cesare Camestre Jul 11 '13 at 12:40

Nick Cox · Answer 2 · 2013-07-11T10:55:14.637

1. What is the problem that you are trying to solve? It is the job of cluster analysis to find clusters. Testing whether those clusters really exist post hoc is, in my view, somewhere between meaningless and dubious. To strip the problem down to the bare minimum, imagine one variable that is a continuum, say people's heights, and you group height values into two clusters. Now do you want to test that short people and tall people have different means? Now explain why the problem of three or more clusters and/or two or more variables makes the problem different.
2. You describe the data as being age and expenditure. If that's so, then a scatter plot showing the data will show clusters clearly if they exist and a continuum of variation if they don't. You can be flexible about scales (e.g. logging expenditure). What cluster analysis might add to this of real scientific value is an open question. Cluster analysis divides statistical people into two clusters, those who think it a central technique and those who think it oversold snake oil or worse.
3. As with #2, if you are clustering on two variables, then each cluster has two means, mean age and mean expenditure. Even if you have a good answer to #1, univariate tests don't capture the differences between clusters.
4. If #2 and #3 are wrong, and you have other variables, then the problem is not as you stated and you need to explain why.
5. If you are interested in means, then Mann-Whitney is not an alternative method of comparing means.
6. Mann-Whitney uses ranks, so it doesn't even know whether the original data are discrete or continuous. In practice, discrete data are more likely to show ties and that can have a secondary effect on the test. Whether your software adjusts for ties depends on what it is.
7. How many clusters do you have anyway? Is it just two?
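
The point that Mann-Whitney only sees ranks can be checked directly. This is a small sketch with simulated positive, skewed values standing in for expenditure (the lognormal parameters are assumptions): a monotone transformation such as the logarithm leaves the test unchanged, whereas the t-test depends on the scale.

```r
set.seed(7)
# Simulated positive, skewed values standing in for expenditure
x <- rlnorm(300, meanlog = 1.0, sdlog = 0.8)
y <- rlnorm(300, meanlog = 1.2, sdlog = 0.8)

p_raw <- wilcox.test(x, y)$p.value
p_log <- wilcox.test(log(x), log(y))$p.value  # monotone transform keeps the ranks

all.equal(p_raw, p_log)          # TRUE: same ranks, same test

t.test(x, y)$p.value             # the t-test, by contrast,
t.test(log(x), log(y))$p.value   # changes with the scale
```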

I can't advise on SPSS. I often advise against SPSS, but that's a prejudice.

Please give a reference for Sheskin (2007).

1. Having identified the clusters.. I am working further on the results obtained. I am not testing the most important predictors of the cluster analysis. 2. As to the Mann Whitney as not being an alternative in comparing means, I followed Field (2009), in a chapter on comparison of means. He is suggesting using Mann Whitney as an alternative when the normal distribution assumption does not hold. As to your seventh question, yes it is two clusters. — Cesare Camestre, Jul 11 '13 at 11:01
You seem to have thought that I'm clustering on the basis of only these variables, which is not the case. These are the low predictors. — Cesare Camestre, Jul 11 '13 at 11:17
You said nothing about any other variables, although as I stated I did guess that they might exist. I don't know what "low predictors" are, but this information doesn't solve any of the difficulties raised by your question. — Nick Cox, Jul 11 '13 at 11:23
It's unrelated to the question. Let's forget about the clusters for a while. Basically I have two samples. I want to test whether the mean in one sample differs from the mean in the other. — Cesare Camestre, Jul 11 '13 at 11:27
I've now posted a q-norm plot for age. The departure from a normal distribution might not be of much concern, but the normality diagnostic tests still fail to suggest a normal distribution. — Cesare Camestre, Jul 11 '13 at 11:31
Sorry, but I can't buy the premise that the previous cluster analysis is irrelevant. It makes the whole idea of what you are doing moot. Alternatively, you are pushing the thread in the direction of when t, when Mann-Whitney on which there are already numerous threads, which you should read. — Nick Cox, Jul 11 '13 at 11:36

score 3 · Answer 3 · answered Jul 11 '13 at 10:54

First, realize that the Mann-Whitney U test and the t-test test different things: the t-test is a test of differences in means, while the U test is a test of entire distributions. It is possible for the means to be the same and the distributions different (although, as far as I can tell, this would only happen for odd distributions). For example:

```r
set.seed(12345)
# x: standard normal sample
x <- rnorm(1000)
# y: mixture of a lognormal half and a very wide uniform half
y <- c(rlnorm(500), runif(500, -100, 100))
wilcox.test(x, y)  # compares the two distributions via ranks
t.test(x, y)       # compares the two means (Welch by default)
```

Here the Wilcoxon test rejects at $p < 2.2 \times 10^{-16}$ and the t-test does not reject at all.
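
In the example above the population means are not exactly equal (0 for `x` versus about 0.82 for `y`, since the lognormal half has mean $e^{0.5}$); the t-test simply cannot see the difference through the large variance of `y`. A variant in which the two population means are equal by construction may make the same point more cleanly. This is a sketch with assumed distributions (normal with mean 1 versus exponential with mean 1), not part of the original example:

```r
set.seed(2013)
n <- 5000
x <- rnorm(n, mean = 1, sd = 1)  # symmetric, population mean 1
y <- rexp(n, rate = 1)           # right-skewed, population mean 1

p_w <- wilcox.test(x, y)$p.value  # small: P(X < Y) is about 0.46, not 0.5
p_t <- t.test(x, y)$p.value       # tests the means, which are equal in the population
c(wilcoxon = p_w, t = p_t)
```

Because the population means are equal, the t-test should reject only at roughly its nominal rate over repeated samples, while the Mann-Whitney test picks up the difference in shape.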

Second, while the robustness of the t as a test of means depends on exactly how the data are non-normal, sample size, variances and so on, the fact that it is a test of means remains. If age is not normally distributed you may not want to test the means.

answered Jul 11 '13 at 10:54

Peter Flom

The question no post is helping me to establish is the "how non normal". It seems too judgemental to me. – Cesare Camestre Jul 11 '13 at 12:42
"judgemental" is probably not the right word for what you want to say. I expect "subjective" comes closer to what you want. – Maarten Buis Jul 11 '13 at 13:04
Correct. Subjectivity! – Cesare Camestre Jul 11 '13 at 13:06
There can be no "rule". One reason is that there are just too many ways in which a distribution can deviate from Gaussianity, and it depends on these circumstances whether or not that is a problem. – Maarten Buis Jul 11 '13 at 13:13
And I have been trying to explain the circumstances, but no one seems to be pointing me in the right direction. Not even books and articles, in which there seems to be no consensus on the use of the t-test to compare means in a "non-normal" situation... and is it non-normal in the end? – Cesare Camestre Jul 11 '13 at 13:17
What is the right direction? Many of the comments here imply that you are looking in the wrong direction here in several senses. Sorry if that's unwelcome, but no one is going to be less than candid about the difficulties. – Nick Cox Jul 11 '13 at 13:33
I beg to differ, many comments were useful, and I did mark some of them as being so. They just leave some questions unanswered though. – Cesare Camestre Jul 11 '13 at 13:43
The circumstances are necessarily more rich than you can describe on an internet forum, so however hard you try it will never be enough for us to make a decision for you. In the end this can only be your decision and your decision alone. – Maarten Buis Jul 11 '13 at 13:43
Certainly, maarten, I understand that point. I did get some useful answers to my posts here, but not specifically on the normality issue. – Cesare Camestre Jul 11 '13 at 13:46
That is because that answer does not exist in general; it is necessarily a judgement call. In situations like these I like to run simulations. That is a great way to develop an intuition for the problem, and you can use that to get an answer that is tailor made for your data and problem. – Maarten Buis Jul 11 '13 at 13:52
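
A simulation of the kind suggested in the last comment could be sketched as follows. The gamma distribution, its parameters, and the number of replications are assumptions chosen for illustration; a distribution resembling one's own data would replace them. Both groups are drawn from the same skewed distribution, so any rejection is a false positive:

```r
set.seed(123)
n_sim <- 2000   # number of simulated experiments
n     <- 300    # observations per group, roughly as reported in the comments

# Both groups come from the same skewed distribution, so the null is true
p_t <- replicate(n_sim, {
  g1 <- rgamma(n, shape = 2, rate = 0.05)
  g2 <- rgamma(n, shape = 2, rate = 0.05)
  t.test(g1, g2)$p.value
})

# Empirical type I error rate at the nominal 5% level
type1 <- mean(p_t < 0.05)
type1
```

If the empirical rate stays close to 0.05 for shapes like those in the data, that is some reassurance about the t-test in that setting; the same loop can be adapted to compare power against Mann-Whitney under a chosen alternative.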

Glen_b · Answer 4 · 2013-07-11T15:16:21.993

3

The derivation of the Mann Whitney assumes continuous data. When you have heavier-tailed than normal data, it's also typically more powerful than the t-test; if you assume only location shift alternatives, it's a test of difference in means (along with any other reasonable location measure); if that doesn't hold, it's testing something else.

That said, the t-test can tolerate moderate skewness and heavy-tailedness (though in the latter case your actual significance levels will tend to be lower than the nominal $\alpha$).

There's also the possibility of a permutation test rather than either of the choices you mention - it would allow you to test a difference in means and have it be valid when the assumptions of the t-test are not satisfied.
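
A permutation test of the difference in means can be written in a few lines of base R. This is a minimal sketch on simulated stand-in data (the gamma parameters and the 5000 reshuffles are assumptions); it relies on the group labels being exchangeable under the null hypothesis:

```r
set.seed(99)
# Simulated stand-in data for the two groups
g1 <- rgamma(300, shape = 2, rate = 0.5)
g2 <- rgamma(300, shape = 2, rate = 0.4)

obs    <- mean(g1) - mean(g2)   # observed difference in means
pooled <- c(g1, g2)
n1     <- length(g1)

# Reshuffle the group labels many times and recompute the difference
perm <- replicate(5000, {
  idx <- sample(length(pooled), n1)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided Monte Carlo p-value (the +1 counts the observed arrangement)
p_perm <- (sum(abs(perm) >= abs(obs)) + 1) / (length(perm) + 1)
p_perm
```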

Another possibility if you think that some exponential family distribution might suit better (such as a gamma distribution) would be to fit a GLM with a group factor representing the groups whose means you are comparing. An identity link will even give you a direct estimate of the difference in means.
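
The GLM route might look like the following sketch, assuming a gamma distribution is plausible for expenditure. The data are simulated (the group means of 10 and 14 and the shape parameter are assumptions), and `cluster` is the two-level group factor:

```r
set.seed(5)
# Simulated stand-in for expenditure (positive, right-skewed) in two clusters
cluster     <- factor(rep(c("1", "2"), each = 300))
mu          <- ifelse(cluster == "1", 10, 14)         # assumed group means
expenditure <- rgamma(600, shape = 4, rate = 4 / mu)  # gamma with mean mu

fit <- glm(expenditure ~ cluster, family = Gamma(link = "identity"))

# With an identity link the cluster coefficient is the difference in means
coef(summary(fit))["cluster2", ]
```

With a single two-level factor the estimated coefficient equals the difference between the two sample means; what the gamma model changes is the standard error, which reflects a variance that grows with the mean.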

edited Jul 11 '13 at 15:16

answered Jul 11 '13 at 14:58

Glen_b

Can you clarify your phrase "if you assume only location shift alternatives" @Glen_b – Cesare Camestre Jul 11 '13 at 16:02
"t-test can tolerate moderate skewness and heavy-tailedness (though in the latter case your actual significance levels will tend to be lower than the nominal α)", any reference for that and limits for "heavy tailedness" – Cesare Camestre Jul 11 '13 at 16:06
Consider a null where (possibly combined with the assumptions) two distributions are identical. Consider an alternative where they're identical, except the distribution of one is 'shifted along' by some nonzero amount ($F_X(t) = F_Y(t+\delta)$, where $\delta \neq 0$, say). That alternative describes a location-shift. https://en.wikipedia.org/wiki/Location_parameter – Glen_b Jul 11 '13 at 16:08
Is there some test I can run to test if the distributions are identical.. except for this location shift? how can this be determined? – Cesare Camestre Jul 11 '13 at 16:12
We're discussing an *assumption*. **IF** you make the assumption, then it's also a test of means. If you fail to make the assumption, it won't be. – Glen_b Jul 11 '13 at 16:18
Right, how can I ascertain whether I'm reasonably correct in making that assumption? – Cesare Camestre Jul 11 '13 at 16:44
Well, you could assess the reasonableness of the assumption visually - "does it look like one sample's distribution is a shifted version of the other or doesn't it?" - But if you have no particular reason to think it should be a location shift, you might also question whether a test of location (whether via means or anything else) is particularly meaningful anyway. Which really comes down to asking yourself "what is it I actually want to find out?" – Glen_b Jul 12 '13 at 02:52
