How to test for differences between two group means when the data is not normally distributed?

Question

I'll leave out the biological details and experiments and describe just the problem at hand and what I have done statistically. I would like to know whether it is right and, if not, how to proceed. If the data (or my explanation) isn't clear enough, I'll edit the question to explain it better.

Suppose I have two groups of observations, X and Y, with sizes $N_x=215$ and $N_y=40$. I would like to know whether the means of these two groups are equal. My questions are:

1. If the assumptions are satisfied, is it appropriate to use a parametric two-sample t-test here? I ask because, as I understand it, the t-test is usually applied when the sample size is small.
2. I plotted histograms of both X and Y, and neither was normally distributed, which violates one of the assumptions of a two-sample t-test. My confusion is that I treat them as two populations, which is why I checked for normality, yet I am about to perform a two-SAMPLE t-test. Is this right?
3. From the central limit theorem, I understand that if you sample repeatedly (with or without replacement, depending on your population size) and compute the mean of each sample, those means will be approximately normally distributed, and the mean of these random variables will be a good estimate of the population mean. So I did this on both X and Y, 1000 times each, and assigned a random variable to the mean of each sample. The resulting histograms looked very close to normal. The means for X and Y were 4.2 and 15.8 (the same as the population means to within ±0.15), and the variances were 0.95 and 12.11.
4. I performed a t-test on these two sets of means (1000 data points each), assuming unequal variances because they are very different (0.95 and 12.11). The null hypothesis was rejected.
5. Does this make sense at all? Is this a correct and meaningful approach, would a two-sample z-test be sufficient, or is it totally wrong?
6. I also performed a non-parametric Wilcoxon test on the original X and Y just to be sure, and the null hypothesis was convincingly rejected there as well. If my previous method was utterly wrong, I suppose a non-parametric test is a good choice, except perhaps for statistical power?

In both cases, the means were significantly different. However, I would like to know if either or both the approaches are faulty/totally wrong and if so, what is the alternative?

score 24 · Accepted Answer · answered Sep 16 '11 at 16:38

The idea that the t-test is only for small samples is a historical holdover. Yes, it was originally developed for small samples, but nothing in the theory distinguishes small from large. In the days before computers were common for doing statistics, t-tables often only went up to around 30 degrees of freedom, and the normal distribution was used beyond that as a close approximation of the t distribution. This was for convenience, to keep the t-table's size reasonable. Now, with computers, we can do t-tests for any sample size (though for very large samples the difference between the results of a z-test and a t-test is very small). The main idea is to use a t-test when using the sample to estimate the standard deviations, and the z-test if the population standard deviations are known (very rare).
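To illustrate how quickly the t distribution approaches the normal, here is a small sketch comparing two-sided 5% critical values. It assumes SciPy is installed; the degrees of freedom are arbitrary illustrative choices (253 is simply $215 + 40 - 2$, the pooled-variance value for the sample sizes in the question).

```python
from scipy import stats

# Two-sided 5% critical value of the standard normal (z-test).
z_crit = stats.norm.ppf(0.975)

# Matching critical values of the t distribution for several
# degrees of freedom (arbitrary illustrative choices).
t_crit = {df: stats.t.ppf(0.975, df) for df in (5, 30, 100, 253)}

print(f"z: {z_crit:.4f}")
for df, crit in t_crit.items():
    print(f"t with {df:>3} df: {crit:.4f}  (excess over z: {crit - z_crit:.4f})")
```

The excess over the normal critical value shrinks steadily as the degrees of freedom grow, which is why the old tables could stop at around 30.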

The Central Limit Theorem lets us use normal-theory inference (t-tests in this case) even if the population is not normally distributed, as long as the sample sizes are large enough. This does mean that your test is approximate (but with your sample sizes, the approximation should be very good).
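As a rough sketch of what this looks like in practice, the snippet below runs a t-test with unequal variances (Welch's test) directly on two skewed samples of sizes 215 and 40. The data are synthetic: the gamma distributions and their parameters are assumptions chosen only so that the population means match the 4.2 and 15.8 mentioned in the question. NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-ins for the two groups (NOT the real data).
# The gamma parameters are assumptions chosen so that the population
# means are 4.2 and 15.8, the values mentioned in the question.
x = rng.gamma(shape=2.0, scale=2.1, size=215)
y = rng.gamma(shape=2.0, scale=7.9, size=40)

# Welch's t-test: a two-sample t-test that does not assume equal variances.
result = stats.ttest_ind(x, y, equal_var=False)

print(f"mean(x) = {x.mean():.2f}, mean(y) = {y.mean():.2f}")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

The test is applied to the original observations, not to resampled means; the Central Limit Theorem is what justifies treating the two sample means as approximately normal.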

The Wilcoxon test is not a test of means (unless you know that the populations are perfectly symmetric and other unlikely assumptions hold). If the means are the main point of interest then the t-test is probably the better one to quote.
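A small simulation can make this concrete. The sketch below assumes the "Wilcoxon test" in the question is the rank-sum (Mann-Whitney) version, since the two groups have different sizes. The two populations are hypothetical and were chosen to have the same mean but different shapes; the sample size is deliberately large so the effect is easy to see.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000  # large hypothetical sample size so the effect is easy to see

# Two populations with the same mean (1.0) but different shapes:
# a right-skewed exponential and a narrow, symmetric normal.
x = rng.exponential(scale=1.0, size=n)
y = rng.normal(loc=1.0, scale=0.1, size=n)

welch = stats.ttest_ind(x, y, equal_var=False)
mwu = stats.mannwhitneyu(x, y, alternative="two-sided")

print(f"difference in sample means: {x.mean() - y.mean():+.3f}")
print(f"Welch t-test p-value:       {welch.pvalue:.3f}")
print(f"Mann-Whitney p-value:       {mwu.pvalue:.3g}")
```

The rank-based test rejects decisively even though the population means are identical, because most exponential values fall below the values in the other group. It is responding to a difference in distribution, not to a difference in means.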

Given that your standard deviations are so different, and the shapes are non-normal and possibly different from each other, the difference in the means may not be the most interesting thing going on here. Think about the science and what you want to do with your results. Are decisions being made at the population level or the individual level? Consider this example: you are comparing two drugs for a given disease. On drug A, half the sample died immediately and the other half recovered in about a week; on drug B, all survived and recovered, but the time to recovery was longer than a week. In this case, would you really care about which mean recovery time was shorter? Or replace the half dying on A with patients who just take a really long time to recover (longer than anyone in the B group). When deciding which drug to take, I would want the full information, not just which was quicker on average.

Thank you Greg. I assume there's nothing wrong with the procedure per-se? I understand that I might not be asking the right question, but my concern is equally about the statistical test/procedure and understanding itself given two samples. I'll check if I am asking the right question and come back with questions, if any. Maybe if I explain the biological problem, it would help with more suggestions. Thanks again. — Arun, Sep 16 '11 at 20:08

score 6 · Answer 2 · answered Sep 17 '11 at 11:02

One addition to Greg's already very comprehensive answer.

If I understand you correctly, your point 3 describes the following procedure:

1. Observe $n$ samples of a distribution $X$.
2. Then, draw $m$ of those $n$ values and compute their mean.
3. Repeat this 1000 times and save the corresponding means.
4. Finally, compute the mean of those means and assume that the mean of $X$ equals the mean computed that way.

Your assumption is that the central limit theorem holds for this mean, so that the corresponding random variable will be normally distributed.

Let's look at the math behind your computation to identify the error:

We will call your samples of $X$ $X_1,\ldots,X_n$; in statistical terminology, you have $X_1,\ldots, X_n\sim X$. Now we draw samples of size $m$ and compute their mean. The $k$-th of those means looks like this:

$$ Y_k=\frac{1}{m}\sum_{i=1}^m X_{\mu^k_{i}} $$

where $\mu^k_i$ denotes the index between 1 and $n$ selected at draw $i$ of the $k$-th sample. Computing the mean of all those means thus results in

$$ \frac{1}{1000}\sum_{k=1}^{1000} \frac{1}{m}\sum_{i=1}^m X_{\mu^k_{i}} $$

To spare you the exact mathematical terminology, just take a look at this sum. What happens is that the $X_i$ are simply added to the sum multiple times. All in all, you add up $1000m$ numbers and divide the total by $1000m$. In effect, you are computing a weighted mean of the $X_i$ with random weights.
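The identity can be checked numerically. The following sketch uses only the Python standard library; the data are synthetic, and $m = 20$ is a hypothetical subsample size, since the question does not state the one actually used.

```python
import random
from collections import Counter

random.seed(0)

n, m, reps = 215, 20, 1000  # m = 20 is a hypothetical subsample size
data = [random.expovariate(1.0) for _ in range(n)]  # synthetic stand-in for X

counts = Counter()  # how often each original index is drawn overall
means = []
for _ in range(reps):
    idx = random.sample(range(n), m)  # draw m of the n values
    counts.update(idx)
    means.append(sum(data[i] for i in idx) / m)

mean_of_means = sum(means) / reps

# The same number, written as a weighted mean of the original values,
# with random weights counts[i] / (reps * m) that sum to 1.
weighted_mean = sum(counts[i] * data[i] for i in range(n)) / (reps * m)

print(mean_of_means, weighted_mean)
```

The two printed values agree up to floating-point rounding: the mean of the resampled means contains no information beyond the original $n$ observations.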

Now, however, the Central Limit Theorem states that the sum of a large number of independent random variables is approximately normal (and hence so is their mean).

Your sum above does not produce independent samples. You may have random weights, but that does not make your samples independent. Thus, the procedure described in point 3 is not valid.
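One practical consequence can be seen in a simulation. The sketch below draws both groups from the same population, so the null hypothesis is true, and then applies the resample-then-t-test procedure. The subsample size `m`, the number of trials, and the exponential population are all hypothetical choices, and the helper `resampled_means` is introduced only for this illustration. NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_x, n_y = 215, 40
m, reps, trials = 20, 1000, 200  # m and trials are hypothetical choices

def resampled_means(sample):
    # Draw `reps` subsamples of size m (with replacement) and return their means.
    return rng.choice(sample, size=(reps, m)).mean(axis=1)

rejections = 0
for _ in range(trials):
    # Both groups come from the SAME population, so the null hypothesis is true.
    x = rng.exponential(scale=1.0, size=n_x)
    y = rng.exponential(scale=1.0, size=n_y)
    p = stats.ttest_ind(resampled_means(x), resampled_means(y),
                        equal_var=False).pvalue
    rejections += p < 0.05

false_positive_rate = rejections / trials
print(f"rejection rate under a true null: {false_positive_rate:.2f}")
```

A valid 5% test should reject in roughly 5% of the trials. Here the rejection rate is far higher, because the 1000 dependent means are treated as if they were 1000 independent observations, which makes the standard error look much smaller than it really is.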

However, as Greg already stated, using a $t$-test on your original data may be approximately correct, if you are really interested in the mean.

Thank you. It seems t-test already takes care of the problem using CLT (from greg's reply which I overlooked). Thanks for pointing that out and for the clear explanation of 3) which is what I actually wanted to know. I'll have to invest more time to grasp these concepts. — Arun, Sep 17 '11 at 12:37
Keep in mind that the CLT performs differently well depending on the distribution at hand (or, even worse, the expected value or the variance of the distribution do not exist - then CLT is not even valid). If in doubt it is always a good idea to generate a distribution that looks similar to the one you observed and then simulate your test using this distribution a few hundred times. You will get a feeling on the quality of the approximation CLT supplies. — Thilo, Sep 17 '11 at 18:12
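The check suggested in this comment could be sketched as follows. The exponential distribution is only a hypothetical stand-in for "a distribution that looks similar to the one you observed", and 2000 simulated data sets are used instead of a few hundred to make the estimate more stable. NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_x, n_y, n_sim = 215, 40, 2000

# Hypothetical stand-in for the observed shape: a right-skewed exponential,
# identical in both groups so that the null hypothesis is true.
x = rng.exponential(scale=1.0, size=(n_sim, n_x))
y = rng.exponential(scale=1.0, size=(n_sim, n_y))

# One Welch t-test per simulated pair of samples (one pair per row).
p_values = stats.ttest_ind(x, y, axis=1, equal_var=False).pvalue

type_i_error = float(np.mean(p_values < 0.05))
print(f"estimated Type I error at nominal 5%: {type_i_error:.3f}")
```

If the estimated rate is close to 0.05, the normal approximation is doing its job for these sample sizes and this shape; a rate far from 0.05 would be a warning sign.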
