I was playing a bit with simulations to get a better picture of how the unpaired t-test and the Welch t-test compare on same-mean data with unequal variances. As a third test, I included the rank-sum test. I noticed that when I repeatedly compare groups of 60 and 30 normally distributed samples (mean 0, the first group having a stdev of 1, the second a stdev of 2), the rank-sum test tends to give false positives, returning p<0.05 in a fraction of 0.08-0.09 of the simulation repeats (a similar trend to the non-Welch t-test). When the two groups have the same size of 30 (with stdevs of 1 and 2 again), the false positive rate is just a shade over the expected 0.05.
Where does the high false positive rate of the rank-sum test come from in the case with uneven group sizes? I think I understand what the null hypothesis is, but I still don't immediately see how the group size affects the test.
Simulation code for just the rank-sum test, in MATLAB, is below:
nRepeats = 1e4;
%% We measure fraction of false positives for data with uneven variance, with:
% a) same-size group
% b) first group twice as large as the second one
meanVal = 0;
sd1 = 1;
sd2 = 2;
nSamples = 30;
pRSsame = zeros(nRepeats, 1);
pRSdifferent = zeros(nRepeats, 1);
for iRepeat = 1:nRepeats
    % a) same-size groups
    data1 = randn(nSamples, 1) * sd1 + meanVal;
    data2 = randn(nSamples, 1) * sd2 + meanVal;
    pRSsame(iRepeat) = ranksum(data1, data2);
    % b) data1 twice as large as data2
    data1 = randn(2*nSamples, 1) * sd1 + meanVal;
    data2 = randn(nSamples, 1) * sd2 + meanVal;
    pRSdifferent(iRepeat) = ranksum(data1, data2);
end
fractionFPsame = sum(pRSsame < 0.05)/nRepeats % ~0.055
fractionFPdifferent = sum(pRSdifferent < 0.05)/nRepeats %~0.088
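For completeness, the t-test comparison I mentioned at the start can be sketched along the same lines. This is a minimal version for the uneven-size case only, assuming `ttest2` from the Statistics and Machine Learning Toolbox; the names `pStudent`, `pWelch`, `fractionFPStudent` and `fractionFPWelch` are made up for this sketch and are not part of the code above.

```matlab
% Sketch: false positive rates of the pooled-variance (Student) and Welch
% t-tests for groups of 60 and 30 samples with stdevs of 1 and 2.
nRepeats = 1e4;
meanVal = 0;
sd1 = 1;
sd2 = 2;
nSamples = 30;
pStudent = zeros(nRepeats, 1); % hypothetical name, pooled-variance t-test
pWelch = zeros(nRepeats, 1);   % hypothetical name, Welch t-test
for iRepeat = 1:nRepeats
    data1 = randn(2*nSamples, 1) * sd1 + meanVal;
    data2 = randn(nSamples, 1) * sd2 + meanVal;
    % ttest2 assumes equal variances by default
    [~, pStudent(iRepeat)] = ttest2(data1, data2);
    % 'Vartype', 'unequal' switches to the Welch test
    [~, pWelch(iRepeat)] = ttest2(data1, data2, 'Vartype', 'unequal');
end
fractionFPStudent = mean(pStudent < 0.05)
fractionFPWelch = mean(pWelch < 0.05)
```

The only difference between the two calls is the `'Vartype'` option, so both tests see exactly the same simulated data in each repeat.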
Thank you!