
I am trying to test whether the means of two populations with lots of zeros are different. Here is a Python code example:

from scipy.stats import mannwhitneyu

import numpy as np

a = np.random.random(100)
b = np.random.random(100) * 2

aa = np.hstack((a, np.zeros(1000)))
bb = np.hstack((b, np.zeros(1000)))

np.random.shuffle(aa)
np.random.shuffle(bb)

mw_stat_1, p_value_1 = mannwhitneyu(a, b)  # 100 obs in a and b
mw_stat_2, p_value_2 = mannwhitneyu(aa, bb) # 1000 zeros added to each of a and b

# I take the mean of groups of 20 elements; aa and bb each become an array of 55 elements
samp_sum_aa = aa.reshape(-1,20).mean(axis=1)
samp_sum_bb = bb.reshape(-1,20).mean(axis=1)

mw_stat_3, p_value_3 = mannwhitneyu(samp_sum_aa, samp_sum_bb)

Result:

>>> p_value_1       # no zeros
2.5956488654494193e-09
>>> p_value_2       # zeros added
0.42124151395226317
>>> p_value_3       # using sampling
0.0020853586023447269

I find that if I do the Mann-Whitney test on the raw populations (after the zeros are added), my p-value is large; however, if I compare the means of random subsamples (groups of 20 elements), I get a p-value small enough that I can reject the null hypothesis for all practical purposes.

Is sampling a proper technique here? If so, how do I know what is the right sample size? Are there other methods to address this problem?

asked by Akavall, edited by kjetil b halvorsen
  • "Is sampling a proper technique here?" -- could you clarify what you mean by this? Also, why do you want to use Mann-Whitney in the first place? ... Note: If you're going to try taking the means of subsamples of your data then you have to assume your data is either interval-scaled or ratio-scaled (i.e. you're implicitly assuming that your data is *more* than merely **ordinal-scaled**). – Steve S Aug 13 '14 at 21:55
  • I mean: can I conclude that the means of the distributions are different because I can reject the null hypothesis when comparing means of samples? Should I be using another test? Does the CLT allow me to use a t-test here? – Akavall Aug 13 '14 at 22:20
  • Does the discussion [here](http://stats.stackexchange.com/questions/111320/should-i-use-t-test-on-highly-skewed-and-discrete-data) help? – Glen_b Aug 14 '14 at 01:42
  • Is there something specific you need to know that isn't discussed there? If yes, you should edit to emphasize those aspects of your question. If no, we should close this one as a duplicate. – Glen_b Aug 14 '14 at 02:10
  • @Glen_b, the way I understood the answer is that one approach is to re-sample the data, which I like (and which is different from what I am doing). But I am not clear on the size of the samples and the number of samples. Thank you for the help. – Akavall Aug 14 '14 at 02:36
  • If you're referring to my answer, the resampling was done simply to check whether the t-distribution was a reasonable approximation with a distribution like that. The size of the sample(s) should be the size(s) of your data sample(s) (resampling with replacement from the same distribution for both in the case where you're interested in whether the t-distribution is a good approximation under the null). But that approach is only suitable when the original samples are quite large. If they were small, I'd make a few reasonable distributional assumptions and try those instead. ... (ctd) – Glen_b Aug 14 '14 at 02:51
  • (ctd)... In any case the number of simulations/resamples used should be large enough to give a good idea of the shape of the distribution and the probability of getting a type I error. (To investigate the impact on power behavior is more tricky - you pretty much have to set up a sequence of alternatives, which pretty much requires you to make a distributional assumption to work from.) – Glen_b Aug 14 '14 at 02:53
  • @Glen_b, thanks again. My data is pretty large, about a million observation points in each group. Wouldn't the distribution of sample means be normal by the CLT? And if so, couldn't I just get, say, 1000 sample means from each population and use a t-test? – Akavall Aug 14 '14 at 03:49
  • At a million observations, unless the proportion of non-zeros is extremely close to 0, not only should you be able to apply CLT, you should be able to reasonably assume Slutsky has kicked in as well. So t-tests will be z-tests, and everything should be fine. Mann-Whitney should also work just fine, though. – Glen_b Aug 14 '14 at 04:43
  • Although, speaking more strictly, we really mean something different in both cases (since both are asymptotic), but in practice it's quite reasonable to act as if the numerator is normal and the denominator is a fixed constant. – Glen_b Aug 14 '14 at 13:54
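
To make the resampling check Glen_b describes above concrete, here is a minimal sketch (an illustration, not code from the thread), reusing the aa and bb arrays from the question. Under the null both groups are drawn from the same pooled distribution; the fraction of simulated p-values below 0.05 estimates the type I error rate, and the number of resamples just needs to be large enough to pin that fraction down.

from scipy.stats import ttest_ind
import numpy as np

pooled = np.hstack((aa, bb))    # under the null, both groups share this distribution
n_obs = len(aa)
n_resamples = 10000             # enough resamples to estimate the rejection rate well

null_p_values = np.empty(n_resamples)
for i in range(n_resamples):
    x = np.random.choice(pooled, size=n_obs, replace=True)
    y = np.random.choice(pooled, size=n_obs, replace=True)
    t_stat, null_p_values[i] = ttest_ind(x, y)

# If the t approximation is reasonable for data like this, roughly 5% of the
# simulated p-values should fall below 0.05.
print((null_p_values < 0.05).mean())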
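
For the large-sample route in the last few comments (CLT plus Slutsky, so the t-test behaves like a z-test), a sketch along these lines, with made-up sizes and an assumed proportion of non-zeros standing in for the real data, would be:

from scipy.stats import ttest_ind
import numpy as np

n = 1000000          # roughly the group size mentioned in the comments
p_nonzero = 0.1      # assumed fraction of non-zero observations (not from the thread)

big_a = np.random.random(n) * (np.random.random(n) < p_nonzero)
big_b = np.random.random(n) * 2 * (np.random.random(n) < p_nonzero)

# Welch's t-test on the raw zero-inflated data; at this sample size it is
# effectively a z-test for the difference in means.
t_stat, p_value = ttest_ind(big_a, big_b, equal_var=False)
print(t_stat, p_value)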

0 Answers