
The two density plots show sliding-window data: each window counts x occurrences in a window of size 2000, and the x-axis is a percentage (x/2000 × 100). I want to test whether these two samples (blue n=20000, red n=250) come from the same population, and specifically whether the red group has a higher associated percentage (based on a low p-value).

The distributions are not normal and have different shapes (according to a KS test), which I think rules out the Wilcoxon rank-sum test. A sliding window can become 100% saturated, hence the increased density towards 100%.
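For reference, this is roughly what the two-sample KS comparison looks like in R. The data below are simulated stand-ins (not the real samples): the blue group is given lower percentages and the red group higher ones, clipped at 100% to mimic window saturation.

```r
# Simulated stand-ins for the two samples (NOT the real data)
set.seed(1)
blue <- rbeta(20000, 2, 5) * 100                 # placeholder: lower percentages
red  <- pmin(100, rbeta(250, 8, 2) * 100 + 10)   # placeholder: higher, saturating at 100%

# Two-sample Kolmogorov-Smirnov test: compares the full empirical CDFs,
# so it picks up differences in shape as well as in location.
res <- ks.test(blue, red)
res$p.value
```

Note that clipping at 100% creates ties, so R will warn that the KS p-value is approximate; that warning is itself a symptom of the saturation issue described above.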

Questions:

Is there a test (an alternative to the Wilcoxon test) for a significant difference between these two samples, given that the distributions have different shapes?

I realise the distributions look a little odd because of the saturation of the windows; is this a problem? I used sliding windows to increase the sample size: for example, instead of using a single percentage (e.g. 67%) over 100000 for the n=250 (red) sample, I used windows of 5000, since wilcox.test(blue_data, 67) with a single value would have little power.
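A sliding-window percentage like the one described can be computed from a 0/1 occurrence vector with a cumulative-sum trick; a minimal sketch (the occurrence rate and sizes are placeholders, not the real data). One thing this makes visible: consecutive windows share almost all their data, so the resulting percentages are heavily overlapping rather than independent observations.

```r
# Placeholder binary occurrence vector (not the real data)
set.seed(2)
occ <- rbinom(100000, 1, 0.67)
win <- 2000

# Windowed sum as a difference of cumulative sums:
# sum(occ[i:(i + win - 1)]) == cs[i + win] - cs[i]
cs  <- c(0, cumsum(occ))
pct <- (cs[(win + 1):length(cs)] - cs[1:(length(cs) - win)]) / win * 100

length(pct)  # one value per (overlapping) window position
```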

Is it a problem that the data are percentages? I read that percentages can be treated as continuous in some cases, but I am not sure whether that applies here (perhaps this is count data?). I used the from=0 and to=100 parameters of R's density() function, but I am not sure whether a kernel density of percentage data is OK.
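Restricting density() to the valid percentage range, as described, would look like the sketch below (placeholder data). Note that from/to only clip the evaluation grid: the kernel still smooths mass across the 0% and 100% boundaries, which can distort the estimate near saturation.

```r
# Placeholder percentage data piling up near 100%
set.seed(3)
pct <- pmin(100, pmax(0, rnorm(5000, 80, 15)))

# from/to restrict where the density is evaluated, not how it is smoothed
d <- density(pct, from = 0, to = 100)
range(d$x)  # grid runs exactly from 0 to 100
```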

I realise the wording of this might not be the best, but hopefully someone can steer me in the right direction.

[Figure: overlaid density plots of the blue (n=20000) and red (n=250) window percentages]

  • Isn't the stark contrast in these empirical densities enough to draw the conclusion that the data are from different distributions? – whuber Mar 14 '17 at 17:33
  • In this case perhaps I could just compare the 67% in 100000 to the other sample, since I only used windows to increase the sample size to enable the Wilcoxon test – maybe I could do without this? – meld24 Mar 15 '17 at 12:26
  • But if I have many samples to compare with the blue data, is there a test that can be carried out? (Maybe there is no simple test for such data?) This would give a measure of how different they are without having to visualise the data. – meld24 Mar 15 '17 at 12:29
  • There are many simple tests you can use, starting with a t-test or a Wilcoxon test. Consider the position of a sceptic who maintains the data come from the same distribution. You apply a t-test, say, and conclude that the means are significantly different. "I don't trust that," says our sceptic, "because I think the t-test doesn't apply." Why not? "Because the t-test assumes the distributions have comparable variances and these obviously don't." Argument over: she has just conceded your point that the distributions differ. – whuber Mar 15 '17 at 13:01
  • I see, but then why would we ever go beyond a t-test? I read that the t-test can be robust to Type I error (e.g. http://stats.stackexchange.com/questions/38967/how-robust-is-the-independent-samples-t-test-when-the-distributions-of-the-sampl), but regardless, is there a case where the t-test could detect a difference when there isn't one? E.g. if the sample distribution shapes differed in a particular way that led to a false positive for the t-test, then the different shapes would be a problem in terms of incorrectly detecting a difference. – meld24 Mar 15 '17 at 16:27
  • I assume there must be, or why would the t-test have assumptions? I'm just getting a little confused, as it seems like a circular argument; but in visualising this data, it certainly holds. – meld24 Mar 15 '17 at 16:32
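The comparison whuber suggests can be sketched as follows, again with simulated stand-ins for the two samples. R's t.test defaults to Welch's version, which does not assume equal variances, and wilcox.test uses a normal approximation for samples this large.

```r
# Simulated stand-ins for the two samples (NOT the real data)
set.seed(4)
blue <- pmin(100, pmax(0, rnorm(20000, 60, 20)))
red  <- pmin(100, pmax(0, rnorm(250, 80, 10)))

# Welch two-sample t-test: is the red mean greater than the blue mean?
t_res <- t.test(red, blue, alternative = "greater")

# Wilcoxon rank-sum (Mann-Whitney): is red stochastically larger than blue?
w_res <- wilcox.test(red, blue, alternative = "greater")

c(t = t_res$p.value, wilcox = w_res$p.value)
```

With stand-in samples separated this clearly, both tests reject decisively, which is the point of the sceptic argument above: under almost any reading of the assumptions, the conclusion that the groups differ survives.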

0 Answers