
The two density plots show sliding-window data: each window counts x occurrences in a window of size 2000, and the x-axis is a percentage (x/2000 × 100). I want to test whether these two samples (blue n=20000, red n=250) come from the same population, and specifically whether the red group has a higher associated percentage (based on a low p-value).

The distributions are not normal and have different shapes (according to a KS test), which I think rules out the Wilcoxon rank-sum test. A sliding window can become 100% saturated, hence the increased density towards 100%.
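For reference, this is roughly what the two-sample KS comparison looks like in R. The data below are simulated stand-ins (not the real samples): the blue group is given lower percentages and the red group higher ones, clipped at 100% to mimic window saturation.

```r
# Simulated stand-ins for the two samples (NOT the real data)
set.seed(1)
blue <- rbeta(20000, 2, 5) * 100                 # placeholder: lower percentages
red  <- pmin(100, rbeta(250, 8, 2) * 100 + 10)   # placeholder: higher, saturating at 100%

# Two-sample Kolmogorov-Smirnov test: compares the full empirical CDFs,
# so it picks up differences in shape as well as in location.
res <- ks.test(blue, red)
res$p.value
```

Note that clipping at 100% creates ties, so R will warn that the KS p-value is approximate; that warning is itself a symptom of the saturation issue described above.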

Questions:

Is there a test (an alternative to the Wilcoxon test) for a significant difference between these two samples, given that the distributions have different shapes?

I realise the distributions look a little odd because of the saturation of the windows; is this a problem? I used sliding windows to increase the sample size: for example, instead of using a single percentage (e.g. 67%) over 100000 for the n=250 (red) sample, I used windows of 5000, since wilcox.test(blue_data, 67) with a single value would have little power.
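A sliding-window percentage like the one described can be computed from a 0/1 occurrence vector with a cumulative-sum trick; a minimal sketch (the occurrence rate and sizes are placeholders, not the real data). One thing this makes visible: consecutive windows share almost all their data, so the resulting percentages are heavily overlapping rather than independent observations.

```r
# Placeholder binary occurrence vector (not the real data)
set.seed(2)
occ <- rbinom(100000, 1, 0.67)
win <- 2000

# Windowed sum as a difference of cumulative sums:
# sum(occ[i:(i + win - 1)]) == cs[i + win] - cs[i]
cs  <- c(0, cumsum(occ))
pct <- (cs[(win + 1):length(cs)] - cs[1:(length(cs) - win)]) / win * 100

length(pct)  # one value per (overlapping) window position
```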

Is it a problem that the data are percentages? I read that percentages can be treated as continuous in some cases, but I am not sure whether that applies here (perhaps this is count data?). I used the from=0 and to=100 parameters of R's density() function, but I am not sure whether a kernel density of percentage data is OK.
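Restricting density() to the valid percentage range, as described, would look like the sketch below (placeholder data). Note that from/to only clip the evaluation grid: the kernel still smooths mass across the 0% and 100% boundaries, which can distort the estimate near saturation.

```r
# Placeholder percentage data piling up near 100%
set.seed(3)
pct <- pmin(100, pmax(0, rnorm(5000, 80, 15)))

# from/to restrict where the density is evaluated, not how it is smoothed
d <- density(pct, from = 0, to = 100)
range(d$x)  # grid runs exactly from 0 to 100
```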

I realise the wording of this might not be the best, but hopefully someone can steer me in the right direction.

[Figure: overlaid density plots of the blue (n=20000) and red (n=250) window percentages]

  • Isn't the stark contrast in these empirical densities enough to draw the conclusion that the data are from different distributions? – whuber Mar 14 '17 at 17:33
  • In this case perhaps I could just compare the 67% in 100000 to the other sample, since I only used windows to increase the sample size to enable the Wilcoxon test – maybe I could do without this? – meld24 Mar 15 '17 at 12:26
  • But if I have many samples to compare with the blue data, is there a test that can be carried out? (Maybe there is no simple test for such data?) This would give a measure of how different they are without having to visualise the data. – meld24 Mar 15 '17 at 12:29
  • There are many simple tests you can use, starting with a t-test or a Wilcoxon test. Consider the position of a sceptic who maintains the data come from the same distribution. You apply a t-test, say, and conclude that the means are significantly different. "I don't trust that," says our sceptic, "because I think the t-test doesn't apply." Why not? "Because the t-test assumes the distributions have comparable variances and these obviously don't." Argument over: she has just conceded your point that the distributions differ. – whuber Mar 15 '17 at 13:01
  • I see, but then why would we ever go beyond a t-test? I read that the t-test can be robust to Type I error (e.g. http://stats.stackexchange.com/questions/38967/how-robust-is-the-independent-samples-t-test-when-the-distributions-of-the-sampl), but regardless, is there a case where the t-test could detect a difference when there isn't one? E.g. if the sample distribution shapes differed in a particular way that led to a false positive for the t-test, then the different shapes would be a problem in terms of incorrectly detecting a difference. – meld24 Mar 15 '17 at 16:27
  • I assume there must be, or why would the t-test have assumptions? I'm just getting a little confused, as it seems like a circular argument; but in visualising this data, it certainly holds. – meld24 Mar 15 '17 at 16:32
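The comparison whuber suggests can be sketched as follows, again with simulated stand-ins for the two samples. R's t.test defaults to Welch's version, which does not assume equal variances, and wilcox.test uses a normal approximation for samples this large.

```r
# Simulated stand-ins for the two samples (NOT the real data)
set.seed(4)
blue <- pmin(100, pmax(0, rnorm(20000, 60, 20)))
red  <- pmin(100, pmax(0, rnorm(250, 80, 10)))

# Welch two-sample t-test: is the red mean greater than the blue mean?
t_res <- t.test(red, blue, alternative = "greater")

# Wilcoxon rank-sum (Mann-Whitney): is red stochastically larger than blue?
w_res <- wilcox.test(red, blue, alternative = "greater")

c(t = t_res$p.value, wilcox = w_res$p.value)
```

With stand-in samples separated this clearly, both tests reject decisively, which is the point of the sceptic argument above: under almost any reading of the assumptions, the conclusion that the groups differ survives.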

0 Answers