
Consider two sets of data points, A and B. Both sets are drawn from mixtures of an unknown number of Gaussians. The means of the Gaussians differ slightly between the two sets (some may overlap or be very close to each other). However, in both cases the variances of all the Gaussians are small. Now, given a set of data points C, how can we estimate whether C comes from A or from B? I understand there are many methods to do so: is there a way to tell which method is the most efficient? This is a very broad question, so specifically, can we compare the KS test https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test and the Wasserstein metric https://en.wikipedia.org/wiki/Wasserstein_metric for this problem? Is there a way to prove that the KS test or the Wasserstein metric would give the better estimate?

It appears to me that the cumulative distribution is not smooth, so the Wasserstein metric would be better. Is that true?
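To make the comparison concrete, here is a minimal simulation sketch, assuming the data are one-dimensional. The component means, the common standard deviation, the sample sizes, and the helper names `sample_mixture` and `assign` are all made up for illustration. It assigns C to whichever of A or B is closer under the two-sample KS statistic and under the 1-Wasserstein distance, then counts how often each rule is right when C really comes from A.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance


def sample_mixture(rng, means, sigma, n):
    """Draw n points from an equal-weight 1-D Gaussian mixture."""
    means = np.asarray(means)
    components = rng.integers(0, len(means), size=n)
    return rng.normal(means[components], sigma)


def assign(c, a, b):
    """Label C with the closer reference sample under each distance."""
    ks_a = ks_2samp(c, a).statistic
    ks_b = ks_2samp(c, b).statistic
    w_a = wasserstein_distance(c, a)
    w_b = wasserstein_distance(c, b)
    return {
        "ks": "A" if ks_a < ks_b else "B",
        "wasserstein": "A" if w_a < w_b else "B",
    }


rng = np.random.default_rng(0)

# Hypothetical parameters: close means, small variance.
means_a = [0.0, 1.0, 2.0]
means_b = [0.1, 1.1, 2.1]
sigma = 0.05

a = sample_mixture(rng, means_a, sigma, 2000)
b = sample_mixture(rng, means_b, sigma, 2000)

# Repeatedly draw a small C from A's mixture and record correct assignments.
trials = 100
hits = {"ks": 0, "wasserstein": 0}
for _ in range(trials):
    c = sample_mixture(rng, means_a, sigma, 50)
    labels = assign(c, a, b)
    for name in hits:
        hits[name] += int(labels[name] == "A")

accuracy = {name: count / trials for name, count in hits.items()}
print(accuracy)
```

A simulation like this only compares the two rules under the assumed parameters; it does not prove that either one is better in general.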

Creator
  • A few questions. 1) You refer to A and B as sets of distributions, then later as distributions themselves. Which case are you asking about? 2) What does it mean to estimate whether a distribution is from another distribution? 3) Since you use the word estimate, are these distributions themselves fit to sampled datapoints? It might help clarify things to say a little more about the goals of this analysis – user20160 Sep 18 '18 at 02:54
  • @user20160 Thanks a lot, very good point. Is it clear now? – Creator Sep 18 '18 at 03:01
  • Yes, it's clearer now. Two remaining questions. 1) Are A and B known to be generated by different distributions? 2) "how to estimate C is from A or from from B?" Does this mean you want to determine whether C was generated by the same distribution that generated A or the one that generated B? On a related note: do you know that C was definitely generated by one of these two distributions (but are uncertain as to which)? – user20160 Sep 18 '18 at 03:54
  • @user20160 Yes to 1): A and B are known to be generated from different distributions (physically different but close). 2) Yes, I want to know where C was generated from. Yes, it must have been generated from either A or B, nothing else. – Creator Sep 18 '18 at 03:59
  • Cross-posted: https://mathoverflow.net/q/310928/37212, https://math.stackexchange.com/q/2920955/14578, https://stats.stackexchange.com/q/367372/2921. Please [do not post the same question on multiple sites](https://meta.stackexchange.com/q/64068). Each community should have an honest shot at answering without anybody's time being wasted. – D.W. Apr 16 '19 at 06:01
  • mathoverflow.net and math.stackexchange.com questions are closed, leaving only this one (the stats.stackexchange post). – Sterling Sep 18 '21 at 22:43
  • Also, related: [What is the advantages of Wasserstein metric compared to Kullback-Leibler divergence?](https://stats.stackexchange.com/q/295617/293880) – Sterling Sep 18 '21 at 22:49

0 Answers