
Is there a way to compare the distributions of different sets of samples? For example, I have three sets:

  • X1 = N(0, 1);
  • X2 = N(0.5, 1);
  • X3 = N(1, 1).

Each set is drawn from a specific (and unknown) distribution. I use Gaussian kernel density estimation (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html) to estimate the three pdfs. Now, I would like to measure the similarity (or divergence) between the three pdfs and obtain, as a result, that X1 is more similar to X2 than to X3.
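
For concreteness, here is a minimal sketch of my current setup (the sample sizes, the seed, and the evaluation grid are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Draw the three example samples (in practice these are the observed
# data; the generating distributions are treated as unknown).
x1 = rng.normal(0.0, 1.0, size=1000)
x2 = rng.normal(0.5, 1.0, size=1000)
x3 = rng.normal(1.0, 1.0, size=1000)

# Fit one Gaussian KDE per sample. Each fitted object is a callable
# density estimate, so kde1(grid) evaluates the estimated pdf.
kde1, kde2, kde3 = gaussian_kde(x1), gaussian_kde(x2), gaussian_kde(x3)

grid = np.linspace(-4.0, 5.0, 500)
p1, p2, p3 = kde1(grid), kde2(grid), kde3(grid)
```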

How can I do it? Or are there other ways to i) estimate a pdf and ii) compare different pdfs?

Luca
  • If the distribution is unknown, then you don't have a population. What it sounds like you want to do is compare the KDE approximation of X1 to the KDE approximation of X2. Is that correct? If so, why do you want to do that instead of comparing the samples drawn from your N(0,1), N(0.5,1), and N(1,1) distributions? – Dave Aug 12 '19 at 15:47
  • With "unknown distribution" I meant that I don't know, e.g., that X1 is sampled from N(0,1). Yes, you are right, I can compute the distance between X1 and X2, for example, but I was also looking for a way to measure the similarity between the KDEs – Luca Aug 12 '19 at 16:01
  • If you do a KDE, you basically have a population and can run the population-level equations (such as Kullback-Leibler divergence). Just remember that you're not actually dealing with a population, just that you have an equation of the pdf that you can integrate (a sketch of this appears after the comments). But I'm really curious why you want to compare the KDEs rather than the samples. The KDEs aren't the true populations. – Dave Aug 12 '19 at 17:26
  • Because it does not seem trivial to calculate metrics such as the KL divergence in continuous spaces. From what I have seen, you can compute it easily if you, for example, assume a multivariate normal distribution, so that the equation can be derived in closed form (e.g., https://stats.stackexchange.com/questions/60680/kl-divergence-between-two-multivariate-gaussians). – Luca Aug 16 '19 at 13:34
  • Your KDE won't be normal, and it doesn't seem like you're working in a multivariate setting, either. Still, I don't see why you want to compare the KDEs instead of the samples. – Dave Aug 16 '19 at 13:51
  • I actually work with both univariate and multivariate samples. But let's keep it simple: how can I easily compute it in the univariate setting? I just haven't found a solution for doing it directly between two populations – Luca Aug 16 '19 at 13:55
  • There are a few issues. First, why do you want to evaluate KL divergence on the KDEs? Second, you have samples, not populations. – Dave Aug 16 '19 at 14:02
  • What I have is different sets of samples (X1, X2, X3) drawn from three unknown distributions. I use KDE to estimate a pdf and for future re-sampling. I also need to study how these unknown distributions differ (e.g., their similarity or divergence). – Luca Aug 16 '19 at 14:07
  • You are right, I have samples and not populations. I used the wrong terminology; I'll fix it now – Luca Aug 16 '19 at 14:09
  • If you're going to sample from the KDEs, then you have the populations generating your samples. – Dave Aug 16 '19 at 14:10
  • Still, I don't understand how to do it. In that case, I need to use external libraries that solve integrals, don't I? – Luca Aug 16 '19 at 14:14
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/97494/discussion-between-dave-and-luca). – Dave Aug 16 '19 at 14:15
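
A minimal sketch of the approach Dave describes, treating each fitted gaussian_kde as a density whose pdf can be integrated numerically, might look like the following (the helper name kl_divergence, the integration window, and the sample sizes are illustrative assumptions, not anything from the thread above):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gaussian_kde

def kl_divergence(kde_p, kde_q, lo=-10.0, hi=10.0):
    """Approximate KL(p || q) = int p(x) log(p(x) / q(x)) dx by
    numerical integration over [lo, hi]; the KDE densities are
    effectively zero outside a wide enough finite window."""
    def integrand(x):
        p = kde_p(x)[0]
        q = kde_q(x)[0]
        # Guard against log(0) where either density underflows.
        if p <= 0.0 or q <= 0.0:
            return 0.0
        return p * np.log(p / q)
    return quad(integrand, lo, hi, limit=200)[0]

rng = np.random.default_rng(0)
kde1 = gaussian_kde(rng.normal(0.0, 1.0, size=1000))
kde2 = gaussian_kde(rng.normal(0.5, 1.0, size=1000))
kde3 = gaussian_kde(rng.normal(1.0, 1.0, size=1000))

print(kl_divergence(kde1, kde2))  # should be noticeably smaller ...
print(kl_divergence(kde1, kde3))  # ... than this one
```

As a sanity check: for two unit-variance Gaussians the closed-form value is KL = (μ1 − μ2)² / 2, i.e. 0.125 for X1 vs. X2 and 0.5 for X1 vs. X3, and the KDE-based estimates should land close to those numbers.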

0 Answers