Let's say I have two or more sample populations of n-dimensional continuous-valued vectors. Is there a nonparametric way to test if these samples are from the same distribution? If so, is there a function in R or Python for this?
The Kolmogorov-Smirnov test is a typical non-parametric tool for testing whether two distributions are the same. I'm not familiar with it, but Wikipedia refers to *Justel, A., Peña, D. and Zamar, R. (1997) A multivariate Kolmogorov-Smirnov test of goodness of fit, Statistics & Probability Letters, 35(3), 251-259* for a multivariate extension of this test. – Macro Sep 25 '13 at 17:00
There is a CV question addressing this in two dimensions: http://stats.stackexchange.com/questions/25946/goodness-of-fit-for-2d-histograms . Even in two dimensions, there is no standard way to do it. – Flounderer Sep 26 '13 at 01:32
5 Answers
I did a lot of research on multivariate two-sample tests after realizing that the Kolmogorov-Smirnov test isn't multivariate. I looked at the chi-squared test, Hotelling's T², Anderson-Darling, the Cramér-von Mises criterion, Shapiro-Wilk, etc. You have to be careful, because some of these tests require the samples being compared to have the same size, and others (such as Shapiro-Wilk) only test for normality rather than comparing two sample distributions.
The leading solution, Peacock's test, compares the two samples' empirical cumulative distribution functions over all possible orderings, which, as you may suspect, is very computationally intensive: on the order of minutes for a single run on samples containing a few thousand records:
https://cran.r-project.org/web/packages/Peacock.test/Peacock.test.pdf
As Xiao's documentation states, the Fasano and Franceschini test is a variant of the Peacock test:
http://adsabs.harvard.edu/abs/1987MNRAS.225..155F
The Fasano and Franceschini test was specifically intended to be less computationally intensive, but I have not found an implementation of their work in R.
For those of you who want to explore the computational aspects of the Peacock versus the Fasano and Franceschini test, check out *Computationally efficient algorithms for the two-dimensional Kolmogorov–Smirnov test*.
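For intuition, here is a small Python sketch of the quadrant-scanning idea behind the Fasano and Franceschini variant. This is a simplified illustration, not the published procedure (it omits their pruning and the significance calibration), so treat the function name and details as assumptions:

```python
import numpy as np

def ks2d_statistic(x, y):
    """Sketch of a 2-D KS-style statistic: center the four quadrants on
    every data point and take the largest difference between the two
    samples' quadrant fractions."""
    d = 0.0
    for pts in (x, y):
        for (cx, cy) in pts:
            for sx in (np.less, np.greater_equal):
                for sy in (np.less, np.greater_equal):
                    fx = np.mean(sx(x[:, 0], cx) & sy(x[:, 1], cy))
                    fy = np.mean(sx(y[:, 0], cx) & sy(y[:, 1], cy))
                    d = max(d, abs(fx - fy))
    return d

rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(200, 2))
b = rng.normal(0, 1, size=(200, 2))  # same distribution as a
c = rng.normal(2, 1, size=(200, 2))  # shifted distribution
print(ks2d_statistic(a, b))  # small: samples overlap
print(ks2d_statistic(a, c))  # much larger: samples separated
```

Even this naive version is quadratic per quadrant scan, which hints at why the full Peacock test (all orderings) gets expensive so quickly.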

Nice and concise, AdamO. The Peacock test seems downright silly in not doing pruning, as Fasano and Franceschini do. Let's hope someone decides to code it up one day for R. It's particularly helpful for speed when you have records further decomposed, maybe by a categorical variable, and want to see if your decompositions are in fact drawn from different distributions. – L Fischman Feb 14 '18 at 21:26
**An additional resource in response to @L-Fischman's answer:** For those of you looking for an **R** solution in the 2-D case, the Fasano-Franceschini test (1987), a 2-D Kolmogorov-Smirnov (KS) two-sample test shown to be a less computationally expensive version of the Peacock test (1983), has recently been implemented. The `fasano.franceschini.test` package can be downloaded directly from CRAN. **R code implementation:** [https://nesscoder.github.io/fasano.franceschini.test/](https://nesscoder.github.io/fasano.franceschini.test/) **Manuscript/documentation:** [https://arxiv – Jun 18 '21 at 18:46
Yes, there are nonparametric ways of testing whether two multivariate samples are from the same joint distribution. I will mention approaches beyond the ones covered by L Fischman. The basic problem you are asking about is known as the 'two-sample problem', and a good deal of research on it is currently appearing in journals such as the Journal of Machine Learning Research and the Annals of Statistics. With my limited knowledge of this problem, I can offer the following directions:
- One recent way of testing multivariate sample sets is through the Maximum Mean Discrepancy (MMD); related literature: Arthur Gretton 2012, Bharath 2010, and others. Other related methods can be found in these research articles. If interested, please go through the articles citing these, to get a picture of the state of the art on this problem. And yes, there are R implementations for this.
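As a rough illustration of the idea, a biased MMD² estimate with an RBF kernel takes only a few lines of NumPy. This is a sketch under assumed defaults (fixed bandwidth, no p-value): Gretton et al. describe unbiased estimators, bandwidth heuristics, and permutation-based significance tests, which a real analysis should use.

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def k(a, b):
        # pairwise squared Euclidean distances, then Gaussian kernel
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(1)
a = rng.normal(0, 1, size=(100, 3))
b = rng.normal(0, 1, size=(100, 3))  # same distribution as a
c = rng.normal(1, 1, size=(100, 3))  # shifted distribution
print(mmd2_rbf(a, b))  # close to zero
print(mmd2_rbf(a, c))  # clearly positive
```

To turn the statistic into a test, one would shuffle the pooled sample and recompute the statistic many times to estimate a permutation null.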
If your interest is in comparing various point sets (sample sets) with a reference point set, to see how closely they approximate it, you can use f-divergences.
- One popular special case is the Kullback-Leibler divergence, which is used in many machine learning settings. It can be estimated nonparametrically in two ways: through the Parzen-window (kernel) approach and through k-nearest-neighbor density estimators.
There may also be other approaches; this answer is in no way a comprehensive treatment of your question ;)

The R package np (nonparametric) has a test for equality of densities of continuous and categorical data using the integrated squared density difference of Li, Maasoumi, and Racine (2009). Section 6 of its documentation also covers nonparametric conditional PDF estimation.

As I am working on the same problem, I can share some of my insights so far (which are far from expertise). You are asking for a test that answers whether two samples are drawn from the same distribution. A related question asked frequently in testing is whether two samples are drawn from distributions with an identical expected value.
In this framework, sample distribution sizes are often referred to as $n_1$ and $n_2$, whereas the dimension of the data is referred to as $p$.
In the test referring to the location only, the hypotheses are: \begin{equation} H_0: \hspace{1cm} \mu_F=\mu_G \\ H_1: \hspace{1cm} \mu_F \neq \mu_G \end{equation} Implementations for $p \gg 1$ are:
python: https://hotelling.readthedocs.io/en/latest/. I could only access the code via the GitHub repository https://github.com/dionresearch/hotelling, but perhaps you will have more luck.
R: https://www.rdocumentation.org/packages/highmean/versions/3.0. The respective paper is: https://academic.oup.com/biomet/article/103/3/609/1744173?login=true
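For reference, the classical two-sample Hotelling $T^2$ statistic is easy to write out. This is a sketch of the textbook low-dimensional test (valid when $n_1 + n_2 \gg p$ and assuming equal covariances), not of the high-dimensional variants the links above implement:

```python
import numpy as np
from scipy import stats

def hotelling_t2(x, y):
    """Classical two-sample Hotelling T^2 test with pooled covariance.
    Returns (T^2, p-value) via the exact F transformation."""
    n1, p = x.shape
    n2 = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # pooled sample covariance
    s = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s, diff)
    f = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    pval = stats.f.sf(f, p, n1 + n2 - p - 1)
    return t2, pval

rng = np.random.default_rng(2)
a = rng.normal(0, 1, size=(60, 4))
b = rng.normal(0, 1, size=(60, 4))    # same mean as a
c = rng.normal(0.8, 1, size=(60, 4))  # shifted mean
print(hotelling_t2(a, b)[1])  # p-value for identical means
print(hotelling_t2(a, c)[1])  # p-value under a mean shift: tiny
```

Note this breaks down when $p$ approaches $n_1 + n_2$ (the pooled covariance becomes singular), which is exactly the regime the high-dimensional packages target.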
There is a lot of research going on in this area; you may want to use Connected Papers to explore it: https://www.connectedpapers.com/main/3c14196155b1e9def9241a841e359e6054a4d44b/A-Simple-TwoSample-Test-in-High-Dimensions-Based-on-L2Norm/graph
For the first type of test (equality of the distributions themselves), the hypotheses are: \begin{equation} H_0: \hspace{1cm} F(x)=G(x) \\ H_1: \hspace{1cm} F(x)\neq G(x) \end{equation} There were some approaches in the 1980s based on MSTs (minimum spanning trees) and nearest-neighbor search, for example: https://amstat.tandfonline.com/doi/abs/10.1080/01621459.1986.10478337#.YEjb79wo9EY I think this approach has been dropped, but I would be happy to be proven wrong.
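The MST idea (Friedman and Rafsky's multivariate runs test) is easy to prototype with scipy. This sketch computes only the raw count of cross-sample MST edges and omits the null-distribution calibration the paper provides, so the function name and demo are illustrative:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def friedman_rafsky_runs(x, y):
    """Count MST edges joining points from different samples.
    Few cross-sample edges suggest the two samples are separated,
    i.e. drawn from different distributions."""
    z = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x)), np.ones(len(y))]
    dist = squareform(pdist(z))          # pooled pairwise distances
    mst = minimum_spanning_tree(dist).tocoo()
    return int(np.sum(labels[mst.row] != labels[mst.col]))

rng = np.random.default_rng(3)
a = rng.normal(0, 1, size=(100, 3))
b = rng.normal(0, 1, size=(100, 3))  # same distribution as a
c = rng.normal(2, 1, size=(100, 3))  # shifted distribution
print(friedman_rafsky_runs(a, b))  # many cross edges: well mixed
print(friedman_rafsky_runs(a, c))  # few cross edges: separated
```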

In summary, this is hard!
So it is useful to step back from the abstract question. Why do you want to compare these distributions? Perhaps this goal can be met some other way.
For example, one reason for doing this is when training a GAN. In this situation the training is iterative and stochastic. So it is sufficient to use a stochastic approximation to the answer, which can be done as follows: each time you want to measure the distance between the distributions, choose a random projection to one dimension. Then calculate the Kolmogorov-Smirnov metric for the two projected distributions.
Apologies, I forget the reference for this method, which was invented by someone smarter than me.
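The trick described above can be sketched in a few lines, assuming scipy is available; averaging over several projections (an assumption of this sketch, to tame the variance of a single draw) gives a steadier distance signal:

```python
import numpy as np
from scipy.stats import ks_2samp

def sliced_ks(x, y, rng):
    """Project both samples onto one random direction and return the
    1-D two-sample KS statistic of the projections."""
    v = rng.normal(size=x.shape[1])
    v /= np.linalg.norm(v)
    return ks_2samp(x @ v, y @ v).statistic

rng = np.random.default_rng(4)
a = rng.normal(0, 1, size=(500, 10))
b = rng.normal(0, 1, size=(500, 10))  # same distribution as a
c = rng.normal(1, 1, size=(500, 10))  # shifted distribution
d_ab = np.mean([sliced_ks(a, b, rng) for _ in range(20)])
d_ac = np.mean([sliced_ks(a, c, rng) for _ in range(20)])
print(d_ab)  # small: projections look alike
print(d_ac)  # larger: the shift survives random projection
```

In a training loop you would typically draw a single fresh projection per step rather than averaging, accepting noise in exchange for speed.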
