How to check whether two image datasets come from the same distribution?

Question

In the literature on transfer learning and domain adaptation, everyone talks about two datasets having different feature spaces and different distributions. For image datasets, I think I understand what is meant by a difference in feature space: it is basically the dimensions of our images. (For example, if each image is greyscale and 4x4, we have 16 dimensions. A 5x5 RGB image has $5\cdot5\cdot3=75$ dimensions.)

What confuses me is the distribution of image datasets. If I have a dataset of 1000 greyscale images and each image is 4x4 (16 pixels), then we have 1000 points in a feature space with 16 dimensions. Do I understand correctly that by using these points, we can then estimate the underlying distribution we are sampling from?

Secondly, in case one needs to check whether two image datasets come from different distributions, how can this be achieved?

I appreciate your guidance.

score 0 · Answer 1 · answered Aug 03 '21 at 21:12

Your understanding is exactly correct: if you have 1000 greyscale images of dimensions 4x4, then those are 1000 sample points in a 16-dimensional space. You can then try to use these 1000 images to estimate the density of the underlying distribution, although accurate density estimation in high-dimensional spaces is generally a hard problem.
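
To make the "images as points" view concrete, here is a minimal sketch in Python with NumPy. The data are random stand-ins (an assumption, not a real dataset), and fitting a single multivariate Gaussian is only one very crude way to estimate a density; it is shown purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 1000 greyscale 4x4 images with pixel values in [0, 1).
images = rng.random((1000, 4, 4))

# Flatten each image into one point in a 16-dimensional feature space.
X = images.reshape(len(images), -1)

# Crude parametric density estimate: the mean and covariance of a single Gaussian.
mean = X.mean(axis=0)          # one value per pixel
cov = np.cov(X, rowvar=False)  # pixel-by-pixel covariance matrix

print(X.shape, mean.shape, cov.shape)
```

For realistic image sizes the number of dimensions grows quickly, which is why the caveat about high-dimensional density estimation matters in practice.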

I have discussed some methods to detect / quantify differences between such sampling distributions in this answer.
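
The linked answer is not reproduced here, so the following is not necessarily one of the methods it covers. As one possible illustration, a kernel two-sample test based on the maximum mean discrepancy (MMD) with a permutation p-value can be sketched as below. The function names, the RBF kernel, the median-heuristic bandwidth, and the number of permutations are all assumptions made for this example.

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    norms = (Z ** 2).sum(axis=1)
    sq = norms[:, None] + norms[None, :] - 2.0 * Z @ Z.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding

def mmd2(K, n):
    """Biased estimate of squared MMD; the first n rows of K belong to sample X."""
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

def mmd_permutation_test(X, Y, n_perm=200, seed=0):
    """Return (observed squared MMD, permutation p-value) for samples X and Y.

    X and Y are arrays of flattened images, one image per row.
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = len(X)

    # RBF kernel with bandwidth set by the median heuristic (an assumption).
    sq = squared_distances(Z)
    median_sq = np.median(sq[np.triu_indices_from(sq, k=1)])
    K = np.exp(-sq / median_sq)

    observed = mmd2(K, n)

    # Under the null hypothesis the group labels are exchangeable, so
    # shuffling them gives a reference distribution for the statistic.
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(len(Z))
        if mmd2(K[np.ix_(p, p)], n) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

A small p-value suggests the two sets of flattened images were not drawn from the same distribution; a large one only means this test found no evidence of a difference. For real images it is common to run such a test on learned features rather than raw pixels, but that choice is outside what the answer states.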
