
Out-of-distribution data and shifting data distributions are two types of dataset shift [1]. I can understand what out-of-distribution means, but not what a shifting data distribution is. In that blog, an example of OOD is given as follows:

For example, consider what happens when a cat-versus-dog image classifier is shown an image of an airplane.

In the $X \rightarrow Y$ problem, the airplane is out of the distribution of $Y$, that is, $\text{airplane}\not\in \{Y\}$. But when it comes to distribution shift, I cannot get my head around it. Does it mean that the ratio of cats to dogs in the training set differs from that in the test set?

I have done some digging and thought that a shifting data distribution is the same thing as covariate shift, and I read this blog and found the following definition:

Covariate shift appears only in $X \rightarrow Y$ problems, and is defined as the case where $P_{tra}(y|x) = P_{tst}(y|x)$ and $P_{tra}(x)\neq P_{tst}(x)$.

I wonder how we can measure $P(x)$: every instance is different from every other, so wouldn't the resulting distribution simply be uniform?

To be more concrete: if we draw two samples randomly from an image corpus containing dogs and cats, the two samples follow the same distribution (in fact, I am not certain about this). But if we don't know whether the two samples were drawn randomly from the corpus, how can we determine that the two distributions are not the same, given that any two images in that corpus would be different from each other?

Does detecting covariate shift amount to checking whether the two samples were drawn randomly from the same population, since otherwise the two distributions (over features that we are not even sure are important for predicting $Y$) would be different?

In this presentation, I learned about three methods for detecting covariate shift: visualization, membership modeling, and uncertainty quantification, but I don't understand how they relate to the distributions.

Lerner Zhang

2 Answers


I think this problem is unanswerable, and here is what the authors of Dataset Shift in Machine Learning say about the distribution of the independent variables in Section 2.2:

distributions are simply representations of our state of belief ... I shall keep to the standard way of speaking throughout the chapter, as though distributions are things belonging to the world, bearing in mind that there are ways of rephrasing matters to keep the de Finettian happy.

In principle, the dimensions of the distribution would include everything about the input, but in practice we can exclude factors that are irrelevant to the problem.

Lerner Zhang

You correctly pointed out that to detect covariate shift, you need to compare the distributions $p_{\text{train}}(x)$ and $p_{\text{test}}(x)$. There is, as far as I know, no simple one-size-fits-all solution to this problem, but there are quite a few tools you can use.

To clear up one confusion first: of course, you cannot judge whether the two distributions are identical or different based on individual samples. To stay with your example, you cannot draw one image from each distribution and, based on those two images alone, state whether the distributions are identical or different. To make statements about the two distributions and their relationship, you need to draw many samples, so that you can observe each distribution with sufficient accuracy. In high-dimensional spaces (as with images, for example), this can mean a very large number of samples.

In simple problems with 1-2 dimensions (to some degree also in 3 dimensions), you can actually look at histograms or kernel density estimates, which will simply show you whether the distributions are similar or different. In high-dimensional spaces, looking at such plots (and computing kernel density estimates) unfortunately becomes impractical.
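
For a single feature, this is easy to sketch. The snippet below (Python with numpy/scipy; the data is synthetic and stands in for real feature values) compares a training marginal and a test marginal via histograms on a shared grid and via kernel density estimates:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=2000)  # training-set feature values
x_test = rng.normal(loc=0.5, scale=1.2, size=2000)   # test-set values, slightly shifted

# Histograms on a common grid make the two marginals directly comparable.
bins = np.linspace(-5, 6, 40)
hist_train, _ = np.histogram(x_train, bins=bins, density=True)
hist_test, _ = np.histogram(x_test, bins=bins, density=True)

# Kernel density estimates give a smoothed view of the same comparison.
kde_train = gaussian_kde(x_train)
kde_test = gaussian_kde(x_test)
grid = np.linspace(-5, 6, 200)
print("max density gap:", np.max(np.abs(kde_train(grid) - kde_test(grid))))
```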

Generally, distance measures between distributions, such as the very popular Kullback-Leibler divergence or the Wasserstein metric, are the theoretical tool of choice. Of course, to apply these distance measures you first need a representation of the two distributions. Unfortunately, accurate density estimation in high-dimensional spaces (such as image spaces) is a challenging problem in its own right, but many methods have been proposed in the literature.
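
As a minimal sketch of how such distances can be computed in practice, the example below works in one dimension with synthetic samples: scipy's `wasserstein_distance` operates directly on the samples, while the KL divergence is approximated from binned density estimates. In high dimensions you would substitute a proper density estimator for the histograms.

```python
import numpy as np
from scipy.stats import wasserstein_distance, entropy

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=5000)
x_test = rng.normal(0.5, 1.2, size=5000)

# The Wasserstein (earth mover's) distance can be computed directly from the samples.
print("Wasserstein distance:", wasserstein_distance(x_train, x_test))

# The KL divergence needs explicit density estimates; here, shared histogram bins.
bins = np.linspace(-5, 6, 60)
p, _ = np.histogram(x_train, bins=bins, density=True)
q, _ = np.histogram(x_test, bins=bins, density=True)
eps = 1e-12  # avoid empty bins blowing up the logarithm
print("KL(train || test):", entropy(p + eps, q + eps))
```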

One very practical "engineering approach" is to train a classifier to distinguish between the training and test samples. The separation performance of the classifier can then serve as an indicator of the presence and strength of covariate shift. See, e.g., this paper.
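
A rough sketch of this idea, using scikit-learn and synthetic features as placeholders for whatever representation you actually work with: label training inputs as 0 and test inputs as 1, fit a classifier, and look at its cross-validated AUC. An AUC near 0.5 means the classifier cannot tell the two sets apart (no detectable shift), while an AUC well above 0.5 points to covariate shift.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 10))  # "training" inputs
X_test = rng.normal(0.3, 1.0, size=(1000, 10))   # "test" inputs with a shifted mean

# Pool the inputs and label their origin: 0 = training set, 1 = test set.
X = np.vstack([X_train, X_test])
domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

# The cross-validated AUC of the domain classifier measures how separable the two sets are.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(clf, X, domain, cv=5, scoring="roc_auc").mean()
print(f"domain-classifier AUC: {auc:.3f}")
```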

Finally, also cf. this very similar question on datascience.SE.

jhin