3

Let's say one has a finite collection of i.i.d. samples from an unknown source distribution $S=\{x_{i} | i \in [1,n_{S}], x_{i} \sim p_{X_{S}}(x)\}$. Where each $x$ is multidimensional and has continuous and discrete components.

One also has another finite collection of i.i.d. samples from an unknown target distribution $T=\{x_{j} | i \in [1,n_{T}], x_{j} \sim p_{X_{T}}(x) \in \mathbb{R}^{d}\}$ with matching features (same dimensions and same levels/range for each component).

What would be a mathematical method to optimally subsample $S$ into a collection $S'$ so that $S'$ and $T$ are the closest statistically (in whatever mathematical sense where this problem is the easiest : E.M distance/KL, etc.) ?

All I can think of is some kind of rejection sampling method but the details of it are not clear to me.

Is it possible to add some constraints over $S'$ like its size ? (sequential sampling method ?)

I am looking for ideas or pointers towards literature.

jeandut
  • 136
  • 5
  • 15
  • Sample S or select a subset of S? If you are sampling S you cannot choose the elements, and thus the only thing you can control (on expectation) is the size of the subsample. If you are selecting a subset of S you would have much more control (I think) – Jacques Wainer Nov 09 '20 at 15:03
  • By subsampling I meant either subsample S in a probabilistic fashion or select a subset of S deterministically. I am not familiar with the literature so I don't know which way is simpler/makes more sense. – jeandut Nov 10 '20 at 20:12

0 Answers0