0

Suppose we have access to samples from two probability distributions $P$ and $Q$ which may be arbitrary and high dimensional but are over the same domain $\mathbb{X}$ (for example $P$ and $Q$ may be distributions over $N\times N$ images).

Now suppose we have access to a deterministic oracle $f: \mathbb{X}\mapsto \{0,1\}$. So for any distribution $D$ over $\mathbb{X}$, $f(D)$ is a Bernoulli distribution with some unknown parameter.

We assume that for our oracle $f$ $$f(P)= f(Q)\iff P = Q$$

The main question we'd like to answer is given some observed samples from $P$ and $Q$ can be conclude that $P\neq Q$ with a certain confidence threshold: this is hypothesis testing.

However since $P$ and $Q$ are high dimensional and we may only have access to a sparse set of samples making direct hypothesis testing is not feasible. We instead turn to our oracle $f$.

My question is if we are given enough samples from $f(P)$ and $f(Q)$ that we can state with confidence greater then $1-\alpha$ that $f(P)\neq f(Q)$ (i.e by something like Fishers Exact Test) does that in general say anything about the confidence we have that $P\neq Q$.

My thought is that in general a test for $f(P)\neq f(Q)$ may say arbitrarily little about the true relationship between $P$ and $Q$. I'm wondering perhaps what assumptions (if any) need to be made about $P$ and $Q$ such that high confidence that $f(P)\neq f(Q)$ implies high confidence that $P\neq Q$. I'm also curious if there is any way to compare how sample efficient it is to test $f(P)\neq f(Q)$ for oracles that map to $\{0,1\}$ vs oracles that map onto the reals.

Apologies if this is a very general question. Hopefully if no easy answer can be given someone can point me towards some resources around analyzing statistical tests on high dimensional distributions. Thanks

user2757771
  • 131
  • 1
  • 3

0 Answers0