Hypothesis Testing on Derived Distributions

Question

Suppose we have access to samples from two probability distributions $P$ and $Q$ which may be arbitrary and high dimensional but are over the same domain $\mathbb{X}$ (for example $P$ and $Q$ may be distributions over $N\times N$ images).

Now suppose we have access to a deterministic oracle $f: \mathbb{X}\mapsto \{0,1\}$. So for any distribution $D$ over $\mathbb{X}$, $f(D)$ is a Bernoulli distribution with some unknown parameter.

We assume that for our oracle $f$ $$f(P)= f(Q)\iff P = Q$$

The main question we'd like to answer is given some observed samples from $P$ and $Q$ can be conclude that $P\neq Q$ with a certain confidence threshold: this is hypothesis testing.

However since $P$ and $Q$ are high dimensional and we may only have access to a sparse set of samples making direct hypothesis testing is not feasible. We instead turn to our oracle $f$.

My question is if we are given enough samples from $f(P)$ and $f(Q)$ that we can state with confidence greater then $1-\alpha$ that $f(P)\neq f(Q)$ (i.e by something like Fishers Exact Test) does that in general say anything about the confidence we have that $P\neq Q$.

My thought is that in general a test for $f(P)\neq f(Q)$ may say arbitrarily little about the true relationship between $P$ and $Q$. I'm wondering perhaps what assumptions (if any) need to be made about $P$ and $Q$ such that high confidence that $f(P)\neq f(Q)$ implies high confidence that $P\neq Q$. I'm also curious if there is any way to compare how sample efficient it is to test $f(P)\neq f(Q)$ for oracles that map to $\{0,1\}$ vs oracles that map onto the reals.

Apologies if this is a very general question. Hopefully if no easy answer can be given someone can point me towards some resources around analyzing statistical tests on high dimensional distributions. Thanks

Hypothesis Testing on Derived Distributions

0 Answers0