
I have $N$ random samples $x_i$ from an unknown probability distribution $A$, and a single random sample $y$ from another unknown distribution $B$. Both distributions can be assumed to be continuous and well-behaved. The null hypothesis $H_0$ is that $A$ and $B$ are the same distribution. I am interested in testing whether $y$ is large enough to conclude that $H_0$ is unlikely. Based on this criterion, I believe the value I want to calculate is $$p = P[X \geq y \mid H_0],$$ namely, the probability of randomly drawing from $A$ a value that is at least as large as $y$.

I have tried:

  • Defining a Bernoulli random variable $z_i = \mathbb{1}[x_i \geq y]$ and estimating $p$ from the resulting binomial distribution. It works fine, but it only takes into account the ranks of $x$ and $y$, not their magnitudes, so maybe it is possible to do better?
  • Upper-bounding $p$ using the sample version of Chebyshev's inequality. Again, it works, but it seems quite conservative ($y$ needs to be roughly $15\sigma$ away from the mean of the $x_i$ to achieve $p \leq 0.01$).
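
To make the first approach concrete, here is a minimal sketch of a rank-based p-value. It is an illustration, not a definitive implementation: the function name `rank_p_value` is hypothetical, and it uses the permutation-style estimate $(1 + \#\{x_i \geq y\})/(N + 1)$, which relies on $y$ and the $x_i$ being exchangeable under $H_0$ and on ties having probability zero (continuous distributions).

```python
def rank_p_value(x, y):
    """Rank-based p-value for a single observation y against samples x.

    Assumes that, under H0, y and the x_i are exchangeable draws from one
    continuous distribution. Counting y itself in both numerator and
    denominator keeps the estimate away from exactly zero.
    """
    n = len(x)
    # Number of reference samples at least as large as y.
    n_ge = sum(1 for xi in x if xi >= y)
    return (1 + n_ge) / (n + 1)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    x = [0.3, -1.2, 0.8, 2.1, -0.4, 1.5, 0.0, -0.7, 0.9]
    print(rank_p_value(x, 1.0))   # two samples (2.1, 1.5) are >= 1.0
    print(rank_p_value(x, 5.0))   # no sample is >= 5.0
```

With this estimate, the smallest attainable value is $1/(N + 1)$, reached when $y$ exceeds every $x_i$, so reaching $p \leq 0.01$ this way requires at least $N = 99$ samples.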

What other options exist? Is it possible to improve the situation by some small additional knowledge about $A$ (e.g. unimodality)?

Aleksejs Fomins
  • Ordinarily a p-value is based on a *statistic,* but your probability expression is not a statistic. Thus, it looks like you need to ask about both a suitable statistic and how to use it to conduct this test. Typically the test is conducted by erecting a [prediction limit](https://stats.stackexchange.com/search?q=%22prediction+interval%22) for $Y$ based on the $X_i.$ – whuber Dec 09 '19 at 17:56
  • Can't I use the r.v. $X$ itself as a test statistic? Also, I think I have found a possible answer to my second question here https://stats.stackexchange.com/questions/82419/does-a-sample-version-of-the-one-sided-chebyshev-inequality-exist – Aleksejs Fomins Dec 09 '19 at 18:23
  • If you only have one observation for $Y$, your sample size is effectively $1$ even if you know the theoretical distribution for $X$. Consequently, the central limit theorem can't be relied on. – jbowman Dec 09 '19 at 18:44
  • @jbowman I don't follow. If I know the theoretical distribution for $X$, I can evaluate how likely it is that a single observation $Y$ came from that distribution, can't I? I only need the CLT to determine the distribution of $X$, for which I have multiple observations. – Aleksejs Fomins Dec 09 '19 at 19:23
  • You never observe the random variable: all you have is a realization. In particular, you cannot ever directly observe any probability associated with a random variable. – whuber Dec 09 '19 at 19:24
  • What does the CLT have to do with the distribution of $X$? $X$ could have a Gamma distribution or a Binomial distribution or whatever, and that will remain true regardless of the sample size... just having a large sample size won't make $X$ Normally distributed! – jbowman Dec 09 '19 at 19:24
  • @jbowman Of course, CLT only affects the distribution of the mean. I should take a nap, I'm not making sense any more. I'll review the question tomorrow – Aleksejs Fomins Dec 09 '19 at 19:29
