I am going to try to explain what is going on, and maybe you can decide for yourself whether the sample sizes need to be the same.
A (simple) hypothesis testing problem has roughly three ingredients:
- A null hypothesis $H_0$, which is a distribution for the data, say $F_0$.
- An alternative $H_1$, which is another distribution for the data, say $F_1$.
- A test (a procedure) to tell the two apart based on the data. This often takes the form of computing a test statistic from the data and comparing it with a threshold.
Assume that we just have a single observation (a single data point) $X$, which will have distribution $F_0$ under the null and $F_1$ under the alternative. Suppose our test is to reject the null in favor of the alternative if $X > \tau$ for a fixed threshold $\tau$.
The performance of the test at the population level can be determined by two numbers,
- $\alpha(\tau) := \mathbb P_0(X > \tau)$ called the probability of Type I error, also known as the false positive rate (FPR),
- $1-\beta(\tau) := \mathbb P_1(X > \tau)$, called the detection probability (or power), i.e. one minus the probability of Type II error, also known as the true positive rate (TPR).
A plot of $\mathbb P_1(X > \tau)$ versus $\mathbb P_0(X > \tau)$ is called the ROC curve and tells you everything you want to know about the performance of the test at the population level.
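For concreteness (an illustrative choice of mine, not something given in the question), suppose $F_0 = N(0,1)$ and $F_1 = N(\mu,1)$ with $\mu > 0$. Then, writing $\Phi$ for the standard normal CDF,
$$
\alpha(\tau) = 1 - \Phi(\tau), \qquad 1 - \beta(\tau) = 1 - \Phi(\tau - \mu),
$$
and the ROC curve is traced out by the points $\big(1 - \Phi(\tau),\, 1 - \Phi(\tau - \mu)\big)$ as $\tau$ ranges over $\mathbb R$.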
How can we measure these empirically? We can sample $Z_1,\dots,Z_n \sim F_0$ and $Y_1,\dots,Y_p \sim F_1$ and then form
$$
\mathbb P_0(X > \tau) \approx \frac1n \sum_{i=1}^n 1\{Z_i > \tau\}, \quad
\mathbb P_1(X > \tau) \approx \frac1p \sum_{j=1}^p 1\{Y_j > \tau\}
$$
You can vary $\tau$ over the entire real line and then make an approximate plot of $\mathbb P_1(X > \tau)$ versus $\mathbb P_0(X > \tau)$. A bit of thought also shows that you really don't need to compute these for all values of $\tau$: both sums can only change at observed data points, so it is enough to sort the pooled values of $\{Z_i\}$ and $\{Y_j\}$, say $t_{(1)} \le t_{(2)} \le \dots \le t_{(n+p)}$, and evaluate the two sums only at these breakpoints.
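Here is a minimal sketch of that empirical construction in Python/NumPy; the distributions, $n$, and $p$ below are arbitrary choices for illustration, not something from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: F0 = N(0, 1), F1 = N(2, 1).
n, p = 1000, 1000
Z = rng.normal(0.0, 1.0, size=n)   # Z_1, ..., Z_n ~ F0
Y = rng.normal(2.0, 1.0, size=p)   # Y_1, ..., Y_p ~ F1

# Evaluate the two empirical sums only at the pooled breakpoints.
taus = np.sort(np.concatenate([Z, Y]))
fpr = np.array([np.mean(Z > t) for t in taus])   # approximates P0(X > tau)
tpr = np.array([np.mean(Y > t) for t in taus])   # approximates P1(X > tau)
# Plotting tpr against fpr (e.g. plt.plot(fpr, tpr) with matplotlib)
# traces the empirical ROC curve.
```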
But you may say: Didn't you read the question? I want to compute the FDR. What does all this have to do with the FDR? OK, let's try to compute the FDR in our simulation setup using the so-called "definition". We have
- The total number of false positives, FP = $\sum_{i=1}^n 1\{Z_i > \tau\}$.
- The total number of true positives, TP = $\sum_{j=1}^p 1\{Y_j > \tau\}$.
Then, by "definition", we have
\begin{align*}
\text{FDR} = \frac{\text{FP}}{\text{FP} + \text{TP}} &= \frac{\sum_{i=1}^n 1\{Z_i > \tau\}}{\sum_{i=1}^n 1\{Z_i > \tau\} + \sum_{j=1}^p 1\{Y_j > \tau\}} \\
&= \frac{n \cdot \frac1n\sum_{i=1}^n 1\{Z_i > \tau\}}{n \cdot \frac1n\sum_{i=1}^n 1\{Z_i > \tau\} + p \cdot \frac1p\sum_{j=1}^p 1\{Y_j > \tau\}} \\
&\stackrel{\approx}{\to} \frac{n\, \alpha(\tau)}{n\, \alpha(\tau) + p\, (1-\beta(\tau))}
\end{align*}
as $n,p \to \infty$. So you can get any value$^*$ of FDR that you like (!) by picking the ratio of $n$ to $p$. That is, FDR is not a well-defined quantity in this case. ($^*$ any value in $(0,1)$).
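A quick simulation makes the dependence on $n/p$ concrete (again, the distributions and threshold are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 1.0                                # fixed, arbitrary threshold

def empirical_fdp(n, p):
    """FP / (FP + TP) with n null samples and p alternative samples."""
    Z = rng.normal(0.0, 1.0, size=n)     # F0 = N(0, 1), illustrative
    Y = rng.normal(2.0, 1.0, size=p)     # F1 = N(2, 1), illustrative
    FP = np.sum(Z > tau)
    TP = np.sum(Y > tau)
    return FP / (FP + TP)

# Same test, same threshold -- but the "FDR" moves with the ratio n/p.
for n, p in [(10_000, 10_000), (100_000, 10_000), (10_000, 100_000)]:
    print(n, p, round(empirical_fdp(n, p), 3))
```

With these choices $\alpha(1) \approx 0.159$ and $1-\beta(1) \approx 0.841$, so the three printed values should land near $0.16$, $0.65$, and $0.02$, in line with the limit $n\alpha/(n\alpha + p(1-\beta))$ above.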
TL;DR FDR doesn't really make much sense when testing a single hypothesis $H_0$ versus $H_1$. It makes sense if you are testing many hypotheses, say, you are testing $H_{0,i}$ versus $H_{1,i}$ for $i=1,\dots,m$. Then there is a well-defined FDR at the population level.
Caveat: In other words, if you are a frequentist, for a single hypothesis test there is no well-defined FDR at the population level (or true FDR, if you will) that you can estimate using simulations. Now, if you are a Bayesian, there is a well-defined version: you know the ratio $n/p$ asymptotically. That is, if you are willing to accept some prior probabilities of observing a positive or a negative sample, then there is a well-defined FDR.
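To make that explicit: if $\pi_0 = \mathbb P(H = 0)$ and $\pi_1 = 1 - \pi_0$ denote those prior probabilities, then the population-level FDR at threshold $\tau$ is
$$
\text{FDR}(\tau) = \mathbb P(H = 0 \mid X > \tau) = \frac{\pi_0\, \alpha(\tau)}{\pi_0\, \alpha(\tau) + \pi_1\, (1-\beta(\tau))},
$$
which is exactly the limit in the display above with $n/(n+p) \to \pi_0$.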
EDIT: Here are some more thoughts upon further clarification by the OP.
The original idea of Benjamini–Hochberg was to devise a procedure such that, among a fixed collection of many hypotheses (say $m$ pairs of genes), you are guaranteed that on average no more than 5% of your discoveries are false. This "in general$^\dagger$" cannot be achieved by a fixed threshold for all $m$. If you look at the BH procedure, it involves $m$, the total number of hypotheses being tested. ($^\dagger$ It can be done, for example, if all the samples are coming from just two distributions and you know the eventual proportions of positive and negative samples. See the caveat.)
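For reference, here is a minimal sketch of the textbook BH step-up rule (not anyone's specific code), which makes the dependence on $m$ explicit:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of rejections under the BH step-up rule at level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                  # indices sorting p-values ascending
    ranked = pvals[order]
    # Largest k with p_(k) <= (k/m) * q; note the explicit dependence on m.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # 0-based index of that largest k
        reject[order[: k + 1]] = True          # reject the k+1 smallest p-values
    return reject
```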
If you want to fix a threshold and keep applying your test to many pairs (assuming that they are all coming from one of two simple hypotheses, which is quite a strong assumption), you will have an FDR, per the caveat above, which will eventually depend on the true proportions of positives and negatives in the population (that is, $\mathbb P(H = 0)$ and $\mathbb P(H=1)$), as well as on the Type I ($\alpha$) and Type II ($\beta$) errors of the test.
This should eventually answer the original question of what sample sizes are needed. Per that caveat, if you are willing to assume that all the pairs of correlations among genes come from just two distributions, then to control the FDR your ratio of samples from the two distributions should match the true ratio of positive to negative examples in reality (which perhaps you don't know).
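To close the loop, here is a small simulation of that "many pairs, two distributions" setting; $\pi_1$, the distributions, and the threshold are all made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)

m   = 200_000   # number of hypotheses (e.g. gene pairs), illustrative
pi1 = 0.1       # assumed true proportion of non-nulls, illustrative
tau = 2.0       # fixed threshold, illustrative
mu  = 3.0       # signal strength under H1, illustrative

H = rng.random(m) < pi1                       # latent truth: True = non-null
X = rng.normal(np.where(H, mu, 0.0), 1.0)     # N(0,1) under H0, N(mu,1) under H1

reject = X > tau
FP = np.sum(reject & ~H)
TP = np.sum(reject & H)
print("realized FDP:", FP / (FP + TP))
# Compare with pi0*alpha(tau) / (pi0*alpha(tau) + pi1*(1 - beta(tau))).
```

With these (made-up) numbers, $\alpha(2) \approx 0.023$ and $1-\beta(2) \approx 0.841$, so the realized false discovery proportion should settle near $0.9 \cdot 0.023 / (0.9 \cdot 0.023 + 0.1 \cdot 0.841) \approx 0.20$: a fixed threshold does not give you 5% FDR for free; the FDR is dictated by $\pi_0$, $\pi_1$, $\alpha$, and $\beta$.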