I am going to try to explain what is going on, and maybe you can decide for yourself whether the sample sizes need to be the same.
A (simple) hypothesis testing problem has roughly three ingredients:
- A null hypothesis $H_0$, which is a distribution for the data, say $F_0$.
- An alternative $H_1$, which is another distribution for the data, say $F_1$.
- A test (a procedure) to tell the two apart based on the data. This often takes the form of computing a test statistic from the data and comparing it with a threshold.
Assume that we just have a single observation (a single data point) $X$, which will have distribution $F_0$ under the null and $F_1$ under the alternative. Suppose our test is to reject the null in favor of the alternative if $X > \tau$ for a fixed threshold $\tau$.
The performance of the test at the population level can be determined by two numbers,
- $\alpha(\tau) := \mathbb P_0(X > \tau)$ called the probability of Type I error, also known as the false positive rate (FPR),
- $1-\beta(\tau) := \mathbb P_1(X > \tau)$, called the detection probability (or power), i.e. one minus the probability of Type II error, also known as the true positive rate (TPR).
A plot of $\mathbb P_1(X > \tau)$ versus $\mathbb P_0(X > \tau)$ is called the ROC curve and tells you everything you want to know about the performance of the test at the population level.
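For concreteness (an illustrative choice of mine, not something given in the question), suppose $F_0 = N(0,1)$ and $F_1 = N(\mu,1)$ with $\mu > 0$. Then, writing $\Phi$ for the standard normal CDF,
$$
\alpha(\tau) = 1 - \Phi(\tau), \qquad 1 - \beta(\tau) = 1 - \Phi(\tau - \mu),
$$
and the ROC curve is traced out by the points $\big(1 - \Phi(\tau),\, 1 - \Phi(\tau - \mu)\big)$ as $\tau$ ranges over $\mathbb R$.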
How can we measure these empirically? We can sample $Z_1,\dots,Z_n \sim F_0$ and $Y_1,\dots,Y_p \sim F_1$ and then form
$$
\mathbb P_0(X > \tau) \approx \frac1n \sum_{i=1}^n 1\{Z_i > \tau\}, \quad
\mathbb P_1(X > \tau) \approx \frac1p \sum_{j=1}^p 1\{Y_j > \tau\}
$$
You can vary $\tau$ over the entire real line and then make an approximate plot of $\mathbb P_1(X > \tau)$ versus $\mathbb P_0(X > \tau)$. A bit of thought also shows that you really don't need to compute these for all values of $\tau$: both sums can only change at observed data points, so it is enough to sort the pooled values of $\{Z_i\}$ and $\{Y_j\}$, say $t_{(1)} \le t_{(2)} \le \dots \le t_{(n+p)}$, and evaluate the two sums only at these breakpoints.
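Here is a minimal sketch of that empirical construction in Python/NumPy; the distributions, $n$, and $p$ below are arbitrary choices for illustration, not something from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: F0 = N(0, 1), F1 = N(2, 1).
n, p = 1000, 1000
Z = rng.normal(0.0, 1.0, size=n)   # Z_1, ..., Z_n ~ F0
Y = rng.normal(2.0, 1.0, size=p)   # Y_1, ..., Y_p ~ F1

# Evaluate the two empirical sums only at the pooled breakpoints.
taus = np.sort(np.concatenate([Z, Y]))
fpr = np.array([np.mean(Z > t) for t in taus])   # approximates P0(X > tau)
tpr = np.array([np.mean(Y > t) for t in taus])   # approximates P1(X > tau)
# Plotting tpr against fpr (e.g. plt.plot(fpr, tpr) with matplotlib)
# traces the empirical ROC curve.
```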
But you may say: Didn't you read the question? I want to compute the FDR. What does all this have to do with the FDR? OK, let's try to compute the FDR in our simulation setup using the so-called "definition". We have
- The total number of false positives, FP = $\sum_{i=1}^n 1\{Z_i > \tau\}$.
- The total number of true positives, TP = $\sum_{j=1}^p 1\{Y_j > \tau\}$.
Then, by "definition", we have
\begin{align*}
\text{FDR} = \frac{\text{FP}}{\text{FP} + \text{TP}} &= \frac{\sum_{i=1}^n 1\{Z_i > \tau\}}{\sum_{i=1}^n 1\{Z_i > \tau\} + \sum_{j=1}^p 1\{Y_j > \tau\}} \\
&= \frac{n \cdot \frac1n\sum_{i=1}^n 1\{Z_i > \tau\}}{n \cdot \frac1n\sum_{i=1}^n 1\{Z_i > \tau\} + p \cdot \frac1p\sum_{j=1}^p 1\{Y_j > \tau\}} \\
&\stackrel{\approx}{\to} \frac{n\, \alpha(\tau)}{n\, \alpha(\tau) + p\, (1-\beta(\tau))}
\end{align*}
as $n,p \to \infty$. So you can get any value$^*$ of FDR that you like (!) by picking the ratio of $n$ to $p$. That is, FDR is not a well-defined quantity in this case. ($^*$ any value in $(0,1)$).
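A quick simulation makes the dependence on $n/p$ concrete (again, the distributions and threshold are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 1.0                                # fixed, arbitrary threshold

def empirical_fdp(n, p):
    """FP / (FP + TP) with n null samples and p alternative samples."""
    Z = rng.normal(0.0, 1.0, size=n)     # F0 = N(0, 1), illustrative
    Y = rng.normal(2.0, 1.0, size=p)     # F1 = N(2, 1), illustrative
    FP = np.sum(Z > tau)
    TP = np.sum(Y > tau)
    return FP / (FP + TP)

# Same test, same threshold -- but the "FDR" moves with the ratio n/p.
for n, p in [(10_000, 10_000), (100_000, 10_000), (10_000, 100_000)]:
    print(n, p, round(empirical_fdp(n, p), 3))
```

With these choices $\alpha(1) \approx 0.159$ and $1-\beta(1) \approx 0.841$, so the three printed values should land near $0.16$, $0.65$, and $0.02$, in line with the limit $n\alpha/(n\alpha + p(1-\beta))$ above.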
TL;DR FDR doesn't really make much sense when testing a single hypothesis $H_0$ versus $H_1$. It makes sense if you are testing many hypotheses, say, you are testing $H_{0,i}$ versus $H_{1,i}$ for $i=1,\dots,m$. Then there is a well-defined FDR at the population level.
Caveat: In other words, if you are a frequentist, for a single hypothesis test there is no well-defined FDR at the population level (or true FDR, if you will) that you can estimate using simulations. Now, if you are a Bayesian, there is a well-defined version: you know the ratio $n/p$ asymptotically. That is, if you are willing to accept some prior probabilities of observing a positive or a negative sample, then there is a well-defined FDR.
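To make that explicit: if $\pi_0 = \mathbb P(H = 0)$ and $\pi_1 = 1 - \pi_0$ denote those prior probabilities, then the population-level FDR at threshold $\tau$ is
$$
\text{FDR}(\tau) = \mathbb P(H = 0 \mid X > \tau) = \frac{\pi_0\, \alpha(\tau)}{\pi_0\, \alpha(\tau) + \pi_1\, (1-\beta(\tau))},
$$
which is exactly the limit in the display above with $n/(n+p) \to \pi_0$.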
EDIT: Here are some more thoughts upon further clarification by the OP.
The original idea of Benjamini–Hochberg was to devise a procedure such that, among a fixed collection of many hypotheses (say $m$ pairs of genes), you are guaranteed that on average no more than 5% of your discoveries are false. This "in general$^\dagger$" cannot be achieved by a fixed threshold for all $m$. If you look at the BH procedure, it involves $m$, the total number of hypotheses being tested. ($^\dagger$ It can be done, for example, if all the samples are coming from just two distributions and you know the eventual proportions of positive and negative samples. See the caveat.)
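For reference, here is a minimal sketch of the textbook BH step-up rule (not anyone's specific code), which makes the dependence on $m$ explicit:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of rejections under the BH step-up rule at level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                  # indices sorting p-values ascending
    ranked = pvals[order]
    # Largest k with p_(k) <= (k/m) * q; note the explicit dependence on m.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # 0-based index of that largest k
        reject[order[: k + 1]] = True          # reject the k+1 smallest p-values
    return reject
```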
If you want to fix a threshold and keep applying your test to many pairs (assuming that they are all coming from one of two simple hypotheses, which is quite a strong assumption), you will have an FDR, per the caveat above, which will eventually depend on the true proportions of positives and negatives in the population (that is, $\mathbb P(H = 0)$ and $\mathbb P(H=1)$), as well as on the Type I ($\alpha$) and Type II ($\beta$) errors of the test.
This should eventually answer the original question of what sample sizes are needed. Per that caveat, if you are willing to assume that all the pairs of correlations among genes come from just two distributions, then to control the FDR your ratio of samples from the two distributions should match the true ratio of positive to negative examples in reality (which perhaps you don't know).
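To close the loop, here is a small simulation of that "many pairs, two distributions" setting; $\pi_1$, the distributions, and the threshold are all made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)

m   = 200_000   # number of hypotheses (e.g. gene pairs), illustrative
pi1 = 0.1       # assumed true proportion of non-nulls, illustrative
tau = 2.0       # fixed threshold, illustrative
mu  = 3.0       # signal strength under H1, illustrative

H = rng.random(m) < pi1                       # latent truth: True = non-null
X = rng.normal(np.where(H, mu, 0.0), 1.0)     # N(0,1) under H0, N(mu,1) under H1

reject = X > tau
FP = np.sum(reject & ~H)
TP = np.sum(reject & H)
print("realized FDP:", FP / (FP + TP))
# Compare with pi0*alpha(tau) / (pi0*alpha(tau) + pi1*(1 - beta(tau))).
```

With these (made-up) numbers, $\alpha(2) \approx 0.023$ and $1-\beta(2) \approx 0.841$, so the realized false discovery proportion should settle near $0.9 \cdot 0.023 / (0.9 \cdot 0.023 + 0.1 \cdot 0.841) \approx 0.20$: a fixed threshold does not give you 5% FDR for free; the FDR is dictated by $\pi_0$, $\pi_1$, $\alpha$, and $\beta$.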