I am working in a setup with two binary independent variables. One is experimental, $T$: treated vs. not treated. The other is a feature $F$ that I expect to influence how strongly the treatment affects the outcome. The outcome $S$ is a Bernoulli random variable with a very low success probability $p$ (on the order of 0.01% to 1.0%). I do, however, have at least hundreds of thousands of trials and hundreds to thousands of successes in each of the four subpopulations.
The goal is to estimate the effect of treatment $T$ and determine whether that effect differs depending on $F$. More precisely, I want to compute the lift in the outcome caused by $T$ for each value of $F$:
\begin{align} \newcommand{\lift}{{\rm lift}} \lift_0 &= \frac{P(S|T=1,F=0)-P(S|T=0,F=0)}{P(S|T=0,F=0)} \\[10pt] \lift_1 &= \frac{P(S|T=1,F=1)-P(S|T=0,F=1)}{P(S|T=0,F=1)} \end{align}
Based on this question, I can test whether each lift is significantly different from zero. But I'd like to take this a step further and determine whether the two lifts are significantly different from each other. How can I think about this problem? There seems to be some connection to difference-in-differences, but I'm computing a relative lift (not an absolute difference) and my two control groups are not necessarily similar, so I'm not sure how well that applies.
A concrete example may help the discussion, so here are some numbers to work from:
+---------+-----------+---------+-----------+--------+
| Feature | Treatment | Trials  | Successes | p      |
+---------+-----------+---------+-----------+--------+
| No      | No        | 4169157 | 1064      | 0.026% |
| No      | Yes       | 2892839 | 794       | 0.027% |
| Yes     | No        | 577625  | 951       | 0.165% |
| Yes     | Yes       | 823158  | 2260      | 0.275% |
+---------+-----------+---------+-----------+--------+
The two lifts are therefore $\lift_0 \approx 7.5\%$ and $\lift_1 \approx 67\%$. At what significance level are these different?
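To make the numbers reproducible, here is a minimal Python sketch that recomputes the two lifts from the table. Since each lift equals the relative risk minus one, it also shows one possible framing of the comparison: a Wald-style z statistic on the difference of the log relative risks, using the usual delta-method variance. That framing is my own assumption, not an established answer to the question, and the names `counts`, `rate`, `lift`, and `log_rr_variance` are just illustrative.

```python
import math

# (feature, treatment) -> (trials, successes), copied from the table above
counts = {
    (0, 0): (4169157, 1064),
    (0, 1): (2892839, 794),
    (1, 0): (577625, 951),
    (1, 1): (823158, 2260),
}

def rate(f, t):
    """Observed success rate for feature f and treatment t."""
    trials, successes = counts[(f, t)]
    return successes / trials

def lift(f):
    """Relative lift of treated over untreated, i.e. relative risk minus 1."""
    return rate(f, 1) / rate(f, 0) - 1.0

def log_rr_variance(f):
    """Delta-method variance of the log relative risk: sum of (1/x - 1/n) over both arms."""
    return sum(1.0 / s - 1.0 / n for n, s in (counts[(f, 0)], counts[(f, 1)]))

lift_0, lift_1 = lift(0), lift(1)

# ASSUMPTION: compare the lifts on the log relative-risk scale,
# treating the four groups as independent binomial samples.
log_ratio = math.log1p(lift_1) - math.log1p(lift_0)
se = math.sqrt(log_rr_variance(0) + log_rr_variance(1))
z = log_ratio / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

print(f"lift_0 = {lift_0:.4f}, lift_1 = {lift_1:.4f}")
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3g}")
```

Whether the log scale is the right one for this comparison, and whether the normal approximation is adequate with rates this low, is part of what I am unsure about.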