
I know there are plenty of previous Cross Validated posts about whether one must include the main-effect term when the interaction term is included. The general consensus is that you should, unless you have a very good reason not to. I wonder whether this also holds in the context of subgroup analysis.

Suppose I have a randomized trial with half randomized to control ($T=0$) and half to treatment ($T=1$). I believe a moderator $M \in \{0, 1\}$ has an effect only within the treated arm ($T=1$) and not within the control arm ($T=0$), and I have substantial prior evidence supporting this claim. I propose the model

\begin{align*} Y = \beta_0 + \beta_1T + \beta_2TM \end{align*}

We see that
\begin{align*}
[\text{Treatment effect within } M=0] &= \mathbb{E}[Y|T=1,M=0] - \mathbb{E}[Y|T=0,M=0] \\ &= \beta_1 \\
[\text{Treatment effect within } M=1] &= \mathbb{E}[Y|T=1,M=1] - \mathbb{E}[Y|T=0,M=1] \\ &= \beta_1 + \beta_2 \\
[\text{Moderator effect within } T=0] &= \mathbb{E}[Y|T=0,M=1] - \mathbb{E}[Y|T=0,M=0] \\ &= 0 \\
[\text{Moderator effect within } T=1] &= \mathbb{E}[Y|T=1,M=1] - \mathbb{E}[Y|T=1,M=0] \\ &= \beta_2
\end{align*}
To me, it seems like the model captures my aforementioned ideas well. Is there anything wrong with this in the context of subgroup analysis?
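For concreteness, a minimal sketch of how this model could be fit with statsmodels' formula interface and how the two subgroup treatment effects could be read off (the data frame `df` and its columns `Y`, `T`, `M` are hypothetical here):

import statsmodels.formula.api as smf

# df: hypothetical data frame with columns Y, T, M
# Proposed model: treatment effect and treatment-by-moderator interaction, no main effect for M
fit = smf.ols('Y ~ T + T:M', data=df).fit()

te_m0 = fit.params['T']                      # treatment effect within M = 0 (beta_1)
te_m1 = fit.params['T'] + fit.params['T:M']  # treatment effect within M = 1 (beta_1 + beta_2)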

Tom Chen
  • It seems odd that a moderator could only exist in one arm of an RCT. By definition, such factors should be similar in both treatment groups. – Todd D Sep 15 '21 at 01:27

1 Answer


There isn't anything inherently wrong with your model. However, it makes a fairly strong assumption about the relationship between the variables. Consider the alternative model $$Y = \beta_0 + \beta_1 T + \beta_2 T M + \beta_3 M$$ As you wrote out, your model assumes that $\beta_3 = 0$. As long as that is true, your model will give you unbiased results. However, if $\beta_3 \ne 0$, your model will produce misleading results because you are imposing a restriction that the data do not satisfy. In short, your model cannot capture the true data-generating mechanism.

The advantage of the model that estimates $\beta_3$ is that we don't need to assume $\beta_3 = 0$: it provides an unbiased answer whether or not $\beta_3$ equals zero.
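Working through the same contrasts as in the question, but under this alternative model, gives
\begin{align*}
[\text{Treatment effect within } M=0] &= \beta_1 \\
[\text{Treatment effect within } M=1] &= \beta_1 + \beta_2 \\
[\text{Moderator effect within } T=0] &= \beta_3 \\
[\text{Moderator effect within } T=1] &= \beta_2 + \beta_3
\end{align*}
so your proposed model is simply the special case $\beta_3 = 0$.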

Below is a simple simulation study, written in Python 3.5.2, demonstrating this.

Scenario 1: $\beta_3 = 0$

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Setting simulation parameters
n = 1000000
sample_size = 500
runs = 1000

# Generating data
df = pd.DataFrame()
df['M'] = np.random.binomial(n=1, p=0.4, size=n)
df['T'] = np.random.binomial(n=1, p=0.5, size=n)
df['Yt1'] = 120 + 5*1 + df['M']*10*1 + np.random.normal(size=n)  # potential outcome under T=1
df['Yt0'] = 120 + 5*0 + df['M']*10*0 + np.random.normal(size=n)  # potential outcome under T=0
df['Y'] = np.where(df['T'] == 1, df['Yt1'], df['Yt0'])           # observed outcome

# True subgroup treatment effects implied by the data generating mechanism
truth_m0 = 5
truth_m1 = 5 + 10

# Objects to store results
bias_f1_m0 = []
bias_f1_m1 = []
bias_f2_m0 = []
bias_f2_m1 = []

for i in range(runs):
    dfs = df.sample(n=sample_size, replace=False)

    # Assuming beta_3 == 0
    m1 = smf.ols('Y ~ T + T:M', data=dfs).fit()
    bias_f1_m0.append(m1.params['T'] - truth_m0)
    bias_f1_m1.append(m1.params['T'] + m1.params['T:M'] - truth_m1)

    # No assumption for beta_3
    m2 = smf.ols('Y ~ T + T:M + M', data=dfs).fit()
    bias_f2_m0.append(m2.params['T'] - truth_m0)
    bias_f2_m1.append(m2.params['T'] + m2.params['T:M'] - truth_m1)
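
# Average bias across runs for each model and stratum
# (one straightforward way to produce the bias numbers reported below)
print('Excluding M main effect:', np.mean(bias_f1_m0), np.mean(bias_f1_m1))
print('Including M main effect:', np.mean(bias_f2_m0), np.mean(bias_f2_m1))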

For the model excluding $\beta_3$, the bias was 0.005 in both the $M=0$ and $M=1$ strata. For the model including $\beta_3$, the bias was 0.002 in the $M=0$ stratum and 0.009 in the $M=1$ stratum.

Scenario 2: $\beta_3 \ne 0$

For scenario 2, I used the same code but changed the data generating mechanism to include a $\beta_3 M$ term. Below is the new data generating mechanism:

# Generating data with a nonzero beta_3 (M now affects Y in both arms)
df = pd.DataFrame()
df['M'] = np.random.binomial(n=1, p=0.4, size=n)
df['T'] = np.random.binomial(n=1, p=0.5, size=n)
df['Yt1'] = 120 + 5*1 + df['M']*10*1 + df['M']*7 + np.random.normal(size=n)  # potential outcome under T=1
df['Yt0'] = 120 + 5*0 + df['M']*10*0 + df['M']*7 + np.random.normal(size=n)  # potential outcome under T=0
df['Y'] = np.where(df['T'] == 1, df['Yt1'], df['Yt0'])                       # observed outcome
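
Note that the true subgroup treatment effects are unchanged (5 for $M=0$ and 15 for $M=1$), because the $\beta_3 M$ term enters both potential outcomes and cancels out of the contrast. As a quick sanity check, they can be recovered directly from the simulated potential outcomes:

# True subgroup effects computed from the simulated potential outcomes
print((df['Yt1'] - df['Yt0'])[df['M'] == 0].mean())  # approximately 5
print((df['Yt1'] - df['Yt0'])[df['M'] == 1].mean())  # approximately 15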

For the model excluding $\beta_3$, the bias was -2.798 in the $M=0$ stratum and 4.198 in the $M=1$ stratum. The model including $\beta_3$ was essentially unbiased ($M=0$: 0.001, $M=1$: -0.008).

As these two scenarios demonstrate, excluding $\beta_3 M$ from your model is fine as long as the assumption that $\beta_3 = 0$ is true. If it is not, your model can give biased results.

pzivich