Generate Variables from Causal Structure for Simulation with Binary/Indicators

Question

I would like to generate a data set of variables from a specific causal structure (a stylized world) for simulation, similar to this answer, but most of the key variables are binary indicators.

Specifically, my example aims to capture the relationships among surgery, clinical variables, and patient race in the following structural model:

I will then use the created data to demonstrate confounding applied to this setting. However, each variable, except 'appropriate', is an indicator variable, i.e. 1 if true and 0 if false, which makes it tricky to set up the system of equations as simple linear combinations.

How should I create these equations?

To show an attempt in R, I tried to create the structural equations for this stylized world as such:

reps <- 1e3  # number of simulated observations (same value as used in the answer)
Black <- rbinom(reps, 1, .5)
U <- rnorm(reps)
Appropriate <- (1/(1+exp(-U)))
ST_Elev <- as.numeric(2*Black + 3*Appropriate + rnorm(reps) > 2)
LowSES <- as.numeric(3*Black+rnorm(reps)>2.5)
Surgery <- as.numeric(5*Appropriate+LowSES+2*ST_Elev+rnorm(reps) > 3.5)

world <- data.frame(Black ,
                    ST_Elev,
                    LowSES,
                    Appropriate,
                    Surgery)

by just creating indicators with the right proportions, based on a score being greater than some cutoff. But this formulation doesn't fit well with my (limited) understanding of structural causal models, and I can't figure out how to control the relationships as well directly.

Carlos Cinelli · Accepted Answer · 2018-12-26T00:36:45.897

How should I create these equations?

There is no single right way to simulate the equations: any parametric form you choose for the structural equations is valid, as long as it satisfies the assumptions encoded in the DAG, namely the exclusion restrictions (what is a direct cause of what) and the independence restrictions (which error terms are independent). In fact, your approach is fine: you are simulating the structural equations using the latent-variable representation of a probit model.
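To make the probit connection concrete, here is a minimal sketch using the `LowSES` equation from the question. The seed, the sample size `n`, and the names `LowSES_latent`, `LowSES_probit`, and `p_theory` are illustrative assumptions, not part of the original code. Thresholding `3*Black + rnorm(n)` at 2.5 should give the same distribution as drawing a Bernoulli variable with probability `pnorm(3*Black - 2.5)`:

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 1e5      # illustrative sample size, large enough to keep sampling noise small

Black <- rbinom(n, 1, .5)

# Latent-variable form, as in the question: indicator that a noisy score exceeds a cutoff
LowSES_latent <- as.numeric(3*Black + rnorm(n) > 2.5)

# Equivalent probit form: P(3*Black + e > 2.5) = pnorm(3*Black - 2.5) for e ~ N(0, 1)
LowSES_probit <- rbinom(n, 1, pnorm(3*Black - 2.5))

# Implied P(LowSES = 1 | Black = 0) and P(LowSES = 1 | Black = 1)
p_theory <- pnorm(c(-2.5, 0.5))

# Empirical frequencies by group under each formulation
p_latent <- as.numeric(tapply(LowSES_latent, Black, mean))
p_probit <- as.numeric(tapply(LowSES_probit, Black, mean))

round(rbind(p_theory, p_latent, p_probit), 3)
```

The implied probabilities are roughly 0.006 and 0.69, so the coefficients and cutoffs in the latent form control the conditional probabilities only indirectly, through `pnorm`. To hit a target probability, you could work backwards with `qnorm`.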

and I can't figure out how to control the relationships as well directly

Since all variables are binary, you can also easily simulate a linear probability model or a logit. Here is an example using a linear probability model, which may make it easier to control exactly the conditional probabilities you want in the simulation:

reps <- 1e3
Black <- rbinom(reps, 1, .5)
Appropriate <- rbinom(reps, 1, .5)
ST_Elev <- rbinom(reps, 1, .1 + .1*Black + .2*Appropriate)
LowSES <- rbinom(reps, 1, .15+ .3*Black)
Surgery <- rbinom(reps, 1, .2*Appropriate + .1*LowSES + .2*ST_Elev)
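One way to check that the simulation behaves as designed is to compare fitted coefficients with the ones plugged in. The sketch below re-runs the same equations with a larger sample (the seed, `n`, and the `fit_*` names are assumptions chosen for illustration) and contrasts the unadjusted and adjusted association between `Black` and `Surgery`. In this parameterization `Black` has no direct arrow into `Surgery`, so any marginal association runs through `LowSES` and `ST_Elev`:

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 1e5      # larger than reps above, to reduce sampling noise

Black <- rbinom(n, 1, .5)
Appropriate <- rbinom(n, 1, .5)
ST_Elev <- rbinom(n, 1, .1 + .1*Black + .2*Appropriate)
LowSES <- rbinom(n, 1, .15 + .3*Black)
Surgery <- rbinom(n, 1, .2*Appropriate + .1*LowSES + .2*ST_Elev)

world <- data.frame(Black, ST_Elev, LowSES, Appropriate, Surgery)

# Structural equation for Surgery: OLS on its parents should roughly
# recover intercept 0 and slopes .2, .1, .2
fit_struct <- lm(Surgery ~ Appropriate + LowSES + ST_Elev, data = world)

# Unadjusted association: expected slope is .1*.3 + .2*.1 = .05
fit_marginal <- lm(Surgery ~ Black, data = world)

# Conditioning on the parents of Surgery: the Black coefficient should be near 0
fit_adjusted <- lm(Surgery ~ Black + LowSES + ST_Elev + Appropriate, data = world)

round(coef(fit_struct), 3)
round(coef(fit_marginal), 3)
round(coef(fit_adjusted), 3)
```

Because each probability is linear in the parents, the coefficients are direct differences in conditional probabilities; the only constraint is that every linear combination stays within [0, 1]. For a logit version, the same structure should work with the probability argument wrapped in `plogis()`, at the cost of coefficients being on the log-odds scale.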
