Discovering causal effects using OLS: do the treated and untreated groups need to be similar on all observed variables?

Question

I have one dummy variable, $D$, which equals 1 if the subject received treatment and $0$ otherwise.

My outcome of interest is $Y$. For example, $D$ tells me whether the subject took the drug or a placebo and $Y$ is a continuous variable measuring pain. I want to discover whether taking the drug reduces pain.

I have other variables that measure some features of the subjects, let's call them $X_1$ and $X_2$. For example, $X_1$ is the age of the subject and $X_2$ is the amount of physical activity the subject does each day.

Using t-tests, I find that the mean of $X_1$ differs between the treated and untreated groups, and so does the mean of $X_2$.

So I cannot use a naive estimator, and I understand that. It may be that group 1 experiences less pain because the subjects in that group are younger, not because of my drug.

But if I write: $$Y = \beta_0 + \beta_1 D + \beta_2 X_1 + \beta_3 X_2 + \varepsilon$$ and estimate it by OLS, will $\beta_1$ be the effect I am looking for?

Is this model correctly specified?

Yes, $X_1$ and $X_2$ differ between the two groups: the groups are not the same (so there is no randomization). But I'm controlling for the difference. I put the variables in the model, so I am accounting for the difference between the two groups.

Would this model work?

score 3 · Accepted Answer · answered Apr 30 '21 at 10:52

No, they do not need to be similar, provided you control for those variables, as you did. That is the whole point of including control variables alongside the dummy that you are interested in.
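As a minimal sketch of this point, the simulation below compares the naive difference in means with the OLS coefficient on $D$. It assumes a made-up data-generating process in which age and activity are the only variables driving both treatment and pain, and in which their effects are exactly linear; every coefficient, the sample size, and the variable names are illustrative assumptions, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical covariates (all numbers below are made up for illustration).
age = rng.normal(50, 10, n)        # X1, years
activity = rng.normal(5, 2, n)     # X2, an activity score

# Non-random assignment: older, less active subjects are more likely treated.
logit = 0.08 * (age - 50) - 0.4 * (activity - 5)
d = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

# The outcome is linear in D, X1 and X2, so the regression from the question
# is correctly specified by construction.
true_effect = -2.0
pain = 10 + true_effect * d + 0.1 * age - 0.5 * activity + rng.normal(0, 1, n)

# Naive estimator: raw difference in mean pain between the two groups.
naive = pain[d == 1].mean() - pain[d == 0].mean()

# OLS of Y on an intercept, D, X1 and X2; beta[1] is the coefficient on D.
X = np.column_stack([np.ones(n), d, age, activity])
beta, *_ = np.linalg.lstsq(X, pain, rcond=None)
adjusted = beta[1]

print(f"true {true_effect:.2f}  naive {naive:.2f}  adjusted {adjusted:.2f}")
```

In this setup the naive estimate is pulled toward zero because the treated group is older and less active, while the adjusted coefficient lands close to the simulated effect. That only holds because the simulation builds in the assumptions listed above.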

Charge

Thank you. So what's the point of Propensity Score Matching? – robertspierre Apr 30 '21 at 10:52
That's another story. What I'm saying is that if you want to analyze the effect of a treatment by comparing two samples, where one received the treatment and the other didn't, you don't need the two samples to have the same characteristics. Of course, there are other assumptions that you need to evaluate. – Charge Apr 30 '21 at 10:59
OK, I'll open another question about Propensity Score Matching. Thanks – robertspierre Apr 30 '21 at 11:03

score 2 · Answer 2 · answered Apr 30 '21 at 12:31

You used "causal effects" in the title but did not provide any background information that would make us believe that you are using a causal design. If you did not randomize the exposures, and you have no believable causal diagram, the best you can do is to estimate an association that has been adjusted for known factors. You have also postulated a regression model that is unlikely to fit the data, because you are assuming that the effects of continuous covariates are linear. And be sure to pre-specify the model. This is not a place for stepwise analysis; otherwise the inference is badly distorted.
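To make the linearity concern concrete, here is a hedged sketch with simulated data. It assumes pain is U-shaped in age and that treatment is given mostly to patients above roughly 60; both are invented for illustration. The quadratic term is used only because it matches the simulated curve. With real data the shape is unknown, and a flexible, pre-specified form such as a spline would be the more usual choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

age = rng.normal(50, 10, n)
z = (age - 50) / 10                      # standardized age

# Hypothetical selection: treatment goes mostly to patients over about 60.
d = (rng.random(n) < 1 / (1 + np.exp(-3 * (z - 1)))).astype(float)

# Hypothetical outcome: pain is U-shaped in age, not linear.
true_effect = -2.0
pain = 10 + true_effect * d + 2 * z**2 + rng.normal(0, 1, n)

def coef_on_d(*covariates):
    """OLS coefficient on D, adjusting for the given covariate columns."""
    X = np.column_stack([np.ones(n), d, *covariates])
    beta, *_ = np.linalg.lstsq(X, pain, rcond=None)
    return beta[1]

linear_adj = coef_on_d(z)             # assumes a linear age effect
quadratic_adj = coef_on_d(z, z**2)    # matches the simulated curvature

print(f"true {true_effect:.2f}  linear {linear_adj:.2f}  quadratic {quadratic_adj:.2f}")
```

Under these assumptions, adjusting for age linearly leaves a large bias in the coefficient on $D$ even though age is "controlled for", while the correctly shaped adjustment recovers the simulated effect.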

Frank Harrell

The design could be this: the doctor sends patients with pain to us. We administer the drug. Afterwards the patients go back to the doctor and have their pain level measured. All patients go back to have their pain measured, but not all patients come to us for the drug. The ones that come to us are older ($X_1$) and less physically active ($X_2$). Neither the young patients nor the physically active ones get the drug. If $D$ is a dummy which equals 1 if the patient was given the drug, and $Y$ is the pain level, will an OLS of $Y=D+X_1+X_2$ uncover the causal effect of the drug on the pain level? – robertspierre Apr 30 '21 at 12:36
That has nothing to do with uncovering causal effects except under the highly restrictive assumptions that (1) physical activity is measured as a continuous variable with exceptionally high accuracy and (2) physical activity and age are the **only** factors used for treatment selection. I suggest some coursework. – Frank Harrell Apr 30 '21 at 12:44
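A rough sketch of why assumption (2) matters: suppose treatment selection also depends on a factor that was never recorded. The `severity` variable and every coefficient below are hypothetical and exist only to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

age = rng.normal(50, 10, n)          # X1, observed
activity = rng.normal(5, 2, n)       # X2, observed
severity = rng.normal(0, 1, n)       # hypothetical factor the analyst never sees

# Selection depends on age, activity AND the unmeasured severity.
logit = 0.08 * (age - 50) - 0.4 * (activity - 5) + 1.0 * severity
d = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

true_effect = -2.0
pain = (10 + true_effect * d + 0.1 * age - 0.5 * activity
        + 1.5 * severity + rng.normal(0, 1, n))

def coef_on_d(*covariates):
    """OLS coefficient on D, adjusting for the given covariate columns."""
    X = np.column_stack([np.ones(n), d, *covariates])
    beta, *_ = np.linalg.lstsq(X, pain, rcond=None)
    return beta[1]

observed_only = coef_on_d(age, activity)         # the regression from the question
oracle = coef_on_d(age, activity, severity)      # infeasible: needs the hidden factor

print(f"true {true_effect:.2f}  observed-only {observed_only:.2f}  oracle {oracle:.2f}")
```

In this simulation, the regression on $D$, $X_1$ and $X_2$ is biased toward zero because sicker patients are both more likely to be treated and in more pain. Only the infeasible regression that includes the hidden factor recovers the simulated effect.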
