I have one dummy variable, $D$, which equals 1 if the subject received treatment and $0$ otherwise.
My outcome of interest is $Y$. For example, $D$ tells me whether the subject took the drug or a placebo and $Y$ is a continuous variable measuring pain. I want to discover whether taking the drug reduces pain.
I have other variables that measure some features of the subjects, let's call them $X_1$ and $X_2$. For example, $X_1$ is the age of the subject and $X_2$ is the amount of physical activity the subject does each day.
Using a t-test, I find that the mean of $X_1$ differs between the treated and untreated groups, and the mean of $X_2$ differs between the two groups as well.
So I understand that I cannot use a naive estimator (the simple difference in mean pain between the groups). It may be that the treated group experiences less pain because the subjects in that group are younger, not because of my drug.
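For concreteness, here is a minimal sketch of the kind of balance check I mean: a Welch two-sample t statistic for one covariate, computed with the Python standard library only. The helper name `welch_t` and the age values are hypothetical and made up purely for illustration; they are not my real data.

```python
import math
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic and approximate degrees of freedom.

    Compares the means of samples a and b without assuming equal variances.
    """
    na, nb = len(a), len(b)
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(a) / na
    se2_b = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Hypothetical ages (X_1), invented for illustration only.
age_treated = [31, 35, 38, 40, 42, 44, 47, 50]
age_untreated = [45, 48, 52, 55, 57, 60, 63, 66]

t_stat, dof = welch_t(age_treated, age_untreated)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large absolute t statistic here would indicate that the two groups differ in mean age, which is the situation I am describing.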
But if I write: $$Y = \beta_0 + \beta_1 D + \beta_2 X_1 + \beta_3 X_2$$ and estimate it by OLS, will $\beta_1$ be the effect I am looking for?
Is this model correctly specified?
Yes, $X_1$ and $X_2$ differ between the two groups, so the groups are not the same (there is no randomization). But I am controlling for that difference: both variables are in the model, so I am accounting for the difference between the two groups.
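To illustrate the reasoning, here is a small simulation sketch using NumPy. Everything in it is an assumption made for illustration: the function name `simulate_and_estimate`, the coefficients, the true effect of $-2$, and the way treatment depends on $X_1$ and $X_2$. Crucially, the simulated outcome really is linear in $D$, $X_1$, and $X_2$, and no other variable affects both treatment and pain, so it only shows what happens when those conditions hold; it says nothing about whether they hold in my data.

```python
import numpy as np

def simulate_and_estimate(n=5000, true_effect=-2.0, seed=0):
    """Simulate non-randomized treatment, then compare two estimators.

    Returns (naive difference in means, OLS coefficient on D).
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(50.0, 10.0, n)  # X_1: age (simulated)
    x2 = rng.normal(1.0, 0.5, n)    # X_2: daily activity score (simulated)

    # Assumed assignment rule: younger, more active subjects are more
    # likely to be treated, so the groups differ in X_1 and X_2.
    index = -(x1 - 50.0) / 10.0 + (x2 - 1.0)
    p_treat = 1.0 / (1.0 + np.exp(-index))
    d = (rng.random(n) < p_treat).astype(float)

    # Assumed outcome model: linear, with no unobserved confounders.
    y = 5.0 + true_effect * d + 0.1 * x1 - 1.0 * x2 + rng.normal(0.0, 1.0, n)

    # Naive estimator: raw difference in mean pain between the groups.
    naive = y[d == 1].mean() - y[d == 0].mean()

    # OLS of Y on an intercept, D, X_1, and X_2.
    design = np.column_stack([np.ones(n), d, x1, x2])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return naive, beta_hat[1]

naive, adjusted = simulate_and_estimate()
print(f"naive difference in means: {naive:.2f}")
print(f"OLS coefficient on D:      {adjusted:.2f}")
```

In this setup the naive difference overstates the pain reduction because the treated subjects are younger and more active, while the coefficient on $D$ lands close to the assumed effect. If the true relationship were nonlinear, or if an unmeasured variable drove both treatment and pain, the same regression would not be expected to recover it.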
Would this model work?