I have a question about propensity score matching and how it is used in analyzing non-randomized datasets. I understand how it is performed (I think!), but not why I should use it and not something else. Let me illustrate what I mean with an example.
Assume I have a dataset with a treatment (x, binary variable), an outcome (y, numeric) and one covariate (z, numeric). The task is to determine the effect of the treatment (x) on the outcome (y). The treatment, however, was not randomly administered, so we need to take the covariate (z) into account. A real-world example would be y=sick days per year, x=some drug and z=age.
If I understand correctly, this is how you (can) go about applying propensity score matching:
- Predict treatment x from the covariate z, e.g. using logistic regression
- Group patients based on this predicted probability of treatment (the propensity score)
- For each propensity score group, calculate the average outcome (y) for patients with the treatment (x=1) and without the treatment (x=0)
- For each propensity score group, calculate the difference between x=1 and x=0 averages above and then average all those differences (maybe using some weighting scheme)
- The average difference is the treatment effect
Please let me know if I got this part wrong!
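To make the steps above concrete, here is a minimal sketch of what I have in mind, using only NumPy. Everything in it is an assumption made for illustration: the data are simulated (age drives both treatment and sick days, with a true treatment effect of -2 days), the function names `simulate`, `fit_logistic` and `stratified_effect` are made up, the logistic regression is fitted by hand with Newton's method, and the choice of five groups weighted by group size is just one possible weighting scheme. As far as I can tell, grouping on the score like this is usually called stratification (subclassification) on the propensity score.

```python
import numpy as np

def simulate(n=5000, effect=-2.0, seed=0):
    """Simulated data (an assumption): older patients are treated more often."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(20, 70, n)                      # covariate: age
    p_treat = 1 / (1 + np.exp(-(z - 45) / 10))      # true treatment probability
    x = (rng.uniform(size=n) < p_treat).astype(int) # treatment: 0 or 1
    y = 5 + 0.1 * z + effect * x + rng.normal(0, 1, n)  # outcome: sick days
    return x, y, z

def fit_logistic(z, x, n_iter=25):
    """Step 1: predict x from z with logistic regression (Newton's method)."""
    zs = (z - z.mean()) / z.std()                   # standardize for stability
    X = np.column_stack([np.ones_like(zs), zs])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (x - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return 1 / (1 + np.exp(-X @ beta))              # propensity score per patient

def stratified_effect(x, y, score, n_groups=5):
    """Steps 2-5: group by score, compare averages per group, then average."""
    edges = np.quantile(score, np.linspace(0, 1, n_groups + 1))
    group = np.clip(np.searchsorted(edges, score, side="right") - 1,
                    0, n_groups - 1)
    diffs, weights = [], []
    for g in range(n_groups):
        treated = y[(group == g) & (x == 1)]
        control = y[(group == g) & (x == 0)]
        if len(treated) == 0 or len(control) == 0:
            continue                                # no comparison possible here
        diffs.append(treated.mean() - control.mean())
        weights.append(len(treated) + len(control)) # weight by group size
    return np.average(diffs, weights=weights)

x, y, z = simulate()
score = fit_logistic(z, x)
naive = y[x == 1].mean() - y[x == 0].mean()         # ignores age entirely
estimate = stratified_effect(x, y, score)
print(f"naive difference:    {naive:.2f}")
print(f"stratified estimate: {estimate:.2f}")
```

In this simulation the naive difference mixes the drug effect with the age effect, because treated patients are older on average. The stratified estimate should land much closer to the simulated effect, though some age imbalance remains inside each group.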
An alternative, not involving propensity scores at all, would be like this:
- Predict the outcome (y) from the covariate (z) AND the treatment (x), e.g. using linear regression (since y is numeric)
- Study the size and sign of the coefficient in front of x, and also check if it is statistically significant
- Done!
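A minimal sketch of this second approach, again under assumptions of my own: the data are simulated the same way as in my example (age drives both treatment and sick days, true effect -2), and I use ordinary least squares, i.e. linear regression, because y is numeric here. The standard error uses the textbook OLS formula, and the 1.96 multiplier is the usual large-sample approximation for a 95% interval.

```python
import numpy as np

# Simulated data (an assumption): y = sick days, x = drug, z = age.
rng = np.random.default_rng(1)
n = 5000
z = rng.uniform(20, 70, n)
x = (rng.uniform(size=n) < 1 / (1 + np.exp(-(z - 45) / 10))).astype(float)
y = 5 + 0.1 * z - 2.0 * x + rng.normal(0, 1, n)

# Step 1: regress y on an intercept, the treatment x and the covariate z.
X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: size, sign and significance of the coefficient in front of x.
resid = y - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])       # residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)           # classical OLS covariance
se = np.sqrt(np.diag(cov))
t_stat = coef[1] / se[1]
ci_low, ci_high = coef[1] - 1.96 * se[1], coef[1] + 1.96 * se[1]

print(f"coefficient of x: {coef[1]:.2f}")
print(f"standard error:   {se[1]:.3f}")
print(f"t statistic:      {t_stat:.1f}")
print(f"approx. 95% CI:   ({ci_low:.2f}, {ci_high:.2f})")
```

Note that this sketch works well here partly because the simulated outcome really is linear in age. With real data that is an assumption I would have to check.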
Will the two methods give me the same result (probably not)? If not, which one is preferred and why? Does it differ if I have more than one treatment or covariate?
I'd be very happy if you could find the time to help me with this one. And please keep in mind that I'm just a simple engineer, so please limit the number of difficult statistical words and formulas :-)