9

In my study, running a simple linear model to calculate the propensity score for each example did not seem to model my treatment assignment process correctly. My question is: does it make sense to use a "stronger" model (an SVM, a neural network, you name it) to try to obtain a more precise propensity score?

Thank you in advance!

lsfischer
  • Since the propensity score is a conditional probability, you should use a probability model, like a logit. I guess you can post-process classifiers like SVMs to get probabilities from them, but that's likely worse. If the goal is to control for a vector $X$ with the propensity score $\Pr(T=1 \mid X)$, one alternative strategy where a stronger (ML) model can make sense is to model $E[Y \mid T,X]$ directly. Based on what do you conclude that your current propensity score model isn't right? – CloseToC Aug 07 '19 at 10:15

2 Answers

14

There are two approaches for modeling propensity scores. One is to try to approximate the treatment assignment process as closely as possible, and the other is to obtain propensity scores that yield covariate balance.

The first approach relies on the finding that conditioning on the true propensity score balances all pre-treatment covariates fully (i.e., their entire joint distribution). This is what Rosenbaum & Rubin (1983) discovered and why the propensity score has become so important. A problem is that there is almost no hope of correctly modeling the treatment assignment process to obtain propensity scores, and there is some evidence that even correctly modeling it parametrically is inefficient (Kim, 2019). Many alternatives have been developed that use machine learning methods to model the propensity score flexibly. The two most effective from what I've seen have been Bayesian Additive Regression Trees (BART; Hill, 2011; applied to propensity score modeling by Hill et al., 2011) and SuperLearner (Pirracchio et al., 2015). BART is a sum-of-trees approach that uses a Bayesian prior to prevent overfitting while allowing the model to be very flexible. SuperLearner is a stacking method that lets you supply many different machine learning methods and either picks the best one or takes an optimally weighted combination of them. If any of the supplied methods approximates the true model, SuperLearner will (asymptotically) perform as well as or better than the best of them.
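
For concreteness, here is a minimal R sketch of estimating propensity scores with SuperLearner; the data frame `df` and the variable names `treat`, `x1`, and `x2` are hypothetical, and the choice of learners is just one reasonable set.

```r
# A minimal sketch: stack a parametric model with flexible learners;
# SuperLearner weights them to minimize cross-validated risk.
library(SuperLearner)

sl_fit <- SuperLearner(
  Y          = df$treat,                 # binary treatment indicator
  X          = df[, c("x1", "x2")],      # pre-treatment covariates
  family     = binomial(),
  SL.library = c("SL.glm", "SL.glmnet", "SL.ranger")
)

# Estimated propensity scores Pr(T = 1 | X):
ps <- as.numeric(sl_fit$SL.predict)
```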

The other approach involves estimating propensity scores that yield balance. I'm defining balance as the case where the means of every term in the outcome model are the same between the treatment groups. For example, if the outcome model is $Y=\tau Z + \beta_1 X_1 + \beta_2 X_2 + \epsilon$, where $X_2 = \exp(X_1)$ and $Z$ is the treatment, balance is the case where $\bar{X}^1_1 - \bar{X}^0_1$ and $\bar{X}^1_2 - \bar{X}^0_2$ are close to $0$, where $\bar{X}^z_p$ is the mean of $X_p$ in treatment group $z$. When taking this approach, it is recommended that analysts try many different propensity score models to find the one that achieves balance, regardless of whether it mimics the true treatment assignment mechanism (Ho et al., 2007).

There are propensity score estimation methods that target balance as part of their estimation. The twang implementation of generalized boosted modeling (McCaffrey, Ridgeway, & Morral, 2004) selects the number of trees used to compute predicted values from a boosted classification model based on a balance criterion chosen by the user. The covariate balancing propensity score (Imai & Ratkovic, 2014) incorporates mean balance directly into the estimation of a logistic regression model for the propensity score. Other methods bypass a propensity score model and go straight to estimating weights that balance covariates, including entropy balancing (Hainmueller, 2012) and Stable Balancing Weights (Zubizarreta, 2015), though it has been found that these methods implicitly fit a propensity score model. A problem with these methods is that one has to have a good idea of the form of the outcome model. That said, with some of these methods it's possible to achieve balance on many moments of the covariate distributions (i.e., mean, variance, skew, etc.) and their interactions, so that whatever the outcome model is, adequate balance will be achieved.
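
A minimal R sketch of two of these balance-targeting estimators is below; `df`, `treat`, `x1`, and `x2` are hypothetical names, and the tuning values are illustrative only.

```r
library(twang)  # generalized boosted modeling with balance-based tuning
library(CBPS)   # covariate balancing propensity score

# twang: the number of trees is selected to optimize a user-chosen
# balance criterion (here, mean absolute standardized effect size).
gbm_fit <- ps(treat ~ x1 + x2, data = df,
              n.trees = 5000, estimand = "ATT",
              stop.method = "es.mean")
bal.table(gbm_fit)  # covariate balance diagnostics

# CBPS: mean balance conditions are built directly into the
# logistic-regression estimating equations.
cbps_fit <- CBPS(treat ~ x1 + x2, data = df)
ps_cbps  <- fitted(cbps_fit)  # estimated propensity scores
```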

Regardless of which approach you choose, you should assess balance on your covariates. Ideally, you want to manage the bias-variance trade-off by ensuring balance on as many covariates and their transformations as possible while retaining a high effective sample size. There is no way to know what the optimal trade-off is without relying on deep substantive knowledge or modeling the outcome. Indeed, in many cases I recommend modeling the outcome rather than using propensity scores alone. Using BART for the outcome model, with a BART-estimated propensity score included among the covariates, has proven to be extremely effective (Dorie et al., 2019) and is easy to implement using the bartCause R package.
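
Here is a minimal sketch of that last strategy with bartCause; `df`, the outcome `y`, the treatment `z`, and the confounders `x1` and `x2` are hypothetical names.

```r
library(bartCause)

# BART for the outcome surface, with a BART-estimated propensity score
# included among the predictors of the response model.
fit <- bartc(response = y, treatment = z, confounders = x1 + x2,
             data = df,
             method.trt = "bart",   # BART model for Pr(Z = 1 | X)
             method.rsp = "bart",   # BART model for the outcome
             estimand   = "ate")
summary(fit)  # posterior summary of the average treatment effect
```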


Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statistical Science, 34(1), 43–68. https://doi.org/10.1214/18-STS667

Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies. Political Analysis, 20(1), 25–46. https://doi.org/10.1093/pan/mpr025

Hill, J. L. (2011). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1), 217–240. https://doi.org/10.1198/jcgs.2010.08162

Hill, J., Weiss, C., & Zhai, F. (2011). Challenges With Propensity Score Strategies in a High-Dimensional Setting and a Potential Alternative. Multivariate Behavioral Research, 46(3), 477–513. https://doi.org/10.1080/00273171.2011.570161

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, 15(3), 199–236. https://doi.org/10.1093/pan/mpl013

Kim, K. I. (2019). Efficiency of Average Treatment Effect Estimation When the True Propensity Is Parametric. Econometrics, 7(2), 25. https://doi.org/10.3390/econometrics7020025

McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity Score Estimation With Boosted Regression for Evaluating Causal Effects in Observational Studies. Psychological Methods, 9(4), 403–425. https://doi.org/10.1037/1082-989X.9.4.403

Pirracchio, R., Petersen, M. L., & van der Laan, M. (2015). Improving Propensity Score Estimators’ Robustness to Model Misspecification Using Super Learner. American Journal of Epidemiology, 181(2), 108–119. https://doi.org/10.1093/aje/kwu253

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41

Zubizarreta, J. R. (2015). Stable Weights that Balance Covariates for Estimation With Incomplete Outcome Data. Journal of the American Statistical Association, 110(511), 910–922. https://doi.org/10.1080/01621459.2015.1023805

Noah
  • Thank you for the clear and complete explanation on both paths of propensity scores! – lsfischer Aug 08 '19 at 09:25
  • 3
    +1 particularly for a superb list of references – EdM Aug 08 '19 at 16:51
  • 1
    Credit where it is due. This is a proper answer. (+1) Great to see some newer references. – usεr11852 Aug 08 '19 at 17:12
  • Wonderful answer. I just want to make a comment about the following statement: "_I'm defining balance as the case where the means of every term in the outcome model are the same between the treatment groups._" Covariate balance should be achieved for the whole distribution of the variables, not just for the first moment. Clearly, it may be infeasible to check all the univariate distributions, or the entire multivariate distribution, and so we settle for comparisons of first and second moments. But ideally we would like the $X_j$ to be distributed identically in both groups. – Plastic Man Mar 02 '22 at 09:29
  • Moreover, Imbens and Rubin (2015, chapter 14) show that covariates have the same distribution across groups iff the average propensity score (at the population level) is the same across treatment arms. So, investigation of the (estimated) propensity score may be useful in order to check for balance (clearly assuming that our estimated function is at least a very good approximation of the true score). – Plastic Man Mar 02 '22 at 09:31
  • @PlasticMan regarding checking balance on the propensity score, [Ho et al. (2007)](https://doi.org/10.1093/pan/mpl013) and [Stuart et al. (2013)](https://doi.org/10.1016/j.jclinepi.2013.01.013) argue otherwise. Balance on a bad propensity score won't yield covariate balance, but the only way to assess whether a propensity score is good is to see if it achieves covariate balance. So covariate balance is primal and balance on the propensity score is incidental. – Noah Mar 02 '22 at 14:39
  • Thanks for the references, I will look into them. Anyway, the main point of my comment was about how to define covariate balance (not just comparing first moments) and to point to some theoretical results about the propensity score (at the population level). – Plastic Man Mar 02 '22 at 15:44
  • 1
    @PlasticMan Both good points :) – Noah Mar 02 '22 at 16:16
0

For the most recent state-of-the-art work, have a look at the Conference on Causal Learning and Reasoning (CLeaR) 2022.

If you are interested in probabilistic models that estimate the full joint distribution

$$\Pr(Y, T \mid X) = \Pr(Y \mid T, X) \cdot \Pr(T \mid X),$$

rather than only the outcome distribution conditional on $T$ and $X$, have a look at Kelly, Kong, and Goerg (2022), "Predictive State Propensity Subclassification (PSPS): A causal inference algorithm for data-driven propensity score stratification". It is a fully probabilistic framework for causal inference that learns causal representations in the predictive state space for $\Pr(\text{outcome} \mid \text{treatment}, \text{features})$. See the paper for details (disclaimer: I am one of the authors).
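
To make the factorization concrete, here is a generic R sketch of fitting its two components separately; this is not the PSPS algorithm itself (see the paper and repository for that), and `df`, `y`, `t`, and `x` are hypothetical names for a binary outcome, a binary treatment, and a covariate.

```r
# Two components of Pr(Y, T | X) = Pr(Y | T, X) * Pr(T | X),
# each fit here with a plain logistic regression for illustration.
ps_mod  <- glm(t ~ x,     data = df, family = binomial())  # Pr(T = 1 | X)
out_mod <- glm(y ~ t + x, data = df, family = binomial())  # Pr(Y = 1 | T, X)

# Joint probability Pr(Y = 1, T = 1 | X) for each unit:
p_t      <- predict(ps_mod, type = "response")
p_y      <- predict(out_mod, newdata = transform(df, t = 1), type = "response")
joint_11 <- p_y * p_t
```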

For a ready-to-go TensorFlow/Keras implementation, see https://github.com/gmgeorg/pypsps, which includes code examples and notebook case studies for predicting unit-level treatment effects, propensity scores, counterfactual predictions, etc.

Georg M. Goerg