Bootstrapped Propensity Score Matching

Question

Assume we have a sample in which only a small fraction of units is treated, and we want to estimate the treatment effect using propensity score matching. However, since the treatment group is small, matching with a caliper distance will result in a completely different matched sample every time and, thus, the randomness in results is huge.

After reading some literature, I found mixed arguments. Is it common practice to, for example, repeat the caliper matching 1,000 times and average the coefficients estimated on the matched samples (without bootstrapping the original sample), and subsequently to estimate the standard error as the standard deviation of those coefficients? The rationale would be to get unbiased estimates, since the matching (within the caliper) is random each time.
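
To make the procedure concrete, here is a minimal sketch of repeated random caliper matching as described above. It is only an illustration under stated assumptions: units are `(propensity_score, outcome)` tuples with already-estimated scores, matching is 1:1 without replacement, and the function names (`match_once`, `repeated_matching`) and the toy data are hypothetical.

```python
import random
import statistics


def match_once(treated, controls, caliper, rng):
    """One random 1:1 caliper match without replacement.

    treated, controls: lists of (propensity_score, outcome) tuples.
    Returns the mean treated-minus-control outcome difference over the
    matched pairs, or None if no treated unit found a match.
    """
    available = list(range(len(controls)))
    order = list(range(len(treated)))
    rng.shuffle(order)  # treated units are processed in random order
    diffs = []
    for i in order:
        ps_t, y_t = treated[i]
        candidates = [j for j in available
                      if abs(controls[j][0] - ps_t) <= caliper]
        if not candidates:
            continue  # this treated unit stays unmatched
        j = rng.choice(candidates)  # random pick inside the caliper
        available.remove(j)         # each control is used at most once
        diffs.append(y_t - controls[j][1])
    return statistics.mean(diffs) if diffs else None


def repeated_matching(treated, controls, caliper, n_reps=1000, seed=0):
    """Repeat the random match n_reps times and collect the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        est = match_once(treated, controls, caliper, rng)
        if est is not None:
            estimates.append(est)
    return estimates


# Toy data (made up for illustration only).
treated = [(0.30, 3.1), (0.50, 2.4)]
controls = [(0.28, 1.0), (0.31, 1.6), (0.33, 0.7),
            (0.49, 1.2), (0.52, 2.0), (0.90, 1.5)]
estimates = repeated_matching(treated, controls, caliper=0.05, seed=42)
# Average estimate and its spread across the repeated matches.
print(statistics.mean(estimates), statistics.stdev(estimates))
```

The spread printed at the end reflects only the variation caused by the random choice of matches in a fixed sample; whether it can serve as a standard error is exactly what I am asking.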

I found some articles stating that similar methods yield accurate estimates of the population parameters, but I have not found much evidence of researchers using such methods in practice. For example:

Austin, P. C., & Small, D. S. (2014). The use of bootstrapping when using propensity-score matching without replacement: a simulation study. Statistics in Medicine, 33(24), 4306-4319.
Bai, H. (2013). A Bootstrap Procedure of Propensity Score Estimation. Journal of Experimental Education, 81(2), 157-177.

Thanks ;)

Possible duplicate of [Different results after propensity score matching in R](https://stats.stackexchange.com/questions/260918/different-results-after-propensity-score-matching-in-r) — Mr Pi, Nov 10 '17 at 11:08
The issue is similar; however, there is no answer in the comments that explains how to proceed. The only advice is 'just pick a sample', but this is of course highly biased, because you will always pick a sample with favorable results. – Oscar, Nov 10 '17 at 11:33
You wrote "However, since the treatment group is small, matching with a caliper distance will result in a completely different matched sample every time and, thus, the randomness in results is huge", but this is only true if, in addition to the group being small, the variation in the group is large. In this case, I think you want a lot of randomness to capture that. – Peter Flom, Nov 10 '17 at 12:41
Thanks. Indeed, if the treatment group becomes smaller, the 'randomness' of the results increases. Therefore, to capture most of this randomness, I thought of estimating 10,000 models, averaging all the estimates, and obtaining somewhat bootstrapped standard errors. Would you disagree with this method? – Oscar, Nov 10 '17 at 13:47

score 1 · Answer 1 · answered Nov 10 '17 at 12:32

That is a lot of trouble. I fear any method that is arbitrary and results in discarding observations that are in fact comparable with regard to propensity. There are many reasons to do either straight covariate adjustment or covariate adjustment using the logit of the propensity score. These are detailed in BBR (Biostatistics for Biomedical Research) Section 10.1 and Chapter 17.

Frank Harrell


Thanks Frank, I will have a look at these materials. Would you disagree with the methods proposed above, relative to the ones you suggest here? – Oscar Nov 10 '17 at 13:48
For example, look at this study: Austin, P. C., & Small, D. S. (2014). The use of bootstrapping when using propensity-score matching without replacement: a simulation study. Statistics In Medicine, 33(24), 4306-4319. doi:10.1002/sim.6276 – Oscar Nov 10 '17 at 13:59
I haven't studied the use of the bootstrap in this context. I get the feeling there is a more direct approach out there. – Frank Harrell Nov 10 '17 at 14:33

score 1 · Answer 2 · answered Nov 12 '17 at 01:20

The bootstrapped distribution of effect estimates using your approach cannot be interpreted as an approximation to the population sampling distribution of effects from either a sampling or randomization perspective, so it would be inappropriate to use the standard errors you got from that method as standard errors of the effect estimate.

Averaging the effect estimates seems reasonable to me; it's a version of multiple imputation, where you are imputing the missing potential outcomes multiple times with different matched units in each "imputation" (i.e., each bootstrap iteration). It would then be appropriate to apply Rubin's rules to get an effect estimate and standard errors from the imputed samples. Your standard error would include within-imputation variability due to sampling/randomization AND across-imputation variability due to the variation in matched units. The variability in your bootstrap distribution of effect sizes would be the main factor in this second component, but you'd still need to compute the first component using traditional methods (e.g., Abadie and Imbens SEs, etc.). This approach has not been studied, but it might make an interesting research project. Don't take my word for it; this is intuition based on my understanding of multiple imputation and related processes of arranging data within your sample (e.g., parceling in confirmatory factor analysis).
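
As a rough illustration of that pooling step, the sketch below applies the standard form of Rubin's rules to a set of estimates from repeated matches. It assumes you already have, for each repetition, an effect estimate and a within-sample variance (for example a squared Abadie-Imbens standard error, which is not computed here); the function name `pool_rubin` and the numbers are hypothetical, and as noted above this use of the rules in matching is untested.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool M estimates and their within-sample variances with Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # W: average within-match variance
    between = statistics.variance(estimates)  # B: variance across matches (divisor M - 1)
    total = within + (1 + 1 / m) * between    # T = W + (1 + 1/M) * B
    return q_bar, math.sqrt(total)


# Made-up inputs: three matched-sample estimates, each with SE 0.5.
est, se = pool_rubin([1.0, 2.0, 3.0], [0.25, 0.25, 0.25])
print(est, se)
```

The standard deviation of the repeated-matching estimates only enters through `between`; the `within` term has to come from a conventional variance estimator applied to each matched sample.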

I think a better approach would be to use a deterministic matching algorithm and to use all matched units. For example, if 5 control units fall within the caliper of 1 treated unit, use all 5 control units as matches. This is called variable matching and has been shown to be more effective than 1:1 matching at reducing bias and variability.
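
A minimal sketch of that idea, under some assumptions of mine: units are `(propensity_score, outcome)` tuples with already-estimated scores, a control may serve as a match for more than one treated unit, and each treated unit is compared with the unweighted mean of its matched controls. The function name `att_all_within_caliper` and the data are hypothetical.

```python
def att_all_within_caliper(treated, controls, caliper):
    """Effect estimate using every control inside each treated unit's caliper.

    Each treated unit is compared with the mean outcome of all controls whose
    propensity score lies within the caliper; treated units with no such
    control are dropped.
    Returns (estimate, number_of_treated_units_used).
    """
    diffs = []
    for ps_t, y_t in treated:
        matched = [y_c for ps_c, y_c in controls
                   if abs(ps_c - ps_t) <= caliper]
        if matched:
            diffs.append(y_t - sum(matched) / len(matched))
    if not diffs:
        raise ValueError("no treated unit has a control within the caliper")
    return sum(diffs) / len(diffs), len(diffs)


# Made-up data: the third treated unit has no control within the caliper.
treated = [(0.30, 5.0), (0.70, 4.0), (0.99, 10.0)]
controls = [(0.29, 1.0), (0.32, 3.0), (0.71, 2.0)]
print(att_all_within_caliper(treated, controls, caliper=0.05))
```

Because nothing is drawn at random, rerunning this on the same data always returns the same matched set and the same estimate.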

Thanks, Noah! Just to be clear: which of the following is the appropriate way to average the effect estimates: (i) repeatedly matching the original sample, (ii) bootstrapping the original sample once and then repeatedly matching from it, or (iii) matching the original sample once and then repeatedly bootstrapping the matched pairs? I also thought about the deterministic matching algorithm, but I found this paper: "Austin, P. C. (2010). Statistical criteria for ... the propensity score", which concludes that using more than 2 matches actually increases the bias. Do you agree? – Oscar, Nov 12 '17 at 10:52
I was saying that (i) may be appropriate if you include the additional variability due to the repeated matching in your estimate of the standard error of the effect estimate, as you would do in multiple imputation. Austin's simulations can be helpful guides, but they often miss the point that the average researcher is using balance in the matched sample to evaluate whether to proceed, which his simulations do not account for. You want to use whichever matching scheme yields the best covariate balance. It may also be that the characteristics of your data set are not well represented in his simulations. – Noah, Nov 12 '17 at 16:10
