Is it possible to use linear regression to compare matched samples in an observational study?

In order to try to eliminate confounding factors, two groups were matched on a number of background characteristics using nearest-neighbor-matching. The matched groups were created from a larger, non-random sample of participants.

Is it sufficient to enter the background variables as covariates into the regression model?

Does the fact that ultimately the paired samples are still only derived from a larger non-random sample affect the assumptions of linear regression, not to mention significance testing?

Many thanks!

Noah
Daniel

1 Answer

See my answer here for a discussion of two approaches to standard error estimation for matched samples. For a discussion on standard error and effect estimation when using matching with replacement, see the wonderfully clear and underappreciated Hill & Reiter (2006). For a related but less complete discussion for standard error estimation when using matching without replacement, see Austin & Small (2014).

Neither of these directly answers your question, and there is no consensus on how to proceed. As I mention in my linked answer, there are two philosophies about the interpretation of matching: one in which matching is nonparametric preprocessing that doesn't affect effect or variance estimation, and one in which matching is a specific analytic technique that changes the variance of the estimate and requires special procedures. A third philosophy relies on randomization-based inference, where the inference is over the possible treatment assignments for the given sample rather than over repeated samples drawn from the population. In my view, it's not immediately clear which approach is optimal or best justified.
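To illustrate the randomization-based view, here is a minimal sketch in plain Python of a paired sign-flip test. It assumes 1:1 matched pairs in which, under the null hypothesis of no effect, either unit in a pair was equally likely to be the treated one. That is an assumption of the sketch, not something matching guarantees. The function name `paired_randomization_pvalue` and the pair differences are hypothetical.

```python
import random

def paired_randomization_pvalue(diffs, n_draws=10000, seed=0):
    """Monte Carlo randomization p-value for the mean within-pair difference.

    diffs: treated-minus-control outcome differences, one per matched pair.
    Assumes (hypothetically) that within each pair, treatment could have gone
    to either unit with probability 1/2, so each difference can flip sign.
    """
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(n_draws):
        # Re-assign treatment within each pair by randomly flipping the sign.
        flipped_mean = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped_mean) >= observed:
            extreme += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (extreme + 1) / (n_draws + 1)

# Hypothetical within-pair outcome differences for ten matched pairs.
pair_diffs = [1.2, 0.8, 1.5, 0.3, 0.9, 1.1, 0.6, 1.4, 0.7, 1.0]
p_value = paired_randomization_pvalue(pair_diffs)
print("randomization p-value:", p_value)
```

The inference here is only about the treatment assignments that could have occurred in this matched sample, which is why it makes no appeal to random sampling from a population.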

My perspective is that the most mainstream practice is to perform regression in the matched sample, ignoring the variability due to estimating the propensity score (if that was done) and due to matching, and possibly accounting for the correlation between paired units. This is the advice given by Ho, Imai, King, & Stuart (2007). They argue that the analysis that you would have done before matching is the one that you should do after matching, with no further adjustment for the matching. That means you can perform a t-test, run a regression, or run any other kind of analysis in your matched set. It's likely that some of the assumptions of regression (linearity, exogeneity) are better satisfied in the matched set than in the unmatched sample.

Note that if you are using teffects in Stata or Matching in R, effects are not estimated this way; those tools apply their own matching estimators and standard errors rather than a regression in the matched set. If you manually matched, used psmatch2 in Stata, or used MatchIt in R, then effects are typically estimated using regression in the matched set, optionally including covariates in the regression.
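The regression-in-the-matched-set workflow described above can be sketched in plain Python. This is only an illustration under loud assumptions: the data are simulated, there is a single hypothetical covariate `x`, matching is greedy 1:1 nearest-neighbor without replacement, and the helper names `nearest_neighbor_match` and `ols` are made up for this example. It produces point estimates only and does not address any of the standard error questions discussed above.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the illustration is reproducible

# Simulated (entirely hypothetical) observational data: x affects both
# treatment assignment and the outcome; the true treatment effect is 2.0.
units = []
for _ in range(600):
    x = rng.gauss(0.0, 1.0)
    treat = 1 if rng.random() < 1.0 / (1.0 + math.exp(-(x - 1.0))) else 0
    y = 1.0 + 2.0 * treat + 1.5 * x + rng.gauss(0.0, 1.0)
    units.append({"x": x, "treat": treat, "y": y})

def nearest_neighbor_match(treated, controls):
    """Greedy 1:1 nearest-neighbor matching on x, without replacement."""
    available = list(controls)
    matched = []
    for t in treated:
        if not available:
            break
        i = min(range(len(available)), key=lambda j: abs(available[j]["x"] - t["x"]))
        matched.extend([t, available.pop(i)])
    return matched

def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

treated = [u for u in units if u["treat"] == 1]
controls = [u for u in units if u["treat"] == 0]
matched = nearest_neighbor_match(treated, controls)

# Regress the outcome on treatment plus the matching covariate,
# using only the matched units.
X = [[1.0, float(u["treat"]), u["x"]] for u in matched]
y = [u["y"] for u in matched]
intercept, effect, slope = ols(X, y)
print("estimated treatment effect in matched sample:", round(effect, 3))
```

Including `x` in the regression after matching on it corresponds to the "optionally including covariates" step: matching reduces imbalance, and the covariate adjustment mops up whatever imbalance remains.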


Austin, P. C., & Small, D. S. (2014). The use of bootstrapping when using propensity-score matching without replacement: A simulation study. Statistics in Medicine, 33(24), 4306–4319. https://doi.org/10.1002/sim.6276

Hill, J., & Reiter, J. P. (2006). Interval estimation for treatment effects using propensity score matching. Statistics in Medicine, 25(13), 2230–2256. https://doi.org/10.1002/sim.2277

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, 15(3), 199–236. https://doi.org/10.1093/pan/mpl013

Noah
  • Thank you Noah for this elaborate and detailed answer! Lots to read up on first. What worries me though is that, whether I use the matched or the unmatched sample, the predictors I have won't be sufficient to completely account for any bias due to the non-probabilistic ex post facto design. Since there are no population data available that would allow me to create a representative subsample, I fear that no matter how I try to fix things statistically, this will remain a source of criticism. Or are there ways to still extrapolate to a population despite the non-representative sample? Thx again! – Daniel Apr 27 '20 at 17:42
  • Do you mean that you worry that your subsample differs meaningfully from the population to which you want to generalize your findings, or you worry about the failure to control for enough variables to be able to claim the effect estimate represents a causal rather than purely associational quantity? There are ways of dealing with both worries but they are beyond the scope of this question. Feel free to make another post and someone can chime in. – Noah Apr 27 '20 at 17:58
  • Mostly the former. I've created the question [here](https://stats.stackexchange.com/questions/463106/generalize-to-population-from-a-nonrandom-sample-in-ex-post-facto-design). I'll first read what you've provided before opening a question about the latter. Thanks! – Daniel Apr 27 '20 at 19:46