Favored methods for overcoming selection bias (special attention to healthcare fields)?

Question

I am frequently measuring the effect of behavioral health treatment interventions on outcomes of interest. However, comparing the relative efficacy of different types of treatment is tricky - more intensive interventions may indicate clients with more severe issues, whose outcomes will be more frequently negative anyway. RCTs are generally unethical in the areas I'm studying.

What are your favorite approaches for addressing this sort of selection bias - where level of need determines intervention type, but level of need also plays a role in determining outcome? What are your critiques of common approaches?

Some of the approaches I've explored (note that when I say covariates indicating severity, I have no magic variable that shows "this treatment is what a person needs"; this is all based on theory and observed/available data, and these are just likely indicators that have to be considered alongside other factors):

Multivariate models including covariates for severity of condition (e.g., primary diagnosis, history of emergency services, etc.);

Propensity-score matching, with the same factors predicting treatment type and outcome (but can only examine one treatment type at a time);

Latent class analysis (built off covariates that may indicate severity);

Only running models on tightly-defined groups (e.g., only on people with one specific type of diagnosis).

Latent class analysis classifies respondents based on the indicator variables you select. When you say "built off covariates that may indicate severity", it sounds like you might be thinking of forming classes like low severity, medium severity, high severity. I don't think this gains you anything in terms of causal inference. You might as well just enter the variables into a regression. Thus, I respectfully disagree that this post should get the latent class tag. However, can you show articles that are a counterexample? — Weiwen Ng, Aug 22 '19 at 18:38
Point of information: running models on tightly-defined groups could be a subset of stratified analysis. — Weiwen Ng, Aug 22 '19 at 18:39

score 5 · Answer 1 · answered Aug 19 '19 at 01:19

There is no single magic bullet to estimate treatment effects in the context of confounding (note: "selection bias" can mean something else). There is also no agreement in the field about the best method, and the best method for a given problem may differ from the best method for another (and neither will be immediately apparent). My understanding is that some of the best performing methods are the "multiply robust" methods, which include targeted minimum loss-based estimation (TMLE) and Bayesian additive regression trees (BART) with a BART propensity score. I describe these methods with references in this post.

These methods are multiply robust in that there are numerous forms of misspecification that they are robust to (i.e., they will give you an unbiased or low-error estimate even if you get some things wrong about the relationships among variables). The more standard doubly robust methods are those that give you two chances to correctly specify a model in order to arrive at an unbiased estimate of the treatment effect. Augmented inverse probability weighting (AIPW) with parametric outcome and propensity score models is one such example; if either the outcome model or propensity score model is correct, the effect estimate is unbiased. Multiply robust methods are robust to these misspecifications but also to misspecifications of the functional form of the relationship between the covariates and the treatment or outcome. They gain this property through flexible nonparametric modeling of these relationships. Such methods are highly preferred because they require fewer untestable assumptions to get the right answer, in contrast to propensity score matching or regression, which require strong assumptions about functional form.
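To make the "two chances" idea concrete, here is a minimal sketch of AIPW on simulated data, using only the Python standard library. Everything in it is an assumption made for illustration: the data-generating process (a single binary severity indicator, a true treatment effect of -1 on a symptom score), the function names, and the use of simple cell means in place of parametric or machine-learned models. It is not TMLE or BART, just the basic doubly robust estimator.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def simulate(n, seed=0):
    """Hypothetical data: x = severe case (0/1), t = intensive treatment (0/1),
    y = symptom score (higher is worse). True treatment effect is -1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        # Severe clients are much more likely to get the intensive treatment.
        t = 1 if rng.random() < (0.8 if x else 0.2) else 0
        y = -1.0 * t + 2.0 * x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def naive_difference(rows):
    """Unadjusted treated-minus-untreated difference in mean outcome."""
    return (mean(y for _, t, y in rows if t == 1)
            - mean(y for _, t, y in rows if t == 0))


def fit_propensity(rows, use_x=True):
    """P(T=1 | X) as stratum proportions. use_x=False is a deliberately
    misspecified model that ignores severity."""
    if not use_x:
        p = mean(t for _, t, _ in rows)
        return {0: p, 1: p}
    return {x: mean(t for xi, t, _ in rows if xi == x) for x in (0, 1)}


def fit_outcome(rows, use_x=True):
    """E[Y | T, X] as cell means. use_x=False is a deliberately
    misspecified model that ignores severity."""
    model = {}
    for t in (0, 1):
        for x in (0, 1):
            cell = [y for xi, ti, y in rows
                    if ti == t and (xi == x or not use_x)]
            model[(t, x)] = mean(cell)
    return model


def aipw(rows, propensity, outcome):
    """Augmented inverse probability weighted estimate of the average effect."""
    scores = []
    for x, t, y in rows:
        e = propensity[x]
        m1, m0 = outcome[(1, x)], outcome[(0, x)]
        scores.append(m1 - m0
                      + t * (y - m1) / e
                      - (1 - t) * (y - m0) / (1 - e))
    return mean(scores)


if __name__ == "__main__":
    rows = simulate(20000, seed=1)
    print("naive:", naive_difference(rows))
    print("AIPW, wrong outcome model:",
          aipw(rows, fit_propensity(rows, True), fit_outcome(rows, False)))
    print("AIPW, wrong propensity model:",
          aipw(rows, fit_propensity(rows, False), fit_outcome(rows, True)))
    print("AIPW, both wrong:",
          aipw(rows, fit_propensity(rows, False), fit_outcome(rows, False)))
```

In this toy setup the naive comparison makes the treatment look harmful because treated clients are more severe, while AIPW lands near the true effect as long as one of the two models accounts for severity. When both models ignore severity, the estimate falls back to the naive one, which is the failure the flexible methods above try to guard against.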

I would check out the best performers of the annual Atlantic Causal Inference Conference competition, as these represent the cutting edge of causal inference methods and are demonstrated to perform well in a variety of conditions. TMLE and BART were two of the best performers, and are both accessible and easy to use.

I'm not going to write off the other methods you mention, but they do require many assumptions that cannot easily be assessed or they have been demonstrated to perform poorly in a number of contexts. They are still the standards in the health sciences, but that is slowly changing as the advanced methods become better studied and more accessible.

Do you know which methods performed best at the 2019 ACIC Data Challenge? — RobertF, Sep 05 '19 at 15:23
I don't recall (I was there though!), but I do remember BART was high up there. They tend to be slow to release the results, but you can attempt to ask the organizers of the conference. — Noah, Sep 06 '19 at 05:55

score 2 · Answer 2 · answered Aug 22 '19 at 18:57

I don't disagree with Noah's answer. I have never heard of Bayesian additive regression trees or of targeted minimum loss-based estimation, so I can't comment on those specifically. Methods involving weighting and propensity scores are well-accepted in epidemiological circles.

You should also consider instrumental variable and regression discontinuity approaches.

In the former, there are sometimes cases where you have a variable that influences the probability of receiving treatment but affects the outcome only through the treatment. For example, McClellan et al. (1994) noted that some hospitals treated acute myocardial infarction (the fancy term for heart attack) more intensively than others (i.e. they were more prone to use cardiac catheterization and revascularization, as opposed to what I guess is medical management). They used the differential distance as their instrument: for each patient, what was the distance to the nearest high-catheterization hospital minus the distance to the nearest low-catheterization hospital?

IVs are not without untestable assumptions - just like all observational methods, really. Also, they answer a subtly different question than a randomized trial would. Quoting McClellan et al.:

Thus, IV methods are ideally suited to address the question, "What would be the effect of reducing the use of invasive procedures after AMI in the elderly by, for example, one fourth?" They do not address the question, "What would be the expected effect of treating a particular patient aggressively rather than with noninvasive therapies alone?" For clinical decisions involving treatment of individual patients, the answer to the latter question is more useful. For policy decisions affecting the treatment of patient populations, the answer to the former is likely to be more useful.
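As a rough illustration of the instrumental variable logic described above, here is a minimal sketch of the Wald (ratio) estimator on simulated data, using only the Python standard library. The data-generating process is entirely hypothetical: a binary instrument standing in for "lives relatively closer to a high-intensity provider", an unobserved severity variable, and a constant true treatment effect of -1. It is not a reconstruction of the McClellan et al. analysis.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def simulate_iv(n, seed=0):
    """Hypothetical data: z = instrument (0/1), t = treatment (0/1),
    y = outcome. Severity u is generated but NOT returned, to mimic an
    unmeasured confounder. True treatment effect is -1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)              # unobserved severity
        z = 1 if rng.random() < 0.5 else 0   # instrument, independent of u
        # Treatment depends on both the instrument and severity.
        p_treat = 0.1 + 0.5 * z + (0.3 if u > 0 else 0.0)
        t = 1 if rng.random() < p_treat else 0
        # The instrument affects y only through t (exclusion restriction).
        y = -1.0 * t + 1.0 * u + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def naive_difference(rows):
    """Unadjusted treated-minus-untreated difference in mean outcome."""
    return (mean(y for _, t, y in rows if t == 1)
            - mean(y for _, t, y in rows if t == 0))


def wald_estimate(rows):
    """Effect of z on y divided by effect of z on t."""
    reduced_form = (mean(y for z, _, y in rows if z == 1)
                    - mean(y for z, _, y in rows if z == 0))
    first_stage = (mean(t for z, t, _ in rows if z == 1)
                   - mean(t for z, t, _ in rows if z == 0))
    return reduced_form / first_stage


if __name__ == "__main__":
    rows = simulate_iv(100000, seed=1)
    print("naive:", naive_difference(rows))
    print("Wald IV:", wald_estimate(rows))
```

Because severity is unmeasured here, no amount of covariate adjustment could fix the naive comparison, whereas the ratio estimator should land near -1 under the simulated assumptions. With real data the exclusion restriction cannot be checked this way, which is the untestable part.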

Alternatively, sometimes you have cases where treatment is given to people at or above a cutoff point on some sort of score, and withheld from everyone below the cutoff. You can exploit that in a regression discontinuity design. You'd compare people just above the cutoff to people just below it. The inherent assumption is that because all scores are measured with error, the people just above the cutoff and the people just below it are pretty similar. This does also require that the participants did not game the score - which is an assumption that you should really think about. In some ways, being above versus below the cutoff is an instrument.
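A minimal sketch of a sharp regression discontinuity estimate, again on simulated data and with only the Python standard library, might look like the following. The severity score, the cutoff of 50, the bandwidth of 10, and the true effect of -2 are all assumptions chosen for illustration; in practice the bandwidth choice matters a great deal and dedicated software handles it more carefully.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def simulate_rd(n, cutoff=50.0, seed=0):
    """Hypothetical data: everyone at or above the cutoff on a severity
    score is treated. The outcome worsens with the score; the true
    treatment effect is -2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        score = rng.uniform(0.0, 100.0)
        t = 1 if score >= cutoff else 0
        y = 10.0 + 0.1 * score - 2.0 * t + rng.gauss(0.0, 1.0)
        rows.append((score, t, y))
    return rows


def naive_difference(rows):
    """Unadjusted treated-minus-untreated difference in mean outcome."""
    return (mean(y for _, t, y in rows if t == 1)
            - mean(y for _, t, y in rows if t == 0))


def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b


def rd_estimate(rows, cutoff=50.0, bandwidth=10.0):
    """Fit a separate line on each side of the cutoff within the bandwidth
    and take the jump between the two fitted values at the cutoff."""
    left = [(s - cutoff, y) for s, _, y in rows
            if cutoff - bandwidth <= s < cutoff]
    right = [(s - cutoff, y) for s, _, y in rows
             if cutoff <= s <= cutoff + bandwidth]
    a_left, _ = fit_line(left)
    a_right, _ = fit_line(right)
    return a_right - a_left


if __name__ == "__main__":
    rows = simulate_rd(20000, seed=1)
    print("naive:", naive_difference(rows))
    print("RD estimate at cutoff:", rd_estimate(rows))
```

Centering the score at the cutoff makes each fitted intercept the predicted outcome right at the threshold, so the difference of intercepts is the estimated jump. The estimate only speaks to people near the cutoff, which is one reason the design answers a narrower question than a trial would.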

The issue is that it may be difficult to find an instrument, and that the treatments you're interested in may not be assigned according to some score.

I encourage you to look into BART and TMLE! They hold a lot of promise and are slowly gaining traction in the methodological world. — Noah, Aug 22 '19 at 20:14
