3

Imagine I do a randomized experiment at the beginning of the school year. Incoming freshmen (a) participate in a diversity class or (b) do not. At the end of the year, I email them asking them to fill out a single 4-point Likert question on how they feel toward diversity on campus.

Now, imagine that some $k$ percent of students do not answer this item. However, I have a large number of variables about the students who did and did not respond: demographics, classes they took, where they are from, their high school GPA, etc.

I want to make a valid causal inference about whether the diversity class had any effect on attitudes toward diversity, using an ordered logistic regression. However, the non-response/dropout rate could bias this: What if the people who didn't support diversity responded at a lower rate? How should I handle the information I can get from the $k$ percent of cases that dropped out? I could do a logistic regression to see if any of the variables I have predict non-response—but what do I do after that?
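For concreteness, the non-response regression I have in mind would look something like this (toy simulated data; the variable names `gpa`, `treated`, and `responded` are made up for illustration):

```python
# Toy illustration: predict non-response from baseline covariates.
# Data are simulated; variable names are invented for this sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
gpa = rng.normal(3.0, 0.5, n)        # a baseline covariate observed for everyone
treated = rng.integers(0, 2, n)      # randomized diversity-class assignment
# Simulate response that depends on GPA, i.e. informative missingness
p_respond = 1 / (1 + np.exp(-(-2.0 + 1.0 * gpa)))
responded = rng.binomial(1, p_respond)

X = np.column_stack([gpa, treated])
fit = LogisticRegression().fit(X, responded)
p_hat = fit.predict_proba(X)[:, 1]   # each student's estimated response propensity
```

That gives me an estimated response propensity per student—but it is exactly the next step (what to do with those propensities) that I am asking about.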

Note that many treatments of this problem focus on surveys and polls, where the goal is to generalize to a population. That is not my goal here: I am interested in retaining the validity of causal inferences about my experimental condition.

I am unfamiliar with this area: What are some references to get me up to speed on how to analyze these data in a way that provides valid causal inferences? I know there are solutions like propensity score matching and weighting cases based on demographics, but I do not know where to begin. Any good papers, books, tutorials, R packages and vignettes, etc.?

Mark White
  • 8,712
  • 4
  • 23
  • 61
  • It's a well-explained question. You are unlikely to find a source that's tailor-made to your situation. The best thing is to study in depth the literature on causal inference, survey research, survey bias, and missing data. You may find this of interest: https://symposium.nestat.org/short-courses.html#causal – rolando2 Feb 12 '18 at 12:56
  • 1
    @rolando2 if it is well-explained, surely there is a good textbook or article that covers it? Most of the things I find are focused on weighting as a method of generalization, not necessarily addressing causal inference. – Mark White Feb 13 '18 at 02:26

2 Answers

1

Briefly, yes: there are several methods to help address non-random (a.k.a. "informative") missingness due to non-response or dropout. In your example, there is potential for bias if measurement (response to the survey) is related to the outcome (support for diversity). In my work, we often face informative measurement of patient outcomes. For example, HIV+ individuals who are not on antiretrovirals have a very small probability of viral suppression. If they are also less likely to have their viral load measured (e.g., due to health-seeking behavior, or simply because they are sicker), then failure to adjust for differential measurement will overestimate population-level suppression. For further details on this HIV example, see Petersen et al.: https://jamanetwork.com/journals/jama/fullarticle/2630602?utm_source=jps&utm_medium=email&utm_campaign=author_alert-jamanetwork&utm_content=author-author_engagement&utm_term=1m

Here is a link to a short talk on a Roadmap for Causal Inference, which considers missing data as another hypothetical intervention/exposure variable: https://works.bepress.com/laura_balzer/50/

To control for informative measurement, one could use parametric G-computation, inverse probability weighting, or targeted maximum likelihood estimation. A full course (taught at UC Berkeley and UMass Amherst) on this framework and methods is available at http://www.ucbbiostat.com/
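As a minimal illustration of the inverse probability weighting idea (simulated data; all names are invented, and this sketch is no substitute for the full methods and diagnostics covered in the course):

```python
# Minimal IPW sketch: weight observed cases by 1 / P(respond | covariates).
# Everything here is simulated and simplified for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)                # baseline covariate
a = rng.integers(0, 2, n)             # randomized treatment
y = (a + 0.5 * x + rng.normal(size=n) > 0).astype(int)  # binary attitude outcome
# Response depends on x, so complete-case analysis would be biased
p_resp = 1 / (1 + np.exp(-(0.5 + x)))
r = rng.binomial(1, p_resp)

# Step 1: estimate response propensities from the covariates
ps = LogisticRegression().fit(x.reshape(-1, 1), r).predict_proba(x.reshape(-1, 1))[:, 1]
w = r / ps                            # non-responders get weight 0

# Step 2: weighted difference in mean outcomes between arms
effect = (np.sum(w * a * y) / np.sum(w * a)
          - np.sum(w * (1 - a) * y) / np.sum(w * (1 - a)))
```

In words: responders who "look like" likely non-responders get up-weighted, so the weighted sample mimics the full randomized sample.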

-1

Let's start with the elephant in the room: I don't think any statistical method can ensure a completely valid causal inference from your experimental design. You can certainly extract useful information, but if non-response and attitude toward diversity are correlated in both groups, your conclusions will always be limited, and you should be aware of that. Using a large number of other predictors (GPA, etc.) might easily make your analysis more fragile: be very wary of analyzing subgroups separately, as you might just be chasing noise. Your chances improve if you are able to collect attitude toward diversity before the intervention in both groups. Also, I think your use of a Likert item is incorrect and likely to bias your inference (see below for more detail).

With that cleared up, what is your best chance to extract useful information? The answer by L.B. Balzer gives a good overview of the frequentist approach; I would, however, place my bets on a Bayesian approach. An interesting treatment is given in Si et al. (including code at the end). A bonus of going Bayesian is that you don't have to worry as much about too many predictors, multiple comparisons, etc., and you get more interpretable results (you can say things like "if the model is (approximately) correct, the probability of an effect > X is Y%").

An aside on Likert items and scales: I think using a single Likert item (question) to measure something as complex as attitude toward diversity will be highly misleading. In particular, it is difficult to make sure everyone understands the same thing from a single question—diversity is a loaded word! That's why it is preferable to use Likert scales (items and scales are often conflated, but they are distinct things). A Likert scale is a combination of multiple Likert items linked to the same construct you want to measure, some keyed positively and some negatively. E.g., for your case some of the items might be:

  • Some people on the campus are too different from me and it makes me feel uncomfortable
  • I enjoy meeting people with different social background, sexual or gender identity
  • I think the community in the campus is too homogenous

Then you reverse-code the negatively keyed questions and take the average of the responses to the individual items; this average is your variable of interest. In addition to reducing noise from misunderstanding of some of the questions, the average tends to be approximately normally distributed, so you don't have to do any complicated ordered logit or the like. Some source for further discussion of this. I've heard (but not verified) that this approach was in the original publication by Likert, but people tend to forget it.
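The reverse-coding-and-averaging step can be sketched as follows (toy responses on a 1–4 scale; which items are negatively keyed is invented for illustration):

```python
# Sketch of reverse-coding and averaging Likert items on a 4-point scale.
# Toy data: one row per respondent, one column per item.
import numpy as np

responses = np.array([
    [4, 1, 2],
    [3, 2, 3],
    [1, 4, 4],
])
negatively_keyed = [True, False, False]  # e.g. the "makes me uncomfortable" item

scored = responses.astype(float)
for j, neg in enumerate(negatively_keyed):
    if neg:
        scored[:, j] = 5 - scored[:, j]  # reverse a 1-4 item: x -> 5 - x

scale_score = scored.mean(axis=1)        # respondent-level scale score
```

The resulting `scale_score` is the (approximately continuous) variable you would then analyze.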

Note that it is hard to design good Likert scales; a well-tested and validated set of items may already exist for attitudes toward diversity!

Martin Modrák
  • 2,065
  • 10
  • 27