
In many papers that consider treatments and outcomes, I see tables (usually "table 1") of what might be called nuisance variables (often demographics, sometimes medical conditions) with tests of significance and text such as "the groups were broadly similar, there were no significant differences on XXXXX, see Table". So the clear goal is to show that the groups assigned to different treatments are similar.

However, this seems to me like it could be "accepting the null," and that what we should be doing (or demanding be done) is a test of equivalence.

This could apply to randomized trials or to observational studies. Am I missing something here?

Peter Flom
  • I gather you are referring to 'table 1'. Are you asking about RCTs per se, or also observational studies? – gung - Reinstate Monica Mar 13 '18 at 14:31
  • @gung yes, it's usually Table 1. It could be observational studies or RCTs. I edited my question to reflect your comment. – Peter Flom Mar 13 '18 at 14:53
  • Even if I run the risk of stating the obvious: there are some papers that address this issue (e.g. [de Boer et al. (2015)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4310023/pdf/12966_2015_Article_162.pdf)). I think the consensus is that hypothesis testing should be abandoned in baseline tables. The [CONSORT Statement](http://www.consort-statement.org/) for clinical trials as well as the [STROBE Statement](https://www.strobe-statement.org/index.php?id=strobe-home) for observational studies recommend avoiding hypothesis testing in baseline tables. Whether equivalence tests are better, I don't know. – COOLSerdash Mar 15 '18 at 20:36
  • Whether you test against the null or test for equivalence depends on the motivation and affects the conclusions that can be drawn from the table. Asserting equivalence is a very strong condition, and I suspect not necessary in most cases unless the author wants to draw strong conclusions about the demographics etc. It would be better and more appropriate to have a formalised procedure for quantifying risk of bias based on imbalances in the associated demographics. I've not looked into that but would be interested in others' opinions as to what that may look like. – ReneBt Mar 16 '18 at 09:43

1 Answer


This is a complicated question that touches on several related issues: 1) clearly specifying a hypothesis, 2) understanding what causal mechanisms may underlie a hypothesized effect, and 3) the choice and style of presentation.

You're right that, if we apply sound statistical practice, then to claim that "groups are similar" one would have to perform a test of equivalence. However, tests of equivalence suffer from the same issues as their NHST counterparts: the power is merely a reflection of the sample size and the number of comparisons. We expect differences; their extent and their effect on the main analysis are far more important.
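To make the equivalence-testing idea concrete, here is a minimal sketch of a two-one-sided-tests (TOST) procedure on a simulated baseline variable. The simulated data and the ±2-year equivalence margin are assumptions chosen purely for illustration, not part of the answer, and the same sample-size caveats apply.

```python
# Minimal TOST sketch on a simulated baseline variable (age).
# The +/- 2-year equivalence margin is an arbitrary, illustrative assumption.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)
age_treatment = rng.normal(50, 10, 100)  # simulated ages, treatment arm
age_control = rng.normal(51, 10, 100)    # simulated ages, control arm

# ttost_ind returns the overall p-value and the two one-sided test results
p_value, lower_test, upper_test = ttost_ind(age_treatment, age_control, low=-2, upp=2)
print(f"TOST p-value: {p_value:.3f}")  # small p-value => difference lies within +/- 2 years
```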

When confronted with these situations, baseline comparisons are almost always red herrings. Better methods (of science and statistics) can be applied. I have a few stock concepts/responses that I consider when answering questions like this.

A "total" column is more important than split-by-treatment columns; a discussion is warranted of those values.

In clinical trials, the safety sample is usually analyzed. This is the subset of those who were first approached, then consented, then randomized, and finally exposed to at least one iteration of control or treatment. In that process, we face varying degrees of participation bias.

Probably the most important, and most often omitted, aspect of these studies is presenting Table 1 results in aggregate. This achieves the most important purpose of a Table 1: demonstrating to other investigators how generalizable the study sample is to the broader population in which the results apply.

I find it surprising how fixated investigators, readers, and reviewers are on tangential trends within patient characteristics when there is a complete disregard for the inclusion/exclusion criteria and the generalizability of the sample.

I'm ashamed to say I was an analyst on a trial that overlooked this as an issue. We recruited patients and then, due to logistical issues, we waited nearly a year before implementing the intervention. Not only did the CONSORT diagram show a huge drop between those periods, but the sample shifted: the participants we ended up with were largely unemployed or underemployed, older, and healthier than the people we intended to reach. I had deep concerns about the generalizability of the study, but it was difficult to lobby for those concerns to be made known.

The power and Type I error of tests to detect imbalance in baseline characteristics depend on the actual number of characteristics

The point of presenting such a detailed listing of baseline variables, as mentioned previously, is to give a thorough snapshot of the sample: their medical history, labs, medications, and demographics. These are all aspects that clinicians use to recommend treatment to patients, and they are all believed to predict the outcome. But the number of such factors is staggering: as many as 30 different variables may be compared. The crude risk of at least one Type I error is 1 − (1 − 0.05)^30 ≈ 0.79. Bonferroni or permutation corrections are advisable if testing must be performed.
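A quick sketch of that family-wise error calculation and the corresponding Bonferroni correction (the values mirror the ones quoted above):

```python
# Family-wise error rate for 30 independent baseline comparisons at alpha = 0.05
alpha, k = 0.05, 30
fwer = 1 - (1 - alpha) ** k
print(f"Crude risk of at least one Type I error: {fwer:.2f}")  # ~0.79

# Bonferroni correction: test each comparison at alpha / k instead
per_test_alpha = alpha / k
fwer_bonferroni = 1 - (1 - per_test_alpha) ** k
print(f"Per-test alpha: {per_test_alpha:.4f}, corrected FWER: {fwer_bonferroni:.3f}")  # ~0.049
```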

Statistical testing in its purest form is meant to be impartial, and it is supposed to be prespecified. However, the choice and presentation of baseline characteristics is often made after looking at the data. I feel the latter approach is appropriate: if we find, as in my trial, that there are interesting traits that describe the sample effectively, we should have the liberty to present those values ad hoc. Testing can be performed if it is of any value, but the usual caveats apply: these are not the hypotheses of interest, there is a high risk of confusion as to what significant and non-significant results imply, and the results are more a reflection of sample size and presentation considerations than of any truth.

Rerandomization can be done, but only before patients are exposed to treatment

As I mentioned, the analyzed sample is typically the safety sample. However, rerandomization is a heavily advocated and theoretically consistent approach for patients who have not yet been exposed to study treatment. It only applies to settings in which batch enrollment is performed. Here, say, 100 participants are recruited and randomized; if chance assigns a high proportion of older people to one group, the sample can be rerandomized to balance age. This cannot be done with sequential or staggered enrollment, which is the setting in which most trials are conducted, because timing of enrollment tends to predict patient status by way of prevalent-case "bias" (confusing incident and prevalent eligibility criteria).
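A minimal sketch of what batch rerandomization might look like, assuming a single covariate (age) and an arbitrary balance threshold of 0.1 on the absolute standardized mean difference; a real procedure and criterion would be prespecified in the protocol:

```python
# Batch rerandomization sketch: re-draw the allocation, before any exposure to
# treatment, until a prespecified balance criterion is met.
# The single covariate (age) and the 0.1 SMD threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
age = rng.normal(60, 12, 100)  # baseline ages of 100 recruited participants

def standardized_mean_difference(x, group):
    a, b = x[group == 1], x[group == 0]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

allocation = rng.permutation(np.repeat([0, 1], 50))
while abs(standardized_mean_difference(age, allocation)) > 0.1:
    allocation = rng.permutation(np.repeat([0, 1], 50))  # rerandomize the whole batch
```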

Balanced design is not a requirement for valid inference

The randomization assumption says that, theoretically, the treatment groups will have, on average, equal distributions of covariates. However, as mentioned earlier, when comparing 30 or more covariates, the cumulative probability of some imbalance is non-negligible. In fact, imbalance in individual covariates may be irrelevant when considering the whole.

If the randomization is fair, we may see that age is elevated in the treatment group while smoking is elevated in the control group, both of which contribute individually to the risk of the outcome. What is needed for efficient and valid inference is that the propensity score is balanced between groups. This is a much weaker condition. Unfortunately, propensity cannot be inspected for balance without a risk model. However, it is easy to see that such propensity depends on a combination of covariates, and an imbalance in propensities in a randomized sample is far less probable, even though this is impossible to show exactly.
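One hedged way to illustrate the idea: fit a model for the probability of treatment given several covariates and compare that combined score across arms. The covariates, the logistic model, and the simulated data below are assumptions for illustration, not a prescription from this answer.

```python
# Sketch: individual covariates may look imbalanced, but a score that combines
# them (here, predicted probability of treatment) is typically well balanced
# under fair randomization. Simulated data and model choice are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 200
covariates = np.column_stack([
    rng.normal(50, 10, n),    # age
    rng.binomial(1, 0.3, n),  # smoking
])
treatment = rng.binomial(1, 0.5, n)  # randomized assignment

score = LogisticRegression().fit(covariates, treatment).predict_proba(covariates)[:, 1]
print(score[treatment == 1].mean(), score[treatment == 0].mean())  # nearly identical
```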

If a risk model is known, or strong predictors of the outcome are present, more efficient and valid RCTs are done by simply adjusting for those factors regardless of whether they're balanced between treatment groups

One of my favorite papers, 7 myths of randomized controlled trials, discusses this. Adjustment improves efficiency when the adjustment variable is strongly predictive of the outcome. It turns out that even with perfect 50/50 balance, whether achieved by, say, blocked randomization or merely as a coincidence of how the randomization fell out, adjustment will shrink the CIs, requiring fewer participants for an equally powered study; this reduces costs and risks. It is shocking that this isn't done more often.
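A small simulation of that efficiency point, assuming an invented continuous outcome and a strong baseline predictor; the effect sizes and model are illustrative only:

```python
# Sketch: adjusting for a strong predictor of the outcome shrinks the standard
# error of the treatment effect, even with perfect 50/50 balance.
# All data and coefficients below are simulated/invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
treatment = np.repeat([0, 1], n // 2)                 # perfect 50/50 allocation
baseline_risk = rng.normal(0, 1, n)                   # strong predictor of the outcome
outcome = 0.5 * treatment + 2.0 * baseline_risk + rng.normal(0, 1, n)

unadjusted = sm.OLS(outcome, sm.add_constant(treatment)).fit()
adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([treatment, baseline_risk]))).fit()
print(unadjusted.bse[1], adjusted.bse[1])             # adjusted SE is markedly smaller
```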

Observational studies require control for confounding regardless of what Table 1 shows

The randomization assumption eliminates confounding. With nonrandomized treatment, there is confounding. A confounder is a variable that is causal of the outcome and predicts receipt of the quasi-experimental treatment. There is no test to determine which variable(s) is/are confounders. The risk of peeking into the data to answer these questions is that confounders are virtually indistinguishable from mediators or colliders without utterly perfect measurement of longitudinal values (and even then...). Adjusting for mediators attenuates any effect, and adjusting for colliders can cause any type of bias. Further, one need not adjust for the total set of confounders; rather, the adjustment set must satisfy the backdoor criterion (block all backdoor paths).

For instance, in a study of lung function and smoking in adolescents: older kids are more likely to smoke, but since they are taller, their lung function is greater. It turns out that adjusting for height alone suffices to remove confounding, since it satisfies the backdoor criterion. Further adjustment for age simply loses efficiency. However, merely inspecting the "balance" of a Table 1 in smokers and non-smokers would suggest that both age and height are "imbalanced" and thus should be controlled for. That is incorrect.
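A hedged simulation of that example, with an invented causal structure in which age affects lung function only through height (so that height satisfies the backdoor criterion); all coefficients are made up for illustration:

```python
# Sketch: age -> height -> lung function, age -> smoking, smoking -> lung function.
# Adjusting for height alone recovers the true smoking effect (-0.5 here);
# the crude estimate is confounded. All numbers are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 5000
age = rng.uniform(10, 18, n)
height = 100 + 4 * age + rng.normal(0, 5, n)
smoking = rng.binomial(1, 1 / (1 + np.exp(-(age - 14))))  # older kids smoke more
lung_function = 0.05 * height - 0.5 * smoking + rng.normal(0, 1, n)

crude = sm.OLS(lung_function, sm.add_constant(smoking)).fit()
height_adjusted = sm.OLS(lung_function, sm.add_constant(np.column_stack([smoking, height]))).fit()
print(crude.params[1], height_adjusted.params[1])  # confounded vs. roughly -0.5
```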

AdamO
  • I agree with this and am well aware of the problems with p-values. (You will find few people on this site who are more anti-p-value than I am.) And I'm all for better methods, some of which you raise. Of course, some variables could be suppressors (so that including them increases the size of the main effect). However, if I am, say, reviewing a paper for a journal, do you think recommending equivalence tests for table 1 is good, or would you go for your full answer here? – Peter Flom Mar 14 '18 at 11:46
  • @PeterFlom I see the context a bit better now. As a statistical reviewer, I would consider whether the comment is relevant to the subsequent analyses. If it is not relevant, I would encourage them to strike that comment out as it's not useful. If it is relevant, I would encourage them to a) consider a more robust analysis approach or b) use sensitivity analyses to determine whether a possible influence is there. The balance of covariates only matters insofar as it influences analyses, so that's where I would prefer the attention be given. It isn't a propensity-matched design, perhaps, is it? – AdamO Mar 14 '18 at 13:38
  • @PeterFlom As a reviewer, wouldn't it make sense to recommend getting rid of p-values in "Table 1" altogether? – amoeba Mar 14 '18 at 15:00
  • AdamO, great answer (+1), but I am a bit concerned by the recommendation that multiple testing adjustments are "advisable" in the context of "Table 1". Is Type I error of any concern here? I feel that in this case, Type II error is actually much more important (one wouldn't want to miss the fact that some baseline variable differs between the treatment and the control groups). Using Bonferroni, Type II error will greatly increase. This is related to @Peter's point about tests of equivalence: in a sense, Type I and Type II exchange places if you switch to the "equivalence" viewpoint. – amoeba Mar 14 '18 at 15:03
  • @amoeba Absolutely. If we insist on this approach (not my recommendation), NHSTs require that we control Type I error. I think my point is that we should control FWER because we don't care which variable is imbalanced. It can be set to a generous value like 0.2. I'm not aware of any equivalence test for which the power goes *up* as the sample size increases, so justifications for such tests are wordy, subjective, and imprecise. – AdamO Mar 14 '18 at 15:06
  • @amoeba Yes, it would. But ... it's hard to educate editors (at least, some editors). I think effect size is much more relevant here (and other places too). – Peter Flom Mar 14 '18 at 21:55