Is there an issue using an imbalanced covariate (not dependent) in logistic regression?

Question

I am investigating data from a randomised controlled trial in which treatment allocation is done on a 2:1 ratio (2 patients on the experimental treatment for every 1 patient on placebo): for example, 400 on experimental treatment and 200 on placebo.

I am conducting some exploratory analysis to investigate if there are any significant interactions between the treatment term (binary covariate) and other selected covariates, when using death (binary dependent variable) as the outcome.

As mentioned, the trial has 2:1 randomisation, so there is an imbalance in the number of patients on the experimental drug and placebo.

My plan is to build a logistic regression model with death as the dependent variable. The model will include a treatment term, other relevant covariates (selected using AIC), and any relevant interactions involving the treatment term.

My question concerns the imbalance in the treatment allocation (note: not the imbalance in the dependent variable, death). Does the fact that the treatment arms are unequal (400 experimental, 200 placebo) have any impact on the conclusions I can draw from the logistic model? I have been led to believe that the treatment imbalance leads to a different variance in each treatment arm (np(1-p)). Is this really a problem? If so, can it be solved?

I have considered using upsampling to balance the treatment arms. Is there anything such as weighted logistic regression or conditional logistic regression that would be suitable?

I realise there has been much discussion surrounding unbalanced datasets. However, that discussion is usually focused on imbalance in the dependent variable; see "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?"

score 2 · Accepted Answer · answered Dec 31 '20 at 04:16

There is nothing wrong with this and nothing you can do about it. Compared to a 400-400 split, you have less precision; compared to a 200-200 split, you have more precision; compared to a 300-300 split, you have less precision. But it doesn't matter: the data you have is the data you have, and no weighting or sampling can change that. You can increase precision by including covariates in the estimation of the effect, using the techniques recommended by Colantuoni and Rosenblum (2015).

It is rarely the case that all binary predictors in an analysis are exactly 50-50 distributed. It's true that, for a fixed total sample size, an even split yields the best precision (all else equal), but that doesn't mean you have to have it that way. If an event is rarer in the experimental group than in the placebo group, then you increase your ability to detect effects by allocating more participants to that group, which was probably the motivation here.
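
As a rough illustration of the precision comparison above, here is a minimal sketch that computes the approximate (Woolf) standard error of the log odds ratio from expected cell counts under each allocation. The death probabilities (0.15 on treatment, 0.20 on placebo) and the function name `log_or_se` are assumptions chosen for illustration only; they do not come from the trial in the question.

```python
import math


def log_or_se(n_treat, n_placebo, p_treat, p_placebo):
    """Approximate SE of the log odds ratio (Woolf's formula).

    Uses expected cell counts of the 2x2 table rather than observed ones,
    so this is a planning-style approximation, not an analysis of real data.
    """
    cells = [
        n_treat * p_treat,            # deaths, experimental arm
        n_treat * (1 - p_treat),      # survivors, experimental arm
        n_placebo * p_placebo,        # deaths, placebo arm
        n_placebo * (1 - p_placebo),  # survivors, placebo arm
    ]
    return math.sqrt(sum(1.0 / c for c in cells))


# Hypothetical death probabilities, assumed purely for illustration.
P_TREAT, P_PLACEBO = 0.15, 0.20

# (experimental, placebo) allocations mentioned in the answer.
splits = [(400, 400), (300, 300), (400, 200), (200, 200)]
se_by_split = {s: log_or_se(s[0], s[1], P_TREAT, P_PLACEBO) for s in splits}

for (n_t, n_p), se in se_by_split.items():
    print(f"{n_t}-{n_p}: approx. SE(log OR) = {se:.3f}")
```

Under these assumed probabilities, the standard error is smallest for 400-400, then 300-300, then the 2:1 split of 400-200, and largest for 200-200. A smaller standard error means more precision, so the ordering matches the one described in the answer.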
