
I was asked by a reviewer to evaluate the robustness of the results of logistic regression, given that estimates can be biased by class imbalance in the outcome.

To give some context, I ran three different models, in each of which the outcome has a prevalence of about 25% (I wouldn't even call that imbalanced). The objective was to analyze the association between the outcome and some covariates, not to predict future observations.

I have seen related questions, but none helped me formulate a response to this argument. Should I rebut it by saying that the outcome is not really imbalanced? Or should I, as @whuber wrote, point out that

Logistic regression tends to work well and give values reasonably close to the correct parameters even when the outcomes are imbalanced.

Any tip or reference is appreciated.

Thank you!

MDSF
  • Which estimates would be biased, and why would "unbalanced" data be a problem? [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa May 25 '21 at 15:11
  • Does this answer your question? [Does an unbalanced sample matter when doing logistic regression?](https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression) – Demetri Pananos May 25 '21 at 16:52
  • @Demetri That's a relevant post, but it focuses on classification rather than "analyzing the association" as stipulated here. – whuber May 25 '21 at 16:56
  • Did the reviewer point to a paper that demonstrates it's an issue? The only class imbalance issue I'm aware of in the statistical literature is poor coverage of standard confidence intervals for probabilities close to 0/1. 0.25 is of course nowhere near 0 – CloseToC May 25 '21 at 21:35
  • @CloseToC, no, he/she did not point to any reference – MDSF May 25 '21 at 21:49

1 Answer


Let's find out.

To begin with, what happens with balanced datasets?

Here is a scatterplot of a dataset of $200$ observations, of which $50\%$ are zeros and the remainder are ones. On it I have graphed the underlying ("true") probabilities and the probabilities as fit with logistic regression. The two graphs agree closely, indicating logistic regression did a good job in this case.

Figure 1

To understand it better, I kept the same $x$ values but regenerated the $y$ values randomly $500$ times. Each fit yielded its estimate of the intercept and slope ($\hat\beta$) in this logistic regression. Here is a scatterplot of those estimates.

Figure 2

The central red triangle plots the true coefficients $(0, -3).$ The ellipses are second-order approximations to this point cloud: one is intended to enclose about half the points and the other about 95% of them. That they do so shows they give a solid indication of how uncertain any given estimate from such a dataset might be: the intercept could be off by about $\pm 0.45$ (the width of the outer ellipse) and the slope could be off by about $\pm 1$ (the height of the outer ellipse). These are margins of error.
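For concreteness, here is one way to read those margins of error off the simulation (a small sketch, not part of the original code): assuming `sim` holds the $2\times 500$ matrix of estimates produced by the code at the end of this answer, run with its first line altered to `p <- 50/100` for this balanced case, the half-extent of the 95% ellipse along coordinate axis $i$ is $\sqrt{\chi^2_{2,\,0.95}\,V_{ii}}.$

V <- cov(t(sim))                 # empirical covariance of the 500 estimates
sqrt(qchisq(0.95, 2) * diag(V))  # half-width and half-height of the outer ellipse
                                 # (should come out near 0.45 and 1, respectively)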

What happens with imbalanced datasets?

Here's a similar setup but with only $5\%$ of the points in one class (give or take a few points, depending on the randomness involved in making these observations):

Figure 3

($5\%$ is truly small: it tells us to expect to see only $10$ or so values in one class with the other $190$ in the other class.)
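To get a feel for how much that count varies from one random dataset to the next, a quick check (a small sketch, not part of the original answer):

n <- 200; p <- 5/100
n * p                                        # expected minority count: 10
qbinom(c(0.025, 0.975), size = n, prob = p)  # central 95% range for the count, roughly 4 to 16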

The fit now visibly departs from the true graph -- but is this evidence of logistic regression failing to be "robust"? Again, we can find out by repeating the process of generating random data and estimating the fit many times. Here is the scatterplot of $500$ estimates.

Figure 4

By and large the estimates stay near the true value of $(-4,-3).$ In this sense, logistic regression looks "robust." (I kept the same slope of $-3$ as before and adjusted the intercept to reduce the rate of the $+1$ observations.)

The margins of error have changed: the semi-axis of the outer ellipse that (sort of) describes the uncertainty in the intercept has grown from $0.45$ to over $4$ while the other semi-axis has shrunk a little from $1$ to about $0.8;$ and the whole picture has tilted.

The ellipses no longer describe the point cloud quite as well as before: now, there is some tendency for logistic regression to estimate extremely negative slopes and intercepts. The tilting indicates noticeable correlation among the estimates: low (negative) intercepts tend to be associated with low negative slopes (which compensate for the small intercepts by predicting some $1$ values near $x=-1.$) But such correlation is to be expected: this looks just like ordinary least squares regression whenever the point of averages of the data is not close to the vertical axis.
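As a rough numerical check (again a sketch, assuming `sim` from the code below, now run with `p <- 5/100`), the semi-axis lengths of the outer ellipse and the tilt can be summarized directly from the simulated estimates:

obj <- eigen(cov(t(sim)))
sqrt(qchisq(0.95, 2) * obj$values)  # semi-axis lengths of the 95% ellipse
cor(t(sim))[1, 2]                   # correlation between intercept and slope estimates
                                    # (the tilt of the point cloud)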

What do these experiments show?

For datasets this size (or larger), at least:

  1. Logistic regression tends to work well and give values reasonably close to the correct parameters even when the outcomes are imbalanced.

  2. Second-order descriptions of the correlation between the parameter estimates (which are routine outputs of logistic regression) don't quite capture the possibility that the estimates could simultaneously be quite far away from the truth.
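A simple diagnostic of point 2 (a sketch, assuming `sim`, `alpha`, and `beta` from the code below) is to check what fraction of the simulated estimates actually falls outside the nominal 95% ellipse:

V <- cov(t(sim))                              # second-moment description of the point cloud
d2 <- mahalanobis(t(sim), c(alpha, beta), V)  # squared distance of each estimate from the truth
mean(d2 > qchisq(0.95, 2))                    # fraction outside the outer ellipse; compare with 0.05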

A meta-conclusion

You can assess the "robustness" (or, more generally, the salient statistical properties) of any procedure, such as logistic regression, by running it repeatedly on data generated according to a known realistic model and tracking the outputs that are important to you.
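For example, here is a minimal sketch of that recipe (not part of the original code; it reuses the same one-predictor setup and simply tracks the average estimate and its spread):

assess <- function(reps = 500, n = 200, alpha = -4, beta = -3) {
  x <- seq(-1, 1, length.out = n)
  prob <- plogis(alpha + beta * x)            # true probabilities under the known model
  est <- replicate(reps, {
    y <- rbinom(n, 1, prob)                   # generate an outcome from the known model
    coef(glm(y ~ x, family = binomial))       # refit and keep the estimates
  })
  rbind(mean = rowMeans(est),                 # average estimate (bias check)
        sd   = apply(est, 1, sd),             # sampling variability
        true = c(alpha, beta))
}
assess()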


This is the R code that produced the figures. For the first two figures, the first line was altered to `p <- 50/100`. Remove the `set.seed` call to generate additional random examples.

Experimenting with simulations like this (extended to more explanatory variables) might persuade you of the utility of a standard rule of thumb:

Let the number of observations in the smaller class guide the complexity of the model.

Whereas in ordinary least squares regression we might be comfortable having ten observations (total) for each explanatory variable, for logistic regression we will want to have ten observations in the smaller class for each explanatory variable.
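As a quick, hypothetical illustration of that rule (`y` here is an invented 0/1 outcome vector, not the questioner's data):

y <- rbinom(400, 1, 0.25)                  # e.g., an outcome with about 25% prevalence
minority <- min(sum(y == 1), sum(y == 0))  # size of the smaller class
floor(minority / 10)                       # rough cap on the number of explanatory variables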

p <- 5/100                     # Proportion of one class
n <- 200                       # Dataset size
x <- seq(-1, 1, length.out=n)  # Explanatory variable
beta <- -3                     # Slope
#
# Find an intercept that yields `p` as the true proportion for these `x`.
#
logistic <- function(z) 1 - 1/(1 + exp(z))
alpha <- uniroot(function(a) mean(logistic(a + beta*x)) - p, c(-5,5))$root
#
# Create and plot a dataset with an expected value of `p`.
#
set.seed(17)
y <- rbinom(n, 1, logistic(alpha + beta*x))
plot(range(x), c(-0.015, 1.015), type="n", bty="n", xlab="x", ylab="y",
     main="Data with True (Solid) and Fitted (Dashed) Probabilities")
curve(logistic(alpha + beta*x), add=TRUE, col="Gray", lwd=2)
rug(x[y==0], side=1, col="Red")
rug(x[y==1], side=3, col="Red")
points(x, y, pch=21, bg="#00000020")
#
# Fit a logistic model.
#
X <- data.frame(x=x, y=y)
fit <- glm(y ~ x, data=X, family="binomial")
summary(fit)
#
# Sketch the fit.
#
b <- coefficients(fit)
curve(logistic(b[1] + b[2]*x), add=TRUE, col="Black", lty=3, lwd=2)
#
# Evaluate the robustness of the fit.
#
sim <- replicate(500, {
  X$y.new <- with(X, rbinom(n, 1, logistic(alpha + beta*x)))
  coefficients(glm(y.new ~ x, data=X, family="binomial"))
})
plot(t(sim), main="Estimated Coefficients", ylab="")
mtext(expression(hat(beta)), side=2, line=2.5, las=2, cex=1.25)
points(alpha, beta, pch=24, bg="#ff0000c0", cex=1.6)
#
# Plot second-moment ellipses.
#
V <- cov(t(sim))
obj <- eigen(V)
a <- seq(0, 2*pi, length.out=361)
for (level in c(.50, .95)) {
  lambda <- sqrt(obj$values) * sqrt(qchisq(level, 2))
  st <- obj$vectors %*% (rbind(cos(a), sin(a)) * lambda) + c(alpha, beta)
  polygon(st[1,], st[2,], col="#ffff0010")
}
whuber
  • thank you for the detailed answer you provided! While I'm looking more at a theoretical discussion about the impact of imbalanced data on regression estimates (I will clarify my question accordingly), I've definitely learned something from your answer. – MDSF May 25 '21 at 18:06
  • 6
    What is not theoretical about this (little) study? The basic difficulty is that "imbalanced" covers a huge amount of ground: how imbalanced? What is the pattern of the imbalance and its relationship with the explanatory variables? How many explanatory variables? Of what types, and how are they inter-related? How many observations are there? *Etc, etc.* If you want to get into such issues, you need to provide the specifics of the analysis this reviewer was commenting on. That's why I offer a small example showing *how* you can assess "robustness." – whuber May 25 '21 at 18:29
  • I guess that the reviewer just wants me to comment/discuss the bias of estimates / how the imbalance in the data affects the results and not to assess specifically their robustness (as you did, for example). That is why I said that I'm looking for a theoretical argument (eg. a book that more or less generically summarizes what you wrote) – MDSF May 25 '21 at 21:48
  • 4
    An argument specific to your data and your analysis might be more powerful and more convincing than an appeal to authority. – whuber May 25 '21 at 22:35