
I have cross-sectional data and am using logistic regression. My question is: how do I check my data for heteroskedasticity, and if it is present, how do I deal with it in Stata?

I have come across a lot of information on linear regression, where the Breusch-Pagan test (using the command hettest) or White's test (using the command imtest) is used to test for heteroskedasticity, and heteroskedasticity is then dealt with by computing robust standard errors. However, there is much less information on this issue for logistic regression.

Juhee Jain

2 Answers


Except in a very technical sense (which @BigBendRegion's answer gets at), heteroskedasticity isn't a "thing" in a logistic regression model.

Heteroskedasticity is when the standard deviation of the errors around the regression line (that is, the average distance between the predicted Y value at a given X value and the actual Y values in your dataset for cases with those X values) gets bigger or smaller as X increases. Now, many people (myself included) would argue that heteroskedasticity isn't even that big of a problem for LINEAR regression, except when it's caused by other, more serious issues (like nonlinearity or omitted variable bias).

But this whole concept doesn't make sense in logit because logit models don't even HAVE error terms, or rather they don't have error terms that come from the data.

To oversimplify greatly, what a logit model actually "does" is run an OLS model on an unobserved latent variable (call it y*) that represents the "propensity" to do whatever it is your binary variable Y is measuring (we assume that people with a y* over some arbitrary threshold get a Y of 1 and everyone else gets a 0). Obviously we don't know what y* looks like, so in order to specify this model we assume that the errors in this OLS model have a logistic distribution (hence the name of the model) with a standard deviation of $\pi/\sqrt{3}$ (the probit model assumes they are normally distributed with a standard deviation of 1). Through some calculus we use this assumption about the distribution of the errors in y* to get us to the logit model of Y itself. This means that the logit model doesn't have an error term, because the distribution of the errors is built into the assumptions of the model itself. So it doesn't make sense to talk about whether the errors get bigger or smaller as X increases, which is what heteroskedasticity is.

Graham Wright
    I disagree that there is no error term. The observed binary response minus the conditional expectation *is* the error term, and its variance is as I stated in my answer. – BigBendRegion Jan 01 '21 at 15:30

With the logistic regression model, heteroscedasticity is automatically assumed to exist. The conditional distribution of $Y$ given $X=x$ is assumed to be Bernoulli with parameter $\pi(x)$, a probability. The variance of this distribution is $\pi(x)\times (1-\pi(x))$, a nonconstant function of $x$. Likewise, you do not need to worry about normality. You still need to consider the linearity (in the logits) and independence assumptions, however.
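The nonconstant variance described above is easy to see numerically: $\pi(x)(1-\pi(x))$ changes with $x$ and is capped at 0.25. A tiny check (the coefficient values are illustrative):

```python
# Conditional variance of a Bernoulli response under a logit model:
# Var(Y | X=x) = pi(x) * (1 - pi(x)), which depends on x.
import numpy as np

def pi(x, b0=-1.0, b1=0.8):
    """Logistic response probability for illustrative coefficients."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

xs = np.array([-2.0, 0.0, 2.0])
var = pi(xs) * (1 - pi(xs))
print(var)  # nonconstant in x; never exceeds 0.25, the value at pi = 0.5
```

So heteroskedasticity is not something to test for here; it is implied by the Bernoulli likelihood itself.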

BigBendRegion
  • Any comments on the part of the question asking about why you would need Robust Standard Errors for logistic regression? – Tripartio Apr 20 '21 at 14:10
  • Why would you need them? The standard errors automatically account for heteroscedasticity correctly. Using robust standard errors would just add noise. – BigBendRegion Apr 21 '21 at 19:11