Econometrics: What are the assumptions of logistic regression for causal inference?

Question

I'm trying to understand what the assumptions of logistic regression are when you intend to interpret a parameter as causal. The assumptions for causal OLS regression are well known, but I can't find a good source for the corresponding assumptions for logistic regression.

From what I can find on the internet, I think the following assumptions need to hold:

Errors are distributed according to a logistic distribution and are independent of each other
No multicollinearity

My intuition tells me that the independent variables should not be correlated with the error term (no endogeneity), as in the case of OLS regression, but I can't find support for this anywhere. Does anyone have a mathematical argument for this? That is, where would estimation go wrong?

On the same point: suppose you're interested in the coefficient on X1 as the causal parameter, and X1 is not correlated with the error term but X2 is. If you're not interested in the coefficient on X2 in a causal sense, can you still run this logistic regression and interpret the coefficient on X1 as causal? That is, would the endogeneity of X2 bias the estimate of the coefficient on X1?

Also I read that the errors are not identically distributed but I'm not sure why. Can anyone explain why this is true?

Are there any other assumptions for logistic regression when you want to use it for causal inference?

Are you meaning to say _logistic_ regression? There is no error term in that model. — Frank Harrell, Jul 19 '18 at 10:36
No multicollinearity is not a condition, neither for linear nor for logistic regression. In fact, the existence of multicollinearity is the reason why we add control variables. — Maarten Buis, Jul 19 '18 at 11:39
@FrankHarrell yes, I misspelled logistic! You can think of a logistic regression as $y = 1$ if $x\beta + \epsilon > 0$ and $y = 0$ otherwise, where $\epsilon$ follows the standard logistic distribution. What I meant by the error term is $\epsilon$. — Amazonian, Jul 20 '18 at 00:18
@MaartenBuis actually, if independent variables are perfectly multicollinear then their individual coefficients are not identified. — Amazonian, Jul 20 '18 at 00:21
In my field, the term multicollinearity is used for "non-perfect" multicollinearity, while perfect multicollinearity is used for perfect multicollinearity. Perfect multicollinearity typically means you made a logical error when choosing your explanatory variables. I would consider that the real problem, and perfect multicollinearity just the technical representation of that underlying problem. — Maarten Buis, Jul 20 '18 at 05:33
I don't think that talking about an error term for binary $Y$ is very enlightening. — Frank Harrell, Jul 20 '18 at 11:26

Ben · Accepted Answer · 2021-03-11T21:28:52.470

The capacity to interpret regression relationships as causal generally depends on experimental protocols rather than the assumed structure of the statistical model. Regression models allow us to relate the explanatory variables statistically to the response variable, where this relationship is made conditional on all the explanatory variables in the model. As a default position, that is still just a predictive relationship, and should not be interpreted causally. That is the case in standard linear regression using OLS estimation, and it is also true in logistic regression.

Suppose we want to interpret a regression relationship causally; e.g., we have an explanatory variable $x_k$ and we want to interpret its regression relationship with the response variable $Y$ as a causal relationship (the former causing the latter). The thing we are scared of here is the possibility that the predictive relationship might actually be due to a relationship with some confounding factor, which is an additional variable outside the regression that is statistically related to $x_k$ and is the real cause of $Y$. If such a confounding factor exists, it will induce a statistical relationship between these variables that we will see in our regression. (The other mistake you can make is to condition on a mediator variable, which also leads to an incorrect causal inference.)

So, in order to interpret regression relationships causally, we want to be confident that what we are seeing is not the result of confounding factors outside our analysis. The best way to ensure this is to use controlled experimentation to set $x_k$ via randomisation/blinding, thereby severing any statistical link between this explanatory variable and any would-be confounding factor. In the absence of this, the next best thing is to use uncontrolled analysis, but try to bring in as many possible confounding factors as we can, to filter them out in the regression. (No guarantees that we have found them all!) There are also other methods, such as using instrumental variables, but these generally hinge on strong assumptions about the nature of those variables.
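As a rough illustration of the point above, here is a minimal simulation sketch in Python. Everything in it is an assumption made for the example: the data-generating process, the coefficient values, and the helper names `sigmoid` and `fit_logit`. A confounder `z` drives both the explanatory variable and the response, while the explanatory variable has no effect on the response at all. The sketch fits a logistic regression three ways: omitting the confounder, including it, and using a randomised explanatory variable.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logit(X, y, n_iter=25):
    """Logistic regression MLE via Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                         # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


rng = np.random.default_rng(0)
n = 200_000
ones = np.ones(n)

# Hypothetical data-generating process: z drives both x and y; x has NO effect on y.
z = rng.normal(size=n)
y = rng.binomial(1, sigmoid(-0.5 + 1.5 * z))
x_obs = rng.binomial(1, sigmoid(1.5 * z))  # observational: x depends on z
x_rct = rng.binomial(1, 0.5, size=n)       # randomised: x is independent of z

# Coefficient on x under three analyses (index 1 is the x column).
coef_naive = fit_logit(np.column_stack([ones, x_obs]), y)[1]
coef_adjusted = fit_logit(np.column_stack([ones, x_obs, z]), y)[1]
coef_randomised = fit_logit(np.column_stack([ones, x_rct]), y)[1]

print("confounder omitted:  ", round(coef_naive, 3))
print("confounder included: ", round(coef_adjusted, 3))
print("x randomised:        ", round(coef_randomised, 3))
```

In this setup the first coefficient comes out clearly positive even though the true effect is zero, while the other two are close to zero. The true effect is set to zero on purpose: with a nonzero effect, the odds ratio is non-collapsible, so the coefficient from a model that omits `z` would differ from the conditional one even under randomisation, which would muddy the comparison.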

None of the assumptions you mention are necessary or sufficient to infer causality. Those are just model assumptions for the logistic regression, and if they do not hold you can vary your model accordingly. The main assumption you need for causal inference is to assume that confounding factors are absent. That can be done by using a randomisation/blinding protocol in your experiment, or it can be left as a (hope-and-pray) assumption.

Great answer. I also want to add that confounding is not the only source of a failure to identify causal effects. Conditioning on a mediator or a collider will also prevent the valid interpretation of the parameters of a regression model as causal. — Noah, Jul 23 '18 at 14:08

Graham Wright · Answer 2 · 2021-03-11T15:04:37.723

To add to Ben's great answer, here's a basic example of how a regression model (regardless of its type) might not be able to infer causality even if you think you've addressed every "assumption." Let's say we have a dataset from a survey of a bunch of people at a single time point. We run a logistic regression model with "being depressed" as the dependent variable and "opiate use" as the independent variable. Assume that we've totally accounted for all OTHER variables that might confound this relationship, and that all of the other assumptions of the model are satisfied as well. We find a significant, positive relationship.

Does this mean that opiate use causes depression? Maybe. But it might also mean that depression causes opiate use. Or maybe both are true at the same time (but one effect is stronger than the other). If all of the variables are collected at the same point in time, the model is not going to be able to distinguish between these VERY DIFFERENT causal processes. Only by adjusting our research design (e.g. measuring opiate use in one year and depression in the next year) can we solve this problem. Regression alone can't help us.
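A small simulation sketch can make this concrete. The two "worlds" below, their probabilities, and the helper names are all hypothetical and chosen only for illustration. In world A opiate use raises the odds of depression; in world B depression raises the odds of opiate use and opiate use has no effect on depression. For a single binary predictor, the logistic regression slope equals the sample log odds ratio, so the sketch computes that directly from the 2x2 table.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def log_odds_ratio(x, y):
    """Sample log odds ratio of two binary arrays.

    This equals the fitted slope of a logistic regression of y on x
    (with an intercept) when x is a single binary predictor.
    """
    n11 = float(np.sum((x == 1) & (y == 1)))
    n10 = float(np.sum((x == 1) & (y == 0)))
    n01 = float(np.sum((x == 0) & (y == 1)))
    n00 = float(np.sum((x == 0) & (y == 0)))
    return np.log((n11 * n00) / (n10 * n01))


rng = np.random.default_rng(1)
n = 200_000

# World A (assumed): opiate use raises the odds of depression.
opiate_a = rng.binomial(1, 0.3, size=n)
depressed_a = rng.binomial(1, sigmoid(-1.0 + 1.0 * opiate_a))

# World B (assumed): depression raises the odds of opiate use.
depressed_b = rng.binomial(1, 0.3, size=n)
opiate_b = rng.binomial(1, sigmoid(-1.0 + 1.0 * depressed_b))

# Slope from regressing depression on opiate use in each world.
slope_a = log_odds_ratio(opiate_a, depressed_a)
slope_b = log_odds_ratio(opiate_b, depressed_b)

print("world A slope:", round(slope_a, 3))
print("world B slope:", round(slope_b, 3))
```

Both slopes land near the same positive value because the odds ratio is symmetric in the two variables. The cross-sectional regression therefore cannot tell these two worlds apart.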

I think you are referring to Ben's answer, that Kjetil edited. Great point about staggering the time of the 2 variables. — ColorStatistics, Mar 11 '21 at 14:33

score 0 · Answer 3 · answered Jul 22 '18 at 13:38

Answering your question about non-identically distributed error terms: In logistic regression, the logit of the dependent variable is regressed on the predictors and the errors of this regression are, in fact, identically distributed and follow a logistic distribution. However, when back-transformed to the response scale, the error term can only take two values at each level of the linear predictor:

$$e_i = \begin{cases} 1-\pi_i & \text{if } Y_i = 1 \\ -\pi_i & \text{if } Y_i = 0 \end{cases}$$

Because $e_i = Y_i - \pi_i$ (and $\pi_i$ is constant), the variance of this error term is equal to the variance of the binary variable $Y_i$. The variance of the binary variable $Y_i$ is given by $\sigma^2(Y_i) = \pi_i(1-\pi_i)$ and is non-constant because it is dependent on the mean $\pi_i$.
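The variance formula can be checked numerically. The sketch below is only an illustration with arbitrarily chosen probabilities: for each value of $\pi$ it draws binary outcomes, forms $e = Y - \pi$, and compares the empirical variance with $\pi(1-\pi)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Maps each success probability to (empirical variance, theoretical variance).
results = {}
for pi in (0.1, 0.5, 0.9):
    y = rng.binomial(1, pi, size=n)  # binary response with mean pi
    e = y - pi                       # response-scale error: 1 - pi or -pi
    results[pi] = (e.var(), pi * (1.0 - pi))

for pi, (empirical, theoretical) in results.items():
    print(pi, round(empirical, 4), round(theoretical, 4))
```

The variance is largest at $\pi = 0.5$ and shrinks towards the extremes, so observations with different fitted probabilities have errors with different variances.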

Kutner et al. (2005). Applied Linear Statistical Models (Ch. 14)

So $\pi_i$ is constant yet $\pi_i(1-\pi_i)$ is nonconstant? I think I get what you mean, but the phrasing is a bit unfortunate. — Richard Hardy, Mar 11 '21 at 08:16
