
Suppose I have a binary variable, say sex, and I want to test whether multiple other variables are associated with it. To do this, I run a logistic regression of the form

$$ \operatorname{logit}(\Pr(\text{sex} = \text{male})) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k $$

Once I do this, I calculate the pseudo-$R^2$, and it is 0.45, meaning that the regression explained 45% of the variance in sex.
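
To make this reproducible, here is a minimal sketch of the kind of fit and pseudo-$R^2$ computation I mean (the data and coefficients below are simulated purely for illustration and are not my actual data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                # three regressors X1..X3
eta = 0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]  # linear predictor
p = 1 / (1 + np.exp(-eta))                 # inverse-logit
y = rng.binomial(1, p)                     # binary outcome, e.g. sex == male

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# McFadden's pseudo-R^2, as reported by statsmodels:
print(model.prsquared)
# The same quantity by hand from the log-likelihoods: 1 - llf / llnull
print(1 - model.llf / model.llnull)
```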

My question is, is it also fair/correct to interpret this as sex explained 45% of the variance in the regressors?

Likewise, if the odds ratio for $X_1$ is 2.0, can one claim that being male increased the odds of $X_1$ occurring by 100%?
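
For concreteness, the usual fitted-direction reading of that odds ratio (my arithmetic, with the notation from the equation above) is

$$ e^{\beta_1} = 2.0 \quad\Longrightarrow\quad \frac{\text{odds}(\text{sex}=\text{male} \mid X_1 = x + 1)}{\text{odds}(\text{sex}=\text{male} \mid X_1 = x)} = 2.0, $$

i.e., a one-unit increase in $X_1$ doubles the odds that sex = male, holding the other regressors fixed. Whether that statement can be flipped around is exactly what I'm asking.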

Basically, can the equal sign in the regression equation truly be treated as an equal sign (i.e., bi-directional equivalence) if I have no claim to directionality?

Carlos Cinelli
JRF1111
  • The linear equation estimated by the model is an equality, but that doesn't apply to the inferences made by the model. – david25272 Aug 21 '17 at 00:58
  • Is there any specific reason you tagged "causality"? – Carlos Cinelli Aug 21 '17 at 06:00
  • 1
    Adding to other answers: we can talk about "explained variance" only when conducting linear regression. For GLMs $R^2$ has nothing to do with explained variance. – Tim Aug 21 '17 at 06:25
  • @carloscinelli I tagged causality because, if inferences are directionally constrained, that is a sort of causal inference. For example, if we can claim that X has an effect on Y but not the inverse, we at least know that Y doesn't cause X. – JRF1111 Aug 21 '17 at 22:49
  • @JRF1111 ok, that opens a whole lot of different answers. Causality is definitely directionally constrained, that is, for causal models the equal sign is not a literal equality, it's more like an assignment operator, and the relationship is asymmetric $y \leftarrow x$. – Carlos Cinelli Aug 21 '17 at 23:19
  • @Tim pseudo R-squares actually do give an indication of how much more variation is explained by a model than randomly guessing (with the null model) 0 vs. 1. The issue is that there isn't one established measure for logistic regression like in OLS. Some use improvement in log likelihood, some use % correctly classified, one uses the correlation between y and yhat & one uses mean difference in predicted probability for y=1 & y=0. I'm actually doing a simulation study evaluating about 20 for logistic regression. I can tell you that they do explain random variation, they just do it differently. – JRF1111 Aug 22 '17 at 00:14
  • No it does not. First, [it can be misleading](https://stats.stackexchange.com/q/3559/35989). Second, $R^2$ simply measures the variance explained (which can itself be misleading), and all the pseudo-$R^2$'s are just approximations of it that require you to make pretty strong assumptions. – Tim Aug 22 '17 at 06:10

1 Answer


You do have a claim to directionality, it is just implicit.

A more explicit statement of the logistic model is:

$$ y \mid x \sim \text{Bernoulli}\left(p = \operatorname{logit}^{-1}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)\right) $$

Here, $\sim$ means "is distributed as".

You are making a distributional statement about $y$ conditional on $x$, and conditioning is not symmetric. Because of the conditioning, the $x$'s are not treated as random by the model, so $y$ cannot be thought of as explaining variance in the $x$'s (at least from the point of view of the model).
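
A small simulation sketch of that generative reading makes the asymmetry visible (the coefficients here are made up for illustration and are not from the question): the $x$'s stay fixed across replications, while only $y$ is redrawn each time.

```python
import numpy as np

rng = np.random.default_rng(1)

beta0, beta = 0.5, np.array([1.2, -0.8, 0.3])  # hypothetical coefficients
x = rng.normal(size=(5, 3))                    # fixed design matrix (5 rows)

# Inverse-logit of the linear predictor gives p = P(y = 1 | x):
p = 1 / (1 + np.exp(-(beta0 + x @ beta)))

# Each replication redraws y from Bernoulli(p); x never changes.
# That is the directionality: the model describes y given x, not x given y.
for rep in range(3):
    y = rng.binomial(1, p)
    print(y)
```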

Matthew Drury
  • Great response. Perhaps this is best left to a new question, but do you know of some methods that allow for binary, categorical, and non-normal variables where the conditioning is symmetric or where the $x$'s are conditioned on $y$? – JRF1111 Aug 20 '17 at 22:44