I was reading an answer explaining the justification for using the sigmoid function in logistic regression. The reason given was essentially "when evidence adds up, the odds multiply". This is the exact excerpt:
> For example, there are two Boolean features of an animal: $x_1$ says whether it has a long tail, $x_2$ says whether it is small-sized. The two features are used to predict $y$: whether it is a rat. When either of the features is activated (true), the likelihood ratio is 3. When both are activated, the probability that it is a rat is 9 times the probability that it isn't.
>
> You may notice that independence between different feature effects is assumed.
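To make the numbers concrete, this is my reading of the claim, assuming prior odds of 1 (i.e. the animal is equally likely to be a rat or not before seeing the features):

$$\underbrace{\frac{P(Y=1)}{P(Y=0)}}_{\text{prior odds}\,=\,1} \cdot \underbrace{3}_{x_1} \cdot \underbrace{3}_{x_2} \;=\; 9 \;=\; \frac{P(Y=1|x_1,x_2)}{P(Y=0|x_1,x_2)}$$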
Now I'm struggling with how exactly he arrived at that ratio of 9. Essentially, we are trying to compute:
$\frac{P(Y=1|x_1,x_2)}{P(Y=0|x_1,x_2)}$
How exactly can we transform that into this:
$\frac{P(Y=1|x_1) \cdot P(Y=1|x_2)}{P(Y=0|x_1) \cdot P(Y=0|x_2)}$
And how can we formulate the "independence between feature effects" mathematically?
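For what it's worth, the only candidate formalization I could come up with is naive-Bayes-style conditional independence of the features given the class; this is my guess at what is meant, not something stated in the answer:

$$P(x_1, x_2|Y) = P(x_1|Y) \cdot P(x_2|Y)$$

But even if that is the intended assumption, I don't see how to get from there to the product of posterior ratios above.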
My considerations so far
I tried to reduce the term $P(Y=1|x_1,x_2)$ to something containing the product $P(Y=1|x_1) \cdot P(Y=1|x_2)$, multiplied by a factor that does not depend on $Y$. So essentially I want
$P(A|B,C) = P(A|B) \cdot P(A|C) \cdot T$ with $T$ not containing $A$.
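Expanding the right-hand side of that target with Bayes' rule (definitions only, no extra assumptions) shows what $T$ would have to absorb; note that the prior $P(A)$ enters twice:

$$P(A|B) \cdot P(A|C) = \frac{P(B|A) \cdot P(A)}{P(B)} \cdot \frac{P(C|A) \cdot P(A)}{P(C)} = \frac{P(B|A) \cdot P(C|A) \cdot P(A)^2}{P(B) \cdot P(C)}$$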
But using the basic laws of probability, I only got
$P(A|B,C) = P(A,B|C)\cdot \frac{1}{P(B|C)} = P(B|A,C) \cdot P(A|C) \cdot \frac{1}{P(B|C)}$
Now I'd need to transform $P(B|A,C)$ into $P(A|B)$ times a term not containing $A$, and at that point I am stuck.
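As a sanity check on the claimed ratio itself, I also verified it numerically under the conditional-independence reading above, with a uniform prior $P(Y=1) = 0.5$ and a per-feature likelihood ratio of $0.75/0.25 = 3$ (these concrete numbers are my own choice, not from the answer):

```python
# Sanity check: build a joint distribution P(Y, x1, x2) in which x1 and x2
# are conditionally independent given Y, each with likelihood ratio
# P(x_i=1 | Y=1) / P(x_i=1 | Y=0) = 0.75 / 0.25 = 3, and a uniform prior.

p_y = {1: 0.5, 0: 0.5}             # prior P(Y = y)
p_x1_given_y = {1: 0.75, 0: 0.25}  # P(x_i = 1 | Y = y), same for both features

def p_x(x, y):
    """P(x_i = x | Y = y) for a single Boolean feature."""
    return p_x1_given_y[y] if x == 1 else 1.0 - p_x1_given_y[y]

def joint(y, x1, x2):
    """P(Y = y, x1, x2) under conditional independence given Y."""
    return p_y[y] * p_x(x1, y) * p_x(x2, y)

# Posterior odds P(Y=1 | x1=1, x2=1) / P(Y=0 | x1=1, x2=1);
# the normalizer P(x1=1, x2=1) cancels in the ratio.
odds = joint(1, 1, 1) / joint(0, 1, 1)
print(odds)  # 9.0
```

This prints `9.0`, so the arithmetic of the excerpt checks out under that assumption; what I'm still missing is the algebraic derivation connecting it to the product form above.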