I was reading an answer explaining the justification for using the sigmoid function in logistic regression. The reason given was essentially "when evidence adds up, the odds multiply". This is the exact excerpt:
> For example, there are two Boolean features of an animal: $x_1$ says whether it has a long tail, $x_2$ says whether it is small-sized. The two features are used to predict $y$: whether it is a rat. When either of the features is activated (true), the likelihood ratio is 3. When both are activated, the probability that it is a rat is 9 times the probability that it isn't.
>
> You may notice that independence between different feature effects is assumed.
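To make the numbers concrete, this is my reading of the claim, assuming prior odds of 1 (i.e. the animal is equally likely to be a rat or not before seeing the features):

$$\underbrace{\frac{P(Y=1)}{P(Y=0)}}_{\text{prior odds}\,=\,1} \cdot \underbrace{3}_{x_1} \cdot \underbrace{3}_{x_2} \;=\; 9 \;=\; \frac{P(Y=1|x_1,x_2)}{P(Y=0|x_1,x_2)}$$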
Now I'm struggling with how exactly he arrived at that ratio of 9. Essentially, we are trying to compute:
$\frac{P(Y=1|x_1,x_2)}{P(Y=0|x_1,x_2)}$
How exactly can we transform that into this:
$\frac{P(Y=1|x_1) \cdot P(Y=1|x_2)}{P(Y=0|x_1) \cdot P(Y=0|x_2)}$
And how can we formulate the "independence between feature effects" mathematically?
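For what it's worth, the only candidate formalization I could come up with is naive-Bayes-style conditional independence of the features given the class; this is my guess at what is meant, not something stated in the answer:

$$P(x_1, x_2|Y) = P(x_1|Y) \cdot P(x_2|Y)$$

But even if that is the intended assumption, I don't see how to get from there to the product of posterior ratios above.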
My considerations so far
I tried to reduce the term $P(Y=1|x_1,x_2)$ to something containing the product $P(Y=1|x_1) \cdot P(Y=1|x_2)$, multiplied by a factor that does not depend on $Y$. So essentially I want
$P(A|B,C) = P(A|B) \cdot P(A|C) \cdot T$ with $T$ not containing $A$.
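Expanding the right-hand side of that target with Bayes' rule (definitions only, no extra assumptions) shows what $T$ would have to absorb; note that the prior $P(A)$ enters twice:

$$P(A|B) \cdot P(A|C) = \frac{P(B|A) \cdot P(A)}{P(B)} \cdot \frac{P(C|A) \cdot P(A)}{P(C)} = \frac{P(B|A) \cdot P(C|A) \cdot P(A)^2}{P(B) \cdot P(C)}$$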
But using the basic laws of probability, I only got
$P(A|B,C) = P(A,B|C)\cdot \frac{1}{P(B|C)} = P(B|A,C) \cdot P(A|C) \cdot \frac{1}{P(B|C)}$
Now I'd need to transform $P(B|A,C)$ into $P(A|B)$ times a term not containing $A$, and at that point I am stuck.
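As a sanity check on the claimed ratio itself, I also verified it numerically under the conditional-independence reading above, with a uniform prior $P(Y=1) = 0.5$ and a per-feature likelihood ratio of $0.75/0.25 = 3$ (these concrete numbers are my own choice, not from the answer):

```python
# Sanity check: build a joint distribution P(Y, x1, x2) in which x1 and x2
# are conditionally independent given Y, each with likelihood ratio
# P(x_i=1 | Y=1) / P(x_i=1 | Y=0) = 0.75 / 0.25 = 3, and a uniform prior.

p_y = {1: 0.5, 0: 0.5}             # prior P(Y = y)
p_x1_given_y = {1: 0.75, 0: 0.25}  # P(x_i = 1 | Y = y), same for both features

def p_x(x, y):
    """P(x_i = x | Y = y) for a single Boolean feature."""
    return p_x1_given_y[y] if x == 1 else 1.0 - p_x1_given_y[y]

def joint(y, x1, x2):
    """P(Y = y, x1, x2) under conditional independence given Y."""
    return p_y[y] * p_x(x1, y) * p_x(x2, y)

# Posterior odds P(Y=1 | x1=1, x2=1) / P(Y=0 | x1=1, x2=1);
# the normalizer P(x1=1, x2=1) cancels in the ratio.
odds = joint(1, 1, 1) / joint(0, 1, 1)
print(odds)  # 9.0
```

This prints `9.0`, so the arithmetic of the excerpt checks out under that assumption; what I'm still missing is the algebraic derivation connecting it to the product form above.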