
The idea behind logistic regression is to model the posterior probability of class C_k given an observation x with a sigmoid, f(C_k | x) = 1/(1 + exp(-w·x)), and to estimate the weight vector w.

In every book I've read (e.g., Bishop's PRML), f(C_k | x) is treated as a probability density function, but this is definitely not a valid pdf: its integral over x from minus infinity to infinity does not equal 1 (nor could it after any normalization, since the integral is infinite).
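To make the point concrete, here is a minimal numerical sketch (assuming a scalar weight w = 1, purely for illustration): the area under the sigmoid over [-L, L] grows roughly like L, so no normalization constant can make it integrate to 1.

```python
import numpy as np

def sigmoid(a):
    # Numerically stable logistic function 1 / (1 + exp(-a)).
    e = np.exp(-np.abs(a))  # exponent is always <= 0, so no overflow
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

w = 1.0  # assumed scalar weight, for illustration only
for L in (10, 100, 1000):
    x, dx = np.linspace(-L, L, 200_001, retstep=True)
    area = np.sum(sigmoid(w * x)) * dx  # rectangle-rule approximation
    print(f"integral of sigmoid over [-{L}, {L}] ~ {area:.1f}")
# The area grows roughly like L: the sigmoid is not a density in x.
```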

I'd appreciate any explanation on this matter.

Benny K
  • https://stats.stackexchange.com/questions/69820 and https://stats.stackexchange.com/questions/91473 look like they might answer this question. Another approach is to ask the same question for ordinary least squares regression. Now the response density is $$f(y\mid x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(y-\alpha-\beta x)^2\right).$$ Although you *can* integrate this over all $x,$ you usually won't get $1$ as the answer. The problem is that this integral is unrelated to the regression, *because it is taking some kind of average over the regressor $x.$* – whuber Jan 03 '22 at 21:21
  • $f(C_k|X=x)$ is a probability mass function for the classes conditional on $X = x$, so the joint distribution of $(X, C_k)$ would be $f(x)f(C_k|x)$, where $f(x)$ is the pdf or pmf of $X$. If you summed over the classes and integrated (summed) on $X$ the function $f(x)f(C_k|x)$, you should get $1$, is that not correct? (A numerical sketch of this appears after these comments.) In logistic regression, we usually do not care about $f(x)$ since we take $X$ as fixed, so we only model $f(C_k|X)$. – Lucas Prates Jan 03 '22 at 21:44
  • The logistic function that you give is a distribution function, not a density function. The distribution ranges from 0 to 1. The logistic density is the derivative of the function you give. – David Smith Jan 04 '22 at 00:24
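Here is a minimal numerical sketch of Lucas Prates's point (assuming, purely for illustration, that $X$ is standard normal and $w = 1$): summing $f(C_k|x)$ over the two classes and then integrating against $f(x)$ gives 1.

```python
import numpy as np

def sigmoid(a):
    # Numerically stable logistic function 1 / (1 + exp(-a)).
    e = np.exp(-np.abs(a))
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

w = 1.0                                          # assumed scalar weight
x, dx = np.linspace(-10, 10, 100_001, retstep=True)
f_x = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # pdf of X, assumed N(0, 1)
p1 = sigmoid(w * x)                              # p(c=1 | x)
p0 = 1.0 - p1                                    # p(c=0 | x)

# Sum the joint f(x) * p(c | x) over both classes, integrate over x.
total = np.sum(f_x * (p1 + p0)) * dx
print(total)  # ~ 1.0: the joint distribution normalizes over (x, c)
```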

1 Answer


You've got it wrong: there is no integral from -∞ to +∞. It is a discrete distribution, p(c_k | x), and in the case of logistic regression you have two classes, c = 1 and c = 0. The model outputs the probability of belonging to class c = 1; subtracting it from 1 gives the probability of the other class: p(c = 0 | x) = 1 - p(c = 1 | x). Softmax regression extends this to more than two classes by applying the softmax instead of the sigmoid (logistic) function.
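A minimal sketch of this in code (the weight vector, observation, and three-class weights below are made-up values, purely for illustration):

```python
import numpy as np

def sigmoid(a):
    # Numerically stable logistic function 1 / (1 + exp(-a)).
    e = np.exp(-np.abs(a))
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def softmax(scores):
    # Subtracting the max is a standard numerical-stability trick.
    z = np.exp(scores - scores.max())
    return z / z.sum()

w = np.array([0.5, -1.2])          # made-up weight vector
x = np.array([2.0, 0.3])           # made-up observation

p1 = sigmoid(w @ x)                # p(c=1 | x)
p0 = 1.0 - p1                      # p(c=0 | x)
print(p1 + p0)                     # exactly 1: a valid pmf over {0, 1}

# Softmax regression: one weight vector per class, K > 2 classes.
W = np.array([[0.5, -1.2],
              [0.1,  0.4],
              [-0.3, 0.9]])        # made-up weights for K = 3 classes
probs = softmax(W @ x)
print(probs, probs.sum())          # three probabilities summing to 1
```

With K = 2, the softmax of the two scores reduces to a sigmoid of their difference, which is why the binary model is the special case.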

leo
  • This simple explanation is sufficient, I guess, but to make it smoother, see [logistic loss](https://en.wikipedia.org/wiki/Loss_functions_for_classification#Logistic_loss). As @leo pointed out, p(1|x) + p(0|x) always adds up to 1, which is why it forms a valid probability distribution in a classification task. Do not confuse "the logistic function is a pdf over x" (incorrect) with "the logistic model's class probabilities form a valid distribution" (correct). – null Jan 03 '22 at 15:27
  • @leo - even for a discrete distribution, it should sum to one, shouldn't it? – Benny K Jan 03 '22 at 16:00
  • It does sum to one. I think your difficulty is with the notation; it gets easier to read after a while. First ignore the conditioning on $x$: there are two possible options, c=0 and c=1, and they sum to one, p(c=0) + p(c=1) = 1. The fact that you are conditioning on $x$, and how people interpret p(·|·) in each problem, is another story, but conditional probabilities are still probabilities. I think you are confusing yourself about whether the integral is over $x$ or over $c$ here: p(C | X) is a valid probability mass function for C, not for X. – Jan 03 '22 at 17:11