I am currently working through Bishop's *Pattern Recognition and Machine Learning*, where the following issue came up.
It is closely related to the unanswered post below, but I wanted to propose a more formal approach: *Confusion about the use of the MLE & the posterior in parameter estimation for logistic regression*.
The confusion arises in Bishop's Chapter 4, where he introduces logistic regression for a two-class problem and estimates the posterior $p(C\mid x)$ by maximum likelihood. Just a few paragraphs earlier he had shown how to compute the likelihood for MLE estimates of the means and variances of two Gaussian class-conditional distributions: there, the product of the joint distribution across all samples is formed and the log-likelihood is then maximized.
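For reference, the generative likelihood maximized there — if I transcribe §4.2.2 correctly, with class prior $p(C_1)=\pi$ and shared covariance $\Sigma$ — is
$$p(\mathbf{t}, \mathbf{X}\mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{i=1}^N \bigl[\pi\,\mathcal{N}(x_i\mid\mu_1,\Sigma)\bigr]^{t_i}\,\bigl[(1-\pi)\,\mathcal{N}(x_i\mid\mu_2,\Sigma)\bigr]^{1-t_i},$$
i.e. a product of joint densities $p(x_i, t_i)$ — which is exactly the piece that seems to be missing in the logistic regression case below.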
When introducing the MLE for the parameters $w$ in the sigmoid $\sigma(w^Tx)$ of logistic regression, however, he appears to take only the product of the posterior probabilities $p(C=t_i \mid x_i)$ (which for exponential-family class-conditionals take the form of sigmoids $\sigma(w^Tx)$) and arrives at the log-likelihood $$\ell(w) = \sum_i t_i \log\sigma(w^Tx_i)+(1-t_i)\log\bigl(1-\sigma(w^Tx_i)\bigr),$$ whose negative is the cross-entropy error function. He then goes on to discuss properties of this function and minimization algorithms.
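For concreteness, a minimal NumPy sketch of that objective as I read it (my own illustration, not Bishop's code; all names are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, t, eps=1e-12):
    """ell(w) = sum_i t_i log s_i + (1 - t_i) log(1 - s_i), with
    s_i = sigmoid(w^T x_i); its negative is the cross-entropy error."""
    s = sigmoid(X @ w)   # s_i = sigma(w^T x_i) for each row x_i of X
    return np.sum(t * np.log(s + eps) + (1 - t) * np.log(1 - s + eps))
```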
Now to my problem: why can he apparently start the logistic regression MLE from the product of posteriors $\prod_i p(C=t_i\mid x_i)$? In the post cited above you can find an (incomplete) motivation for this that I suggested.
Here I want to propose a slightly different approach to an answer and ask for your opinion.
Isn't he actually just parametrizing the posterior with the sigmoid function? If so, a more complete derivation of the logistic regression MLE could read:
\begin{align} \ell(w) &= \log\prod_{i=1}^N p(C=1, x_i)^{t_i} \, p(C=0, x_i)^{1-t_i} \\[8pt] &= \log\prod_{i=1}^N p(C=1\mid x_i)^{t_i} \, p(C=0\mid x_i)^{1-t_i} \, p(x_i) \end{align}
(using $p(C, x) = p(C\mid x)\,p(x)$ and $p(x_i)^{t_i}\,p(x_i)^{1-t_i} = p(x_i)$),
and only then parametrize $p(C=1\mid x_i)=\sigma(w^Tx_i)$ to obtain $$\ell(w) = \sum_i t_i \log\sigma(w^Tx_i)+(1-t_i)\log\bigl(1-\sigma(w^Tx_i)\bigr)+\log p(x_i).$$
Finally, since the marginal $p(x)$ is not parametrized by $w$, the term $\sum_i \log p(x_i)$ is an additive constant and does not influence the location of the maximum w.r.t. $w$.
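If this reasoning is right, it should also be checkable numerically: maximizing the objective with and without a $w$-independent stand-in for $\sum_i \log p(x_i)$ must return the same $w^*$. Continuing the sketch above (again my own toy code; the constant $-42$ is an arbitrary placeholder for $\sum_i \log p(x_i)$):

```python
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                    # toy inputs
t = (X @ np.array([1.5, -2.0]) + rng.normal(size=200) > 0.0).astype(float)

log_px = -42.0   # placeholder for sum_i log p(x_i): any w-independent constant

w0 = np.zeros(2)
w_posterior = minimize(lambda w: -log_likelihood(w, X, t), w0).x
w_joint = minimize(lambda w: -(log_likelihood(w, X, t) + log_px), w0).x
print(np.allclose(w_posterior, w_joint))   # True: the marginal term drops out
```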
Intuitively this seems to make sense: logistic regression, being a probabilistic discriminative model, only yields a (linear) discriminant for the targets and does not provide an estimate of the marginal/unparametrized $p(x)$.
Is this a valid starting point for thinking about this?