I am currently working through Bishop's *Pattern Recognition and Machine Learning*, where the following issue came up.
It is closely related to the unanswered post below, but I wanted to propose a more formal approach: *Confusion about the use of the MLE & the posterior in parameter estimation for logistic regression*.
The confusion arises in Bishop's Chapter 4, where he introduces logistic regression for a two-class problem and estimates the posterior $p(C\mid x)$ by maximum likelihood. Just a few paragraphs earlier he had shown how to compute the likelihood for MLE estimates of the means and variances of two Gaussian class-conditional distributions: there, the product of the joint distribution across all samples is formed and the log-likelihood is then maximized.
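For reference, the generative likelihood maximized there — if I transcribe §4.2.2 correctly, with class prior $p(C_1)=\pi$ and shared covariance $\Sigma$ — is
$$p(\mathbf{t}, \mathbf{X}\mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{i=1}^N \bigl[\pi\,\mathcal{N}(x_i\mid\mu_1,\Sigma)\bigr]^{t_i}\,\bigl[(1-\pi)\,\mathcal{N}(x_i\mid\mu_2,\Sigma)\bigr]^{1-t_i},$$
i.e. a product of joint densities $p(x_i, t_i)$ — which is exactly the piece that seems to be missing in the logistic regression case below.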
When introducing the MLE for the parameters $w$ in the sigmoid $\sigma(w^Tx)$ of logistic regression, however, he appears to take only the product of the posterior probabilities $p(C=t_i \mid x_i)$ (which for exponential-family class-conditionals take the form of sigmoids $\sigma(w^Tx)$) and arrives at the log-likelihood $$\ell(w) = \sum_i t_i \log\sigma(w^Tx_i)+(1-t_i)\log\bigl(1-\sigma(w^Tx_i)\bigr),$$ whose negative is the cross-entropy error function. He then goes on to discuss properties of this function and minimization algorithms.
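For concreteness, a minimal NumPy sketch of that objective as I read it (my own illustration, not Bishop's code; all names are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, t, eps=1e-12):
    """ell(w) = sum_i t_i log s_i + (1 - t_i) log(1 - s_i), with
    s_i = sigmoid(w^T x_i); its negative is the cross-entropy error."""
    s = sigmoid(X @ w)   # s_i = sigma(w^T x_i) for each row x_i of X
    return np.sum(t * np.log(s + eps) + (1 - t) * np.log(1 - s + eps))
```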
Now to my problem: why can he apparently start the logistic regression MLE from the product of posteriors $\prod_i p(C=t_i\mid x_i)$? In the post cited above you can find an (incomplete) motivation for this that I suggested.
Here I want to propose a slightly different approach to an answer and ask for your opinion.
Isn't he actually just parametrizing the posterior with the sigmoid function? If so, a more complete derivation of the logistic regression MLE could read:
\begin{align} \ell(w) &= \log\prod_{i=1}^N p(C=1, x_i)^{t_i} \, p(C=0, x_i)^{1-t_i} \\[8pt] &= \log\prod_{i=1}^N p(C=1\mid x_i)^{t_i} \, p(C=0\mid x_i)^{1-t_i} \, p(x_i) \end{align}
(using $p(C, x) = p(C\mid x)\,p(x)$ and $p(x_i)^{t_i}\,p(x_i)^{1-t_i} = p(x_i)$),
and only then parametrize $p(C=1\mid x_i)=\sigma(w^Tx_i)$ to obtain $$\ell(w) = \sum_i t_i \log\sigma(w^Tx_i)+(1-t_i)\log\bigl(1-\sigma(w^Tx_i)\bigr)+\log p(x_i).$$
Finally, since the marginal $p(x)$ is not parametrized by $w$, the term $\sum_i \log p(x_i)$ is an additive constant and does not influence the location of the maximum w.r.t. $w$.
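If this reasoning is right, it should also be checkable numerically: maximizing the objective with and without a $w$-independent stand-in for $\sum_i \log p(x_i)$ must return the same $w^*$. Continuing the sketch above (again my own toy code; the constant $-42$ is an arbitrary placeholder for $\sum_i \log p(x_i)$):

```python
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                    # toy inputs
t = (X @ np.array([1.5, -2.0]) + rng.normal(size=200) > 0.0).astype(float)

log_px = -42.0   # placeholder for sum_i log p(x_i): any w-independent constant

w0 = np.zeros(2)
w_posterior = minimize(lambda w: -log_likelihood(w, X, t), w0).x
w_joint = minimize(lambda w: -(log_likelihood(w, X, t) + log_px), w0).x
print(np.allclose(w_posterior, w_joint))   # True: the marginal term drops out
```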
Intuitively this seems to make sense: logistic regression, being a probabilistic discriminative model, only yields a (linear) discriminant for the targets and does not provide an estimate of the marginal/unparametrized $p(x)$.
Is this a valid starting point for thinking about this?