Many articles say that maximum likelihood estimation (MLE) is the same as minimizing cross-entropy.
I tried to prove this myself but failed.
I also found an article on the relationship between maximizing the likelihood and minimizing the cross-entropy that tackles the same question, but I could not follow it.
$\,$
For example, suppose I have data points $X_i\ (i=1,\dots,N)$,
distributed as $X \sim P_{data}(X)$.
I want to approximate $P_{data}(X)$ with a parametric model.
Call this model $P_{model}(X;\theta)$.
$\,$
First, I tried MLE.
$\theta^*=\arg\max_\theta\,\prod_{i=1}^{N}P_{model}(X_i;\theta)$
$\quad\;\,=\arg\max_\theta\,\log\Big(\prod_{i=1}^{N}P_{model}(X_i;\theta)\Big)$
$\quad\;\,=\arg\max_\theta\,\sum_{i=1}^{N}\log P_{model}(X_i;\theta)$
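To make sure I am computing the right thing, here is a small numerical sketch of this objective (a made-up Bernoulli example of my own; none of the names come from the articles):

```python
import numpy as np

# Toy example (my own, not from any article): draw N samples from a
# Bernoulli P_data with true parameter 0.7; the model is Bernoulli(theta).
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=100)  # data points X_i

def log_likelihood(theta, X):
    # sum_i log P_model(X_i; theta) for the Bernoulli model
    return np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
theta_mle = thetas[np.argmax([log_likelihood(t, X) for t in thetas])]
print(theta_mle)  # lands on the grid point nearest the sample mean of X
```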
$\,$
Second, I tried minimizing cross-entropy.
$\theta^*=\arg\min_\theta\,H(P_{data}(X),P_{model}(X;\theta))$
$\quad\;\,=\arg\min_\theta\,E_{X\sim P_{data}(X)}[-\log P_{model}(X;\theta)]$
$\quad\;\,=\arg\max_\theta\,E_{X\sim P_{data}(X)}[\log P_{model}(X;\theta)]$
$\quad\;\,=\arg\max_\theta\,\sum_{x}P_{data}(x)\log P_{model}(x;\theta)$
(in the last step I expanded the expectation as a sum over the possible values $x$, assuming $X$ is discrete)
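Here is the analogous sketch for the cross-entropy objective, under the (unrealistic) assumption that $P_{data}$ is fully known:

```python
import numpy as np

# Same toy Bernoulli setting, but now pretending P_data is known exactly,
# so the weighted sum over the support {0, 1} can be evaluated directly.
p_data = np.array([0.3, 0.7])  # P_data(x) for x = 0, 1

def cross_entropy(theta):
    # H(P_data, P_model) = -sum_x P_data(x) * log P_model(x; theta)
    p_model = np.array([1 - theta, theta])
    return -np.sum(p_data * np.log(p_model))

thetas = np.linspace(0.01, 0.99, 99)
theta_ce = thetas[np.argmin([cross_entropy(t) for t in thetas])]
print(theta_ce)  # 0.7, i.e. the minimizer matches P_data exactly
```

In this toy run both objectives are optimized near $\theta=0.7$, yet the formulas above still look different, which is the core of my confusion.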
$\,$
Here I get a different-looking result.
In the cross-entropy objective, each term is weighted by $P_{data}(x)$, while the MLE sum has no such weight.
Why does this happen?
Also, how can the cross-entropy even be computed, given that we do not know $P_{data}$ in the general case?
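My only guess (not something the articles state explicitly) is that one is supposed to replace $P_{data}$ with the empirical distribution of the sample:

$$\hat{P}_{data}(x)=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[x=X_i]$$

$$H(\hat{P}_{data},P_{model})=-\sum_{x}\hat{P}_{data}(x)\log P_{model}(x;\theta)=-\frac{1}{N}\sum_{i=1}^{N}\log P_{model}(X_i;\theta)$$

Up to the constant factor $1/N$, this is exactly the negative of the log-likelihood sum from the first derivation. Is this the intended connection?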
I'm really curious about this, and any kind explanation will be greatly appreciated.