I am reading a paper on quantum ML, "A generative modeling approach for benchmarking and training shallow quantum circuits", which claims the following:
Following a standard approach from generative machine learning [39], we can minimize the Kullback-Leibler (KL) divergence [40] $D_{KL}[P_D|P_\theta]$ from the circuit probability distribution in the computational basis $P_\theta$ to the target probability distribution $P_D$. Minimization of this quantity is directly related to the minimization of a well-known cost function: the negative log-likelihood $C(\theta) = -\frac{1}{D} \sum^D_{d=1} \ln P_\theta(x^{(d)})$.
For reference, the training set $D = (x^{(1)}, \cdots, x^{(D)})$ consists of i.i.d. samples, where each $x^{(d)} \in \{-1,+1\}^N$ is an $N$-dimensional binary vector, $\theta$ denotes the parameters of the model (think neural network weights), and $P_{\theta}$ is the model distribution (think neural network). Ref. [39] is Goodfellow's book.
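To make the cost concrete, here is a minimal Python sketch of how I read $C(\theta)$. The lookup table `p_theta` and the toy target used to draw the samples are stand-ins of my own, not anything from the paper:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
N = 3

# All 2^N computational-basis outcomes, encoded as vectors in {-1, +1}^N.
states = list(itertools.product([-1, +1], repeat=N))

# Toy stand-in for the circuit distribution P_theta: a normalized lookup table.
weights = rng.random(len(states))
p_theta = dict(zip(states, weights / weights.sum()))

# Toy target distribution P_D, used here only to draw the i.i.d. training set.
target = rng.random(len(states))
target /= target.sum()

# Training set D = (x^(1), ..., x^(D)) with D = 100 samples.
D = 100
dataset = [states[i] for i in rng.choice(len(states), size=D, p=target)]

# Negative log-likelihood C(theta) = -(1/D) * sum_d ln P_theta(x^(d)).
nll = -np.mean([np.log(p_theta[x]) for x in dataset])
print(nll)
```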
I cannot figure out how they arrive at the negative log-likelihood formula for a generative model. Using the definitions, I arrive at the following for the KL divergence:
$$ D_{KL}[P_D|P_\theta] = \sum^D_{d=1} P_D(x^{(d)}) \log{\frac{P_D(x^{(d)})}{P_\theta(x^{(d)})}} $$
The following for entropy:
$$ H(P_D) = - \sum^D_{d=1} P_D(x^{(d)}) \log{P_D(x^{(d)})} $$
And the following for the cross entropy:
$$ H(P_D, P_\theta) = H(P_D) + D_{KL}[P_D|P_\theta] = - \sum^D_{d=1} P_D(x^{(d)}) \log{P_\theta(x^{(d)})} $$
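Numerically, the decomposition itself checks out. Here is a small sketch of my own, assuming both distributions are given as explicit probability tables over the same set of outcomes:

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
N = 3
states = list(itertools.product([-1, +1], repeat=N))

# Two toy probability tables standing in for P_D and P_theta.
p_d = rng.random(len(states))
p_d /= p_d.sum()
p_t = rng.random(len(states))
p_t /= p_t.sum()

kl = np.sum(p_d * np.log(p_d / p_t))        # D_KL[P_D | P_theta]
entropy = -np.sum(p_d * np.log(p_d))        # H(P_D)
cross_entropy = -np.sum(p_d * np.log(p_t))  # H(P_D, P_theta)

# The identity H(P_D, P_theta) = H(P_D) + D_KL holds term by term.
print(np.isclose(cross_entropy, entropy + kl))  # True
```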
Since this is not a classification task and the factors $P_D(x^{(d)})$ remain in the expression, I fail to see how we could arrive at the negative log-likelihood from the cross entropy.