
I am reading a paper on quantum ML, *A generative modeling approach for benchmarking and training shallow quantum circuits*, where it is claimed that:

> Following a standard approach from generative machine learning [39], we can minimize the Kullback–Leibler (KL) divergence [40] $D_{KL}[P_D|P_\theta]$ from the circuit probability distribution in the computational basis $P_\theta$ to the target probability distribution $P_D$. Minimization of this quantity is directly related to the minimization of a well-known cost function: the negative log-likelihood $C(\theta) = -\frac{1}{D} \sum^D_{d=1} \ln(P_\theta(x^{(d)}))$.

For reference, the training set $D = (x^{(1)}, \dots, x^{(D)})$ consists of i.i.d. samples, where each $x^{(d)} \in \{-1,+1\}^N$ is an $N$-dimensional binary vector, $\theta$ denotes the parameters of the model (think neural-network weights), and $P_{\theta}$ is the model distribution (think neural network). Ref. [39] is Goodfellow's book.

I cannot figure out how they arrive at the negative log-likelihood formula for the generative model. Using the definitions, I arrive at the following for the KL divergence:

$$ D_{KL}[P_D|P_\theta] = \sum^D_{d=1} P_D(x^{(d)}) \log{\frac{P_D(x^{(d)})}{P_\theta(x^{(d)})}} $$

The following for entropy:

$$ H(P_D) = - \sum^D_{d=1} P_D(x^{(d)}) \log{P_D(x^{(d)})} $$

And the following for the cross entropy:

$$ H(P_D, P_\theta) = H(P_D) + D_{KL}[P_D|P_\theta] = - \sum^D_{d=1} P_D(x^{(d)}) \log{P_\theta(x^{(d)})} $$

Since this is not a classification task and we have $P_D(x^{(d)})$, I fail to see how we could arrive at the negative log-likelihood from the cross-entropy.
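For concreteness, here is a small numerical sanity check of the entropy/cross-entropy/KL relation above, treating $P_D$ and $P_\theta$ as full distributions over all $2^N$ bit-strings (the distributions below are made up for illustration):

```python
# Sanity check (made-up numbers): H(P_D, P_theta) = H(P_D) + D_KL[P_D || P_theta]
# when P_D and P_theta are full distributions over all 2^N bit-strings.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 3
states = list(itertools.product([-1, +1], repeat=N))  # all 2^N bit-strings

# Two arbitrary normalized distributions standing in for P_D and P_theta.
p_d = rng.random(len(states)); p_d /= p_d.sum()
p_theta = rng.random(len(states)); p_theta /= p_theta.sum()

kl = np.sum(p_d * np.log(p_d / p_theta))           # D_KL[P_D || P_theta]
entropy = -np.sum(p_d * np.log(p_d))               # H(P_D)
cross_entropy = -np.sum(p_d * np.log(p_theta))     # H(P_D, P_theta)

assert np.isclose(cross_entropy, entropy + kl)
```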

karolyzz
  • Related questions [here](https://stats.stackexchange.com/questions/141009/kl-divergence-as-a-negative-log-likelihood-for-exponential-families) and [here](https://stats.stackexchange.com/questions/503914/kl-divergence-log-likelihood-ratio-negative-value) and [here](https://stats.stackexchange.com/questions/468818/machine-learning-negative-log-likelihood-vs-cross-entropy) – msuzen Dec 31 '21 at 03:27

2 Answers


The connection you are looking for is shown below:

$$ \begin{aligned} \theta_{\min \mathrm{KL}} &= \arg\min_{\theta} D_{KL}\left(P_D\,||\,P_\theta\right)\\ &= \arg\min_{\theta} \Big(\mathbb{E}_{x\sim P_D}[\log P_D(x)] - \mathbb{E}_{x\sim P_D}[\log P_\theta(x)]\Big)\\ &= \arg\min_{\theta}~~-\mathbb{E}_{x\sim P_D}[\log P_\theta(x)]\quad\text{since the first term does not depend on}~\theta\\ \end{aligned} $$

The last expression above is the negative log-likelihood, once the expectation over $P_D$ is replaced by the empirical average over the training set.
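As a quick numerical illustration (a minimal sketch with a made-up two-outcome model, not taken from the paper), the KL divergence and the expected negative log-likelihood differ only by the $\theta$-independent entropy $H(P_D)$, so they share the same minimizer:

```python
# Minimal sketch: KL(P_D || P_theta) and -E_{P_D}[log P_theta] differ by the
# constant H(P_D), hence they are minimized by the same theta.
# The two-outcome model family below is made up purely for illustration.
import numpy as np

p_d = np.array([0.7, 0.3])                      # target distribution P_D

def p_theta(theta):
    return np.array([theta, 1.0 - theta])       # toy one-parameter model

thetas = np.linspace(0.01, 0.99, 99)
kl = np.array([np.sum(p_d * np.log(p_d / p_theta(t))) for t in thetas])
nll = np.array([-np.sum(p_d * np.log(p_theta(t))) for t in thetas])

entropy = -np.sum(p_d * np.log(p_d))
assert np.allclose(nll, kl + entropy)           # shifted by a theta-independent constant
assert np.argmin(kl) == np.argmin(nll)          # same minimizer
print(thetas[np.argmin(nll)])                   # ~0.7, i.e. P_theta matches P_D
```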

KU99
  • Hm, sorry, I still don't understand how to get from the last expression to the formula for $C(\theta)$ given in the quote. – karolyzz Dec 31 '21 at 12:06
  • @karolyzz the expectation is estimated by the sample average over the training set, so $\mathbb{E}_{x\sim P_D}[\log P_\theta(x)] \approx \frac{1}{D}\sum_{d=1}^D\log P_\theta(x^{(d)})$ – KU99 Dec 31 '21 at 17:49

I realised my confusion stemmed from notation, and my formulas above were wrong (specifically, the sums over $d$).

Note that we are dealing with a training dataset $D$ whose empirical distribution we assume approximates the target probability distribution $P_D$. Therefore, there may exist some $d_1, d_2$ s.t. $x^{(d_1)} = x^{(d_2)}$ (in other words, some $x$ may appear multiple times in the dataset). Let $c_x$ denote the number of times a specific $x$ appears in the dataset $D$. Following the notation in the paper:

$$ - \frac{1}{D} \sum^D_{d=1} \ln(P_\theta(x^{(d)})) = -\sum_{x} \frac{c_x}{D} \ln(P_\theta(x)) \approx -\sum_{x} P_D(x) \ln(P_\theta(x)) = -\mathbb{E}_{x \sim P_D} [\ln(P_\theta(x))] = H(P_D, P_\theta) $$
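As a quick numerical check of this chain (target and model distributions below are made up), the dataset average of $\ln P_\theta$ equals the count-weighted sum exactly, and approaches the cross-entropy as $c_x/D \to P_D(x)$:

```python
# Check (made-up distributions): -(1/D) sum_d ln P_theta(x^(d)) equals the
# count-weighted sum -sum_x (c_x/D) ln P_theta(x), and approximates the
# cross-entropy H(P_D, P_theta) because c_x/D ~ P_D(x) for an i.i.d. sample.
from collections import Counter
import numpy as np

rng = np.random.default_rng(1)
outcomes = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]      # N = 2 bit-strings
p_d = np.array([0.4, 0.3, 0.2, 0.1])                     # target distribution P_D
p_theta = dict(zip(outcomes, [0.3, 0.3, 0.2, 0.2]))      # toy model distribution

# Draw an i.i.d. dataset of size D from P_D, with repeated samples.
D = 10_000
dataset = [outcomes[i] for i in rng.choice(len(outcomes), size=D, p=p_d)]

nll = -np.mean([np.log(p_theta[x]) for x in dataset])    # -(1/D) sum_d ln P_theta(x^(d))
counts = Counter(dataset)                                # c_x
weighted = -sum(c / D * np.log(p_theta[x]) for x, c in counts.items())
cross_entropy = -np.sum(p_d * np.log([p_theta[x] for x in outcomes]))

assert np.isclose(nll, weighted)                         # exact identity
print(nll, cross_entropy)                                # close for large D
```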

karolyzz