
I am looking at the derivation of variational inference and specifically the approach taken by Bishop in his book on page 465 as illustrated in the Figure below. The key step is the statement below Equation 10.8 in which he says "... Thus maximizing (10.6) is equivalent to minimising the KL Divergence, ..."

Bishop PRML page 465
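For reference (transcribing from memory of that page, so treat this as a paraphrase rather than a quotation), the decomposition in question is
\begin{equation}
\mathcal{L}(q) = \int q_j(\boldsymbol Z_j) \ln \tilde{p}(\boldsymbol X, \boldsymbol Z_j) \,\mathrm{d}\boldsymbol Z_j - \int q_j(\boldsymbol Z_j) \ln q_j(\boldsymbol Z_j) \,\mathrm{d}\boldsymbol Z_j + \operatorname{const} \text{ - (10.6)}
\end{equation}
which Bishop identifies as a negative Kullback-Leibler divergence between $q_j(\boldsymbol Z_j)$ and $\tilde{p}(\boldsymbol X, \boldsymbol Z_j)$.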

However, in the errata document for the book written by Yousuke Takada, the relevant clip is shown in the Figure below:

Takada PRML Errata p39

We see the statement "However, there is no point in taking the Kullback-Leibler divergence between two probability distributions over different sets of random variables; such a quantity is undefined."

So my questions are: is the statement made by Takada correct, and if so, what is the correct derivation of the variational inference algorithm? Secondly, if the statement is incorrect, what is the expression for the KL divergence of the form presented in Equation (10.6), and how does that form relate to the derivation of the variational inference algorithm?

AJR

1 Answer


Having spent a few days exploring various derivations of variational inference, I thought I could answer my own question, at least to provide an historical record and closure.

I think the answer comes down to differences in notation, which in itself raises a question. Anyway, on to the answer: I believe that both representations are the same, by which I mean the following.

First, equation (10.7) in Bishop is the same as equation (171) in Takada; to put this explicitly we have: \begin{equation} \begin{aligned} \ln \tilde{p}(\boldsymbol X, \boldsymbol Z_j) &= \mathbb{E}_{i \neq j} [ \ln p(\boldsymbol X, \boldsymbol Z)] + \operatorname{const} \text{ - (10.7)} \\ \ln q^*_j(\boldsymbol Z_j) &= \mathbb{E}_{\boldsymbol Z \backslash \boldsymbol Z_j} [ \ln p(\boldsymbol X, \boldsymbol Z)] + \operatorname{const} \text{ - (171)} \end{aligned} \end{equation} The definition of the expectation is the same in both; only the notation differs, with each excluding the $j^{\text{th}}$ component.
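Written out in full (this is how Bishop defines the expectation just below (10.7), and as I read it, Takada's $\mathbb{E}_{\boldsymbol Z \backslash \boldsymbol Z_j}$ denotes the same integral):
\begin{equation}
\mathbb{E}_{i \neq j} [ \ln p(\boldsymbol X, \boldsymbol Z)] = \int \ln p(\boldsymbol X, \boldsymbol Z) \prod_{i \neq j} q_i(\boldsymbol Z_i) \,\mathrm{d}\boldsymbol Z_i
\end{equation}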

With that equivalence stated, equation (10.6) is the same as (168), which can be seen by inserting (10.7) into (10.6). And finally, (10.8) is the same as (170).
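To make the "inserting" step explicit (a sketch in my own notation, with the additive constant absorbing the normalisation of $\tilde{p}$):
\begin{equation}
\mathcal{L}(q) = \int q_j(\boldsymbol Z_j) \ln \frac{\tilde{p}(\boldsymbol X, \boldsymbol Z_j)}{q_j(\boldsymbol Z_j)} \,\mathrm{d}\boldsymbol Z_j + \operatorname{const} = -\operatorname{KL}\!\left( q_j \,\|\, \tilde{p} \right) + \operatorname{const}
\end{equation}
which is the negative KL divergence Bishop refers to below (10.6), and, as far as I can tell, the same quantity that appears in (10.8)/(170).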

So the summary is that both formulations are saying the same thing, and the main point of contention is the use of $\tilde{p}(\boldsymbol X, \boldsymbol Z_j)$ over $q^*_j(\boldsymbol Z_j)$.

This, I guess, is where the notation falls down. In the spirit of the derivations, the expectation implicitly assumes that the observed data $\boldsymbol X$ is held constant. This is recognised in (171) but left "hanging" in (10.7). The observed data is a random variable that is held constant, so it is not clear (to me) what the correct way of representing this should be, nor whether it truly invalidates the definition of the KL divergence.
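To make the notational point concrete (my own sketch, not a resolution of the question): with $\boldsymbol X$ held at its observed value, exponentiating (10.7)/(171) and normalising over $\boldsymbol Z_j$ gives a proper distribution
\begin{equation}
q^*_j(\boldsymbol Z_j) \propto \exp\!\left( \mathbb{E}_{i \neq j} [ \ln p(\boldsymbol X, \boldsymbol Z)] \right)
\end{equation}
so $\operatorname{KL}(q_j \,\|\, q^*_j)$ compares two distributions over the same variable $\boldsymbol Z_j$, whereas Bishop's notation $\tilde{p}(\boldsymbol X, \boldsymbol Z_j)$ leaves the role of $\boldsymbol X$ implicit, which is exactly where the two presentations appear to diverge.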

AJR