Suppose I have a probabilistic graphical model shown in the picture, in which all variables are binary, $c_1$ and $c_2$ are observed, and I want to use mean-field variational inference to estimate beliefs about the remaining variables. Suppose further - and this is crucial to the problem - that the two $b$'s are constrained to be identical to $a$: if $a$ is true, both $b$'s are true as well, and if $a$ is false, both $b$'s are also false. That is, $p(b_i|a)=[a=b_i]$.
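To make this concrete, here is a small numeric sketch of the model (the prior $p(a)$, the likelihoods $p(c_i|b_i)$, and the observed values are placeholder numbers I made up; only the deterministic link $p(b_i|a)=[a=b_i]$ is actually part of the problem). It computes the exact posterior by enumerating the eight joint states:

```python
import itertools

# Placeholder CPTs -- these numbers are made up for illustration only.
p_a = {0: 0.5, 1: 0.5}                       # prior p(a)
p_c_given_b = {(0, 0): 0.6, (0, 1): 0.3,     # p(c | b), keyed as (c, b)
               (1, 0): 0.4, (1, 1): 0.7}

c1, c2 = 1, 1                                # assumed observations

def joint(a, b1, b2):
    """Unnormalized p(a, b1, b2, c1, c2)."""
    det = 1.0 if (b1 == a and b2 == a) else 0.0   # p(b1, b2 | a) = [a = b1][a = b2]
    return p_a[a] * det * p_c_given_b[(c1, b1)] * p_c_given_b[(c2, b2)]

# Exact posterior p(a, b1, b2 | c1, c2) by brute-force enumeration.
states = list(itertools.product([0, 1], repeat=3))
unnorm = [joint(a, b1, b2) for a, b1, b2 in states]
Z = sum(unnorm)                               # = p(c1, c2)
posterior = {s: u / Z for s, u in zip(states, unnorm)}

for (a, b1, b2), p in posterior.items():
    print(f"p(a={a}, b1={b1}, b2={b2} | c1, c2) = {p:.3f}")
# Only the two all-equal states get nonzero mass, but neither gets all of it.
```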
The full posterior of the unobserved given the observed variables is given by $$ p(a,b_1,b_2|c_1,c_2) = \frac{p(c_1|b_1)\,p(c_2|b_2)\,p(b_1,b_2|a)\,p(a)}{p(c_1,c_2)} $$ We want to approximate this joint posterior with a factorized distribution $q(a,b_1,b_2)=q(a)q(b_1)q(b_2)$, by minimizing the KL-divergence: $$ D_{KL}(q||p)=E_q[\log q(a,b_1,b_2)-\log p(a,b_1,b_2|c_1,c_2)] $$ $$ =\sum_a q(a)\log q(a)+\sum_{b_1}q(b_1)\log q(b_1)+\sum_{b_2}q(b_2)\log q(b_2) -\sum_{b_1}q(b_1)\log p(c_1|b_1)-\sum_{b_2}q(b_2)\log p(c_2|b_2) - \sum_{a,b_1,b_2}q(a)q(b_1)q(b_2)\log p(b_1,b_2|a)- \sum_a q(a)\log p(a) + \log p(c_1,c_2) $$ This includes one term that is very problematic: $\log p(b_1,b_2|a)$. The probability inside the log equals either 1 or 0. It equals 1 when $a=b_1=b_2$ (i.e. when they are all true or all false), and 0 otherwise. The problem is that $\log 0 = -\infty$. Thus, if $q(a)q(b_1)q(b_2)$ assigns any probability mass to assignments for which $a$, $b_1$ and $b_2$ are not all identical, the KL-divergence goes to infinity. And because $q$ factorizes, the only way for it to place all of its mass on the two all-equal assignments is for each factor to be degenerate. Therefore, there are only two "legal" outcomes for variational inference here: we can either have $q(a=1)=q(b_1=1)=q(b_2=1)=1$, or $q(a=0)=q(b_1=0)=q(b_2=0)=1$.
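As a sanity check on this argument, here is a continuation of the sketch above (it reuses the `posterior` dict computed there, with the same placeholder numbers) that evaluates $D_{KL}(q||p)$ for a few factorized $q$'s; only the two fully deterministic ones come out finite:

```python
import math

def kl_mean_field(qa, qb1, qb2):
    """D_KL(q(a)q(b1)q(b2) || p(a,b1,b2|c1,c2)); arguments are P(variable = 1)."""
    kl = 0.0
    for (a, b1, b2), p in posterior.items():
        q = ((qa if a else 1 - qa) *
             (qb1 if b1 else 1 - qb1) *
             (qb2 if b2 else 1 - qb2))
        if q == 0.0:
            continue                    # 0 * log(0 / p) contributes nothing
        if p == 0.0:
            return math.inf             # q puts mass where the posterior has none
        kl += q * math.log(q / p)
    return kl

print(kl_mean_field(1.0, 1.0, 1.0))     # finite: all mass on a = b1 = b2 = 1
print(kl_mean_field(0.0, 0.0, 0.0))     # finite: all mass on a = b1 = b2 = 0
print(kl_mean_field(0.7, 0.7, 0.7))     # inf: a non-degenerate q leaks mass onto zero-probability states
```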
This isn't very attractive, since the evidence provided by $c_1$ and $c_2$ might not be very informative about $b_1$ and $b_2$ (and, consequently, about $a$), so we would really like our inferences to reflect that uncertainty and assign some probability to both options (true or false) for $a$, $b_1$ and $b_2$. Variational inference will instead collapse onto the mode of the posterior and discard all of this uncertainty, which is no better than doing MAP inference.
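For contrast, the exact posterior does preserve this uncertainty: because of the constraint it only puts mass on the two all-equal assignments, and $$ p(a=1,b_1=1,b_2=1|c_1,c_2) = \frac{p(a=1)\,p(c_1|b_1=1)\,p(c_2|b_2=1)}{\sum_{a'\in\{0,1\}} p(a=a')\,p(c_1|b_1=a')\,p(c_2|b_2=a')}, $$ which is in general strictly between 0 and 1 whenever the evidence is not fully decisive.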
My understanding is that this is quite a well-known issue with variational inference, but my question is: is there any solution or workaround? Is there a different way of stating or approaching the problem that allows us to make progress and preserve the uncertainty that we're interested in? Or is the only way around it to use a different approximate inference algorithm (e.g. belief propagation)?