
I read somewhere that the Variational Bayes method is a generalization of the EM algorithm. Indeed, the iterative parts of the two algorithms are very similar. To test whether the EM algorithm is a special case of Variational Bayes, I tried the following:

  1. $Y$ is the data, $X$ is the collection of latent variables and $\Theta$ is the set of parameters. In Variational Bayes we can make the factorized approximation $P(X,\Theta|Y) \approx Q_X(X)Q_\Theta(\Theta)$, where the $Q$s are simpler, tractable distributions.

  2. Since the EM algorithm finds a MAP point estimate, I thought that Variational Bayes could reduce to EM if I used a Dirac delta function such that $Q^1_\Theta(\Theta)=\delta_{\Theta^1}(\Theta)$, where $\Theta^1$ is the first estimate of the parameters, as is usual in EM.

  3. When $Q^1_\Theta(\Theta)=\delta_{\Theta^1}(\Theta)$ is given, the $Q^1_X(X)$ that minimizes the KL divergence is found by the formula $$Q^1_X(X)=\frac{\exp(E_{\delta_{\Theta^1}}[\ln P(X,Y,\Theta)])}{\int\exp(E_{\delta_{\Theta^1}}[\ln P(X,Y,\Theta)])dX}$$ This simplifies to $Q^1_X(X)=P(X|\Theta^1,Y)$ (spelled out just below the list), which turns out to be the equivalent of the Expectation step of the EM algorithm!
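Spelling that simplification out (this is just rewriting the expectation under the delta function, nothing more):

$$E_{\delta_{\Theta^1}}[\ln P(X,Y,\Theta)] = \ln P(X,Y,\Theta^1) \quad\Rightarrow\quad Q^1_X(X)=\frac{P(X,Y,\Theta^1)}{\int P(X,Y,\Theta^1)\,dX}=\frac{P(X,Y,\Theta^1)}{P(Y,\Theta^1)}=P(X|\Theta^1,Y)$$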

But I can't derive the Maximization step as the continuation of this. In the next step we need to calculate $Q^2_\Theta(\Theta)$, and according to the Variational Bayes iteration rule this is:

$$Q^2_\Theta(\Theta)=\frac{\exp(E_{P(X|\Theta^1,Y)}[\ln P(X,Y,\Theta)])}{\int\exp(E_{P(X|\Theta^1,Y)}[\ln P(X,Y,\Theta)])d\Theta}$$
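To make my difficulty concrete, this update can be rewritten (splitting $\ln P(X,Y,\Theta)=\ln P(X,Y|\Theta)+\ln P(\Theta)$ and noting that $\ln P(\Theta)$ is constant under the expectation over $X$) as

$$Q^2_\Theta(\Theta)\propto P(\Theta)\exp\big(E_{P(X|\Theta^1,Y)}[\ln P(X,Y|\Theta)]\big),$$

which is a full distribution over $\Theta$, not the point estimate $\Theta^2$ that the M-step of EM would give me.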

Are VB and EM really connected in this way? How can we derive EM as a special case of Variational Bayes? Is my approach correct?

Ufuk Can Bicici
  • Where did you read that the EM algorithm finds a MAP estimate? The relationship between variational inference and EM will become clear once you understand the [view of EM presented in this paper by Neal & Hinton (1998)](ftp://ftp.cdf.toronto.edu/dist/radford/emk.pdf). See also my answer [here](http://stats.stackexchange.com/questions/64962/motivation-of-expectation-maximization-algorithm/64986#64986). – Lucas Jul 03 '14 at 13:25
  • I think I learned the EM algorithm in the same way this paper explains it: as a lower-bound maximization problem. Using Jensen's inequality and the calculus of variations, one finds that in the expectation step $P(X|\Theta^t,Y)$ is the distribution which maximizes the lower bound for $\Theta^t$, and in the maximization step one finds $\Theta^{t+1} = \arg\max_{\Theta} E_{P(X|\Theta^t,Y)}[\ln P(X,Y,\Theta)]$, which maximizes the lower bound. So this is similar to Variational Bayes. (And it converges to a local maximum of the marginal posterior, hence a MAP estimate.) – Ufuk Can Bicici Jul 03 '14 at 13:41
  • Apologies, I didn't read your question carefully enough. I believe your maximization step to compute $Q_\Theta^2$ is only valid if you allow any distribution, that is, if you only make the factorization assumption. But you additionally assumed that $Q_\Theta^2$ is a delta distribution. Try to explicitly maximize the lower bound with respect to $\Theta^2$, the parameter of $Q_\Theta^2(\Theta) = \delta_{\Theta^2}(\Theta)$. – Lucas Jul 03 '14 at 14:09
  • On page 21 of the presentation http://www.cs.cmu.edu/~tom/10-702/Zoubin-702.pdf there is a comparison of EM and VB that similarly uses the Dirac delta function, but how VB reduces to EM is not shown. – Ufuk Can Bicici Jul 03 '14 at 14:10

1 Answer


Your approach is correct. EM is equivalent to VB under the constraint that the approximate posterior for $\Theta$ is constrained to be a point mass. (This is mentioned without proof on page 337 of Bayesian Data Analysis.) Let $\Theta^*$ be the unknown location of this point mass: $$ Q_\Theta(\Theta) = \delta(\Theta - \Theta^*) $$ VB will minimize the following KL-divergence: $$ KL(Q||P)=\int \int Q_X(X) Q_\Theta(\Theta) \ln \frac{Q_X(X) Q_\Theta(\Theta)}{P(X,Y,\Theta)} dX d\Theta \\ = \int Q_X(X) \ln \frac{Q_X(X) Q_\Theta(\Theta^*)}{P(X,Y,\Theta^*)} dX $$ The minimum over $Q_X(X)$ gives the E-step of EM, and the minimum over $\Theta^*$ gives the M-step of EM.
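Spelled out (a sketch, keeping only the terms that depend on the quantity being optimized): holding the point mass fixed, the objective as a function of $Q_X$ is $\int Q_X(X)\ln\frac{Q_X(X)}{P(X,Y,\Theta^*)}dX$ plus a constant, which is minimized by $Q_X(X)=P(X|Y,\Theta^*)$. Holding $Q_X$ fixed, the only term involving $\Theta^*$ is $-\int Q_X(X)\ln P(X,Y,\Theta^*)dX$, so the update is $$\Theta^*=\arg\max_\Theta E_{Q_X}[\ln P(X,Y,\Theta)],$$ which is the (MAP-)EM M-step.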

Of course, if you were to actually evaluate the KL divergence, it would be infinite. But that isn't a problem if you consider the delta function to be a limit.
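For concreteness, here is a minimal numerical sketch of this constrained coordinate scheme on a toy model: a two-component 1D Gaussian mixture with known equal weights, unit variances and a flat prior on the means (the model, data and settings are toy choices made purely for illustration, not taken from the question or the cited papers). The $Q_X$ update computes the posterior responsibilities $P(X|Y,\Theta^*)$, and the point-mass update moves $\Theta^*$ to the maximizer of $E_{Q_X}[\ln P(X,Y,\Theta)]$, which is exactly EM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two-component 1D Gaussian mixture, equal weights, unit variances.
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def log_gauss(y, mu):
    # Log density of N(mu, 1) at y.
    return -0.5 * (y - mu) ** 2 - 0.5 * np.log(2 * np.pi)

mu = np.array([-1.0, 1.0])   # Theta*: current location of the point mass Q_Theta

for _ in range(50):
    # "E-step" = VB update of Q_X with Q_Theta fixed at the point mass:
    # Q_X(X) = P(X | Y, Theta*), the posterior responsibilities.
    # (The equal mixing weight contributes the same constant to both columns,
    # so it cancels in the normalization.)
    log_r = np.stack([log_gauss(y, mu[0]), log_gauss(y, mu[1])], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # "M-step" = move the point mass: Theta* = argmax_Theta E_{Q_X}[ln P(X, Y, Theta)].
    # With a flat prior and unit variances this is the usual weighted-mean update.
    mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)

print("estimated means:", np.sort(mu))  # should be close to (-2, 3)
```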

Tom Minka
  • Technically, maximizing $\mathbb{E}_{Q_x}[\ln P(X, Y, \Theta^*)] = \mathbb{E}_{Q_x}[\ln P(X, Y | \Theta^*)] + \ln P(\Theta^*)$ w.r.t. $\Theta^*$ corresponds to the M-step of MAP-EM (with prior $P(\Theta^*)$). -- section 3.1 of the [VBEM paper](http://mlg.eng.cam.ac.uk/zoubin/papers/valencia02.pdf) – Yibo Yang May 13 '18 at 05:54
  • Isn't $Q_\Theta(\Theta^*)=\infty$? The entropy of a Dirac distribution is hidden here, which is negative infinity... – Maverick Meerkat Aug 21 '21 at 16:02