
The most common use of variational inference seems to be computing the marginal distribution $P(X)$ in the denominator of Bayes' formula when computing the posterior probability of the hidden variables, $P(Z|X)$. This is likely a dumb question, but I don't understand why we need to compute $P(Z|X)$ itself. Why don't we just optimize over $P(Z|X)$ instead (find the $Z$ that maximizes $P(Z|X)$)? In that case we don't need to compute $P(X)$ exactly, since it does not depend on the value of $Z$, and we can just optimize the numerator $P(X|Z)P(Z)$, as in regular MAP (maximum a posteriori) estimation of the parameters $\theta$ when we don't have any hidden variables $Z$. What is the difference between these two problems (estimating parameters vs. hidden variables) such that one can get by with optimization through MAP, while the other needs sophisticated tools for computing (or approximating) $P(Z|X)$?
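
In symbols, what I mean is that Bayes' rule gives
$$P(Z|X)=\frac{P(X|Z)\,P(Z)}{P(X)},\qquad P(X)=\int P(X|Z)\,P(Z)\,dZ,$$
and the evidence $P(X)$ cancels out of the maximization,
$$\arg\max_Z P(Z|X)=\arg\max_Z P(X|Z)\,P(Z),$$
so the expensive integral seems unnecessary if all we want is the most probable $Z$.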


Edit: After a little more searching, I learned that MAP is not really fully Bayesian, since it does not learn the actual posterior distribution but instead obtains a point estimate of the parameters by maximizing the posterior probability. In that sense, what variational inference does is truly Bayesian. But I am still not sure I understand the need for obtaining the distribution itself. Is it to get a better sense of the parameter space?

user5054
  • The process of computing the marginal distribution $P(X)$ results in a generative network. Variational autoencoders are one way of computing it using variational inference. Here we do not have any information about $Z$, so we cannot do MLE for $P(Z|X)$; we assume it comes from a Gaussian prior $\mathcal{N}(\mu,\sigma \textbf{I})$. We try to match the original posterior $P(Z|X)$ with the computed one $Q(Z|X)$ by reducing the KL distance between them. I am not sure how correct my thinking is though. – saha rudra Jan 03 '18 at 02:35

1 Answer


In many cases getting the distribution is more useful than a point estimate. Say I am an interviewer and I ask each candidate a list of questions; then $X$ (observed) could be the number of questions the candidate answered correctly, and $Z$ (latent) could be whether the candidate is qualified or not.

Candidate A has correctly answered 4 out of 5 questions, and B has answered all 5. If I just do a point estimation (MLE/MAP), the most probable value of $Z$ might be the same for A and B: $$\arg\max_Z P(Z|X_A)=\arg\max_Z P(Z|X_B)=\text{qualified}.$$ If I know the full posterior distribution, then I get $$P(Z=\text{qualified}|X_A)<P(Z=\text{qualified}|X_B).$$
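
Here is a minimal Python sketch of this example; the per-question success probabilities and the uniform prior are made-up numbers purely for illustration.

```python
# Minimal sketch of the interview example; the per-question success
# probabilities and the uniform prior are made-up numbers.
from scipy.stats import binom

p_correct = {"qualified": 0.9, "unqualified": 0.5}
prior = {"qualified": 0.5, "unqualified": 0.5}

def posterior(n_correct, n_total=5):
    # unnormalized posterior P(X|Z) P(Z) for each value of Z
    unnorm = {z: binom.pmf(n_correct, n_total, p_correct[z]) * prior[z]
              for z in p_correct}
    evidence = sum(unnorm.values())      # P(X), the normalizing constant
    return {z: u / evidence for z, u in unnorm.items()}

post_A = posterior(4)  # candidate A: 4 out of 5 correct
post_B = posterior(5)  # candidate B: 5 out of 5 correct

# Both MAP estimates are "qualified"...
print(max(post_A, key=post_A.get), max(post_B, key=post_B.get))
# ...but the full posteriors differ: P(qualified|X_A) < P(qualified|X_B)
print(post_A["qualified"], post_B["qualified"])  # roughly 0.68 vs 0.95
```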


A more common situation is when there is another variable $Y$ that depends on $Z$. If we know the full distribution $p(Z|X)$, we can compute $$p(Y|X)=\int p(Y|Z)\,p(Z|X)\,dZ,$$ which we cannot obtain from only a point estimate of $Z$.

For instance, in the setting of recursive filtering algorithms, $Z$ is the state at time $t$, $X$ is the measurement at time $t$, and $Y$ is the state at time $t+1$.
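
As a toy illustration (with made-up probabilities for a binary $Z$ and $Y$), the discrete analogue of the integral above is a weighted sum over the posterior, whereas a point estimate keeps only a single row:

```python
# Toy illustration with a binary Z and Y (all probabilities made up):
# full marginalization over Z versus plugging in the MAP estimate of Z.
import numpy as np

p_z_given_x = np.array([0.6, 0.4])       # posterior over Z given X

p_y_given_z = np.array([[0.9, 0.1],      # p(Y | Z=0)
                        [0.2, 0.8]])     # p(Y | Z=1)

# full answer: p(Y|X) = sum_z p(Y|z) p(z|X)
print(p_z_given_x @ p_y_given_z)         # [0.62, 0.38]

# plug-in answer using only the MAP estimate of Z (here Z=0)
z_map = np.argmax(p_z_given_x)
print(p_y_given_z[z_map])                # [0.9, 0.1] -- overconfident
```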


Another example that might be interesting to illustrate the connection is the EM algorithm for Gaussian mixture models, in which case $X$ is the observed data and $Z$ is the discrete latent variable that denotes which Gaussian component a data point belongs to.

EM itself is an optimization (point estimation) algorithm for the model parameters $\theta$. In the E step we assign each data point a probability of belonging to each component, which amounts to computing $p(Z|X)$. In the M step we optimize $\theta$ based on $p(Z|X)$.

If in the E step we do a point estimation (hard assignment) instead, then we go from EM for GMMs to k-means clustering (assuming each component has the same identity covariance matrix).
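
A rough sketch of one EM iteration for a one-dimensional, two-component GMM (the data points and component parameters below are placeholders), contrasting the soft responsibilities with the hard assignment k-means would make:

```python
# Rough sketch of one EM iteration for a 1-D, two-component GMM;
# the data points and component parameters are placeholders.
import numpy as np
from scipy.stats import norm

x = np.array([-1.2, 0.1, 2.3, 2.8])              # observed data
means = np.array([0.0, 2.5])
stds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

# E step: responsibilities p(Z=k | x_i), i.e. the posterior over Z
lik = np.stack([w * norm.pdf(x, m, s)
                for w, m, s in zip(weights, means, stds)], axis=1)
resp = lik / lik.sum(axis=1, keepdims=True)      # soft assignment
print(resp)

# hard assignment (what k-means would do): keep only the arg max
print(np.argmax(resp, axis=1))

# M step (soft version): re-estimate the means from the responsibilities
print((resp * x[:, None]).sum(axis=0) / resp.sum(axis=0))
```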


I'm not much of a statistics person, so please correct me if I missed any important point. :]

dontloo
  • The main point of computing the marginal $p(y|x)$ is for comparing models. – Xi'an Jan 03 '18 at 05:13
  • @Xi'an Could you give an example? How would you use distributions in comparing two models? How would you decide which distribution is associated with the *better* model? What is the metric? Would you compare the sufficient statistics of the distributions? – user5054 Jan 03 '18 at 05:15
  • @user5054: the notion of relevance is the Bayes factor. – Xi'an Jan 03 '18 at 05:21
  • The Wikipedia article says "The Bayes factor is a ratio of the likelihood probability of two competing hypotheses, usually a null and an alternative." Then the Bayes factor is basically the terminology for the comparison method used in the interview example in @dontloo 's answer. – user5054 Jan 03 '18 at 05:28
  • 1
    @dontloo Thanks, the answer is very helpful. Just to note, EM example could be a little bit confusing, because EM retrieves a point estimate of the expected likelihood of data in the maximization step, while when computing the posterior P(Z|X) in the expectation step, it computes a probability rather than a point estimate, which is I think what you aimed to highlight, related to what I asked. – user5054 Jan 03 '18 at 05:31
  • @user5054 yes you're right, I made some edits. – dontloo Jan 03 '18 at 07:42
  • @Xi'an Hi, thank you very much for the comments. I'm not quite familiar with model comparison though. I guess in this case $z$ is the parameter, but I'm not sure whether $p(z|x)$ should be interpreted as the prior given the model or the posterior given the training data? – dontloo Jan 03 '18 at 07:50