When training a Variational Autoencoder, the objective being maximised is the variational (evidence) lower bound:
$$ \mathscr{L}(\boldsymbol{\theta}, \phi; \mathbf{x}^{(i)}) = -D_{KL}\left(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})\,||\,p_{\boldsymbol{\theta}}(\mathbf{z})\right) + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})\right] $$
I feel that I understand the practical calculation of the first term (the KL divergence) reasonably well, since the original paper provides a short derivation (Appendix B) that isn't too difficult to follow.
It's the second term that's giving me trouble. Firstly, it's not clear to me exactly what the subscript on the expectation denotes in this case. Secondly, if our model produces real values, how do we evaluate $p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})$ for even a single sample? For a continuous output distribution, the probability of producing exactly one specific sample is zero, even if we output a full probability distribution.
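For reference, my current understanding (which may be wrong) is that the paper estimates this expectation with Monte Carlo samples drawn from the encoder, so the subscript names the distribution the samples of $\mathbf{z}$ come from:

$$ \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})\right] \approx \frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)}), \qquad \mathbf{z}^{(i,l)} \sim q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}), $$

often with $L = 1$ in practice. But this still leaves the question of how $\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})$ itself is evaluated from the decoder's output.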
I've not been able to find any discussion (that I could understand as such, at least) of how this probability is determined from the model's outputs. But if I look at some implementations (implementation 3, implementation 4), I see that it's calculated simply as the negative log likelihood / cross entropy, or as the MSE between the input and the reconstruction.
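For concreteness, this is roughly the pattern I see (a minimal sketch in PyTorch; the tensor names, shapes, and dummy data are my own, not taken from the linked code):

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 784
x = torch.rand(batch, dim)        # "real" data, here assumed to lie in [0, 1]
recon_x = torch.rand(batch, dim)  # stand-in for the decoder output, e.g. after a sigmoid

# Variant 1: the reconstruction term computed as binary cross entropy
# between the decoder output and the input.
recon_loss_bce = F.binary_cross_entropy(recon_x, x, reduction='sum')

# Variant 2: the reconstruction term computed as mean squared error
# between the decoder output and the input.
recon_loss_mse = F.mse_loss(recon_x, x, reduction='sum')
```

Both of these are presented as the same reconstruction term, which is what prompts the following.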
How can the same expectation, given the same kind of outputs (real-valued, albeit not necessarily bounded in the same way), lead to these different calculations? And how should I approach computing the second term in general?
Additionally, $p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})$ is sometimes referred to as a probability and other times as a "model". A model is very different from a probability; how can I reconcile these two descriptions?