My goal is to create a VAE with a Dirichlet-distributed latent space. Since the reparameterization trick cannot be applied directly to the Dirichlet distribution, I am trying to approximate the Gamma distribution with a Weibull distribution, from which I would then generate my Dirichlet-distributed random samples.
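To make the sampling idea concrete, here is a minimal sketch using only the Python standard library. It assumes the usual inverse-CDF form of the Weibull reparameterization, $x = \lambda\,(-\ln(1-u))^{1/k}$ with $u \sim U(0,1)$, and then normalizes independent draws the same way independent Gamma draws are normalized to obtain a Dirichlet sample. The function names and the example parameter values are hypothetical and only serve as illustration.

```python
import math
import random


def sample_weibull(k, lam, rng):
    """Reparameterized Weibull draw: x = lam * (-ln(1 - u))**(1/k), u ~ U(0, 1)."""
    u = rng.random()  # u lies in [0, 1), so 1 - u lies in (0, 1] and the log is finite
    return lam * (-math.log(1.0 - u)) ** (1.0 / k)


def sample_approx_dirichlet(ks, lams, rng):
    """Normalize independent Weibull draws that stand in for Gamma(alpha_i, 1) draws."""
    draws = [sample_weibull(k, lam, rng) for k, lam in zip(ks, lams)]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical Weibull parameters for a 3-dimensional latent space.
rng = random.Random(0)
sample = sample_approx_dirichlet([1.5, 2.0, 1.2], [1.0, 2.0, 0.5], rng)
```

Because the noise $u$ does not depend on $k$ or $\lambda$, the sample is a deterministic and differentiable function of the parameters, which is the property the reparameterization trick needs. In an actual VAE the same expression would be written in an autodiff framework rather than with `math`.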
To find the best approximation of a given distribution $\mathrm{Gamma}(\alpha,\beta)$, I use the KL divergence between the two distributions (see: WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling):
$$ f(k, \lambda) = \mathrm{KL}(\mathrm{Weibull}(k,\lambda)\,\|\,\mathrm{Gamma}(\alpha,\beta)) $$
$$ f(k, \lambda) = -\left[\alpha\ln\lambda-\frac{\gamma\alpha}{k}-\ln k-\beta\lambda\,\Gamma\!\left(1+\frac{1}{k}\right)+\gamma+1+\alpha\ln\beta-\ln\Gamma(\alpha)\right] $$
where $\gamma$ is the Euler-Mascheroni constant and $\Gamma$ is the gamma function.
If I set $\beta = 1$ and differentiate $f$ with respect to $k$ and $\lambda$, I should get:
$$ \frac{\partial{f}}{\partial{k}} = -[\frac{\gamma\cdot \alpha}{k^2}-\frac{1}{k}-\lambda\cdot\Gamma'(1+\frac{1}{k})\cdot(-\frac{1}{k^2})] $$
$$ \frac{\partial{f}}{\partial{\lambda}} = -[\frac{\alpha}{\lambda}-\Gamma(1+\frac{1}{k})] $$
To calculate the best approximation for a given distribution $\mathrm{Gamma}(\alpha,1)$, I would now set $\frac{\partial{f}}{\partial{k}} = 0$ and $\frac{\partial{f}}{\partial{\lambda}} = 0$.
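As a numerical cross-check of these two conditions, the following sketch may help. Setting $\frac{\partial f}{\partial\lambda} = 0$ with $\beta = 1$ gives $\lambda = \alpha / \Gamma(1+\frac{1}{k})$ directly. The remaining condition in $k$ involves $\Gamma'$ and does not appear to have a closed-form solution, so the sketch eliminates $\lambda$ and minimizes over $k$ with a plain golden-section search. The function names, the search bracket, and the tolerance are assumptions made for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def kl_weibull_gamma(k, lam, alpha, beta=1.0):
    """f(k, lambda) = KL(Weibull(k, lambda) || Gamma(alpha, beta)) from the formula above."""
    return (EULER_GAMMA * alpha / k - alpha * math.log(lam) + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0 - alpha * math.log(beta) + math.lgamma(alpha))


def best_lambda(k, alpha):
    """Solution of df/dlambda = 0 for beta = 1: lambda = alpha / Gamma(1 + 1/k)."""
    return alpha / math.gamma(1.0 + 1.0 / k)


def fit_weibull_to_gamma(alpha, k_lo=0.05, k_hi=50.0, tol=1e-10):
    """Golden-section search over k, with lambda eliminated via best_lambda.

    The bracket [k_lo, k_hi] is an assumed search range, not a derived bound.
    """
    def g(k):
        return kl_weibull_gamma(k, best_lambda(k, alpha), alpha)

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = k_lo, k_hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if g(c) < g(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    k = 0.5 * (a + b)
    return k, best_lambda(k, alpha)


# Example: Weibull parameters approximating Gamma(2, 1).
k_fit, lam_fit = fit_weibull_to_gamma(2.0)
```

A useful sanity check is $\alpha = 1$: $\mathrm{Gamma}(1,1)$ is the unit exponential distribution, which equals $\mathrm{Weibull}(1,1)$ exactly, so the search should return $k \approx 1$ and $\lambda \approx 1$ with a KL value of approximately zero.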
My questions are:
- Am I on the right track, and if so, can anybody help me express $k$ and $\lambda$ explicitly?
- The KL term in the ELBO is, according to Dirichlet Variational Autoencoder, calculated like this: $$ \mathrm{KL}(Q\,\|\,P)=\sum \log\Gamma(\alpha_k)-\sum \log\Gamma(\hat{\alpha}_k)+\sum(\hat{\alpha}_k-\alpha_k)\,\psi(\hat{\alpha}_k) $$ What choice of prior is advisable here?
- The following question is not that important for my endeavor, but clearing it up would be helpful: the posterior is updated at each optimization step within a VAE. Shouldn't the prior then be set to the latest posterior? How can that work if I choose fixed values for the parameters of the prior? (cf. Auto-Encoding Variational Bayes, Appendix B)
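For reference, the KL term quoted above from the Dirichlet VAE paper can be evaluated as in the sketch below. It assumes that $Q$ and $P$ are products of independent $\mathrm{Gamma}(\hat{\alpha}_k, 1)$ and $\mathrm{Gamma}(\alpha_k, 1)$ distributions, with $\hat{\alpha}$ the posterior and $\alpha$ the prior parameters, and that the sums run over the latent dimensions. The Python standard library has no digamma function, so a small helper based on the recurrence $\psi(x) = \psi(x+1) - \frac{1}{x}$ and the standard asymptotic series is included. All names and the example values are hypothetical.

```python
import math


def digamma(x):
    """psi(x) for x > 0: shift x upward with psi(x) = psi(x + 1) - 1/x, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))


def kl_multi_gamma(alpha_hat, alpha):
    """sum_k [ log Gamma(alpha_k) - log Gamma(alpha_hat_k) + (alpha_hat_k - alpha_k) * psi(alpha_hat_k) ]."""
    return sum(math.lgamma(a) - math.lgamma(ah) + (ah - a) * digamma(ah)
               for ah, a in zip(alpha_hat, alpha))


# Hypothetical posterior parameters against a flat prior alpha = (1, 1, 1).
kl = kl_multi_gamma([2.0, 0.5, 1.0], [1.0, 1.0, 1.0])
```

The term is zero when $\hat{\alpha} = \alpha$ and grows as the posterior parameters move away from the prior, which gives a quick way to compare candidate priors numerically.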
I know that there are other methods (e.g. Implicit Reparameterization Gradients) to achieve my goal, but I want to try this method for practice.