I am deriving a Gibbs sampler with a model similar to the model in this paper (a graphical model is shown in page 4). To put it simple, my question only concerns $w_i$ (a $K$-dimensional vector drawn from a normal distribution) and its precision $\gamma_w$, thus I will just make these 2 variables unknown and treat the rest as known (hopefully without losing the mathematical rigors):
$w_i \sim \mathcal{N}(0, \sigma_w^2I_K)$
$\gamma_w = \frac{1}{\sigma_w^2} \sim \Gamma(\alpha_0, \beta_0)$, here $\alpha_0, \beta_0$ are hyperparameters.
And the conditional probability is:
$p(w_{ik}|-) \sim \mathcal{N}(\mu_{w_{ik}}, \sigma_{w_{ik}}^2)$, where $\sigma_{w_{ik}}^2 = (\gamma_w + T)^{-1}$. $T$ is actually a pretty complicated term but it's nonnegative.
$p(\gamma_w|-) = \Gamma(\alpha_0 + \frac{1}{2}KN, \beta_0 + \frac{1}{2}\sum_{i=1}^N w_i^Tw_i)$
I code it up and generate some simulated data according to the graphical model. I set all the variables other than $w_i$ and $\gamma_w$ to their true value and keep unchanged. The hyperparameters are set to 0.001 which basically forces the prior of $\gamma_w$ to be 0 (I am actually not certain if this is a good choice). The sampler runs with $\gamma_w$ goes to infinity. The log-likelihood of samples goes to much larger than the log-likelihood of the true data.
From the conditional distribution, this non-convergence is actually kind of "observable", as each $w_{ik}$ is drawn with a larger precision than the sample from $p(\gamma_w|-)$ (because of the nonnegative term). Then when sampling $\gamma_w$, we are using a smaller $\sum_{i=1}^N w_i^Tw_i$. Thus the samples of $\gamma_w$ get larger.
This seems wrong, but I cannot figure out where the problem is. Any comments will be highly appreciated. Thanks!
Updated: I think this may be generally true for a hierarchical normal model with unknown variance, then I found this paper which seems to partially agree with my observation. But this paper is a little beyond my crappy math background...