Background
I'm currently doing some work comparing various Bayesian hierarchical models. The data $y_{ij}$ are numeric measures of well-being for participant $i$ at time $j$. I have around 1000 participants and 5 to 10 observations per participant.
As with most longitudinal datasets, I expect to see some form of auto-correlation, whereby observations that are closer in time are more strongly correlated than those further apart. Simplifying a few things, the basic model is as follows:
$$y_{ij} \sim N(\mu_{ij}, \sigma^2)$$
where I am comparing a no lag model:
$$\mu_{ij} = \beta_{0i}$$
with a lag model:
$$\mu_{ij} = \beta_{0i} + \beta_{1} (y_{i(j-1)} - \beta_{0i}) $$
where $\beta_{0i}$ is a person-level mean and $\beta_1$ is the lag parameter (i.e., the lag effect adds a multiple of the deviation of the previous observation from its predicted value). I've also had to do a few things to estimate $y_{i0}$ (i.e., the implied observation prior to the first observation).
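To make the structure concrete, here is a simplified JAGS-style sketch of the lag model (illustrative only; the priors and the treatment of $y_{i0}$ as a latent draw from the person-level distribution are simplifications rather than the exact code I ran):

```r
# Simplified JAGS-style sketch of the lag model (illustrative only).
# y is an N x max(J) matrix (NA-padded); J[i] is the number of observations for person i.
lag_model_string <- "
model {
  for (i in 1:N) {
    # latent 'observation before the first observation', used to form the lag at j = 1
    y0[i] ~ dnorm(beta0[i], tau)
    mu[i, 1] <- beta0[i] + beta1 * (y0[i] - beta0[i])
    y[i, 1] ~ dnorm(mu[i, 1], tau)
    for (j in 2:J[i]) {
      mu[i, j] <- beta0[i] + beta1 * (y[i, j - 1] - beta0[i])
      y[i, j] ~ dnorm(mu[i, j], tau)
    }
    beta0[i] ~ dnorm(mu_beta0, tau_beta0)  # person-level means
  }
  beta1     ~ dunif(-1, 1)                 # lag parameter
  mu_beta0  ~ dnorm(0, 1.0E-4)
  tau       ~ dgamma(0.001, 0.001)
  tau_beta0 ~ dgamma(0.001, 0.001)
  sigma     <- 1 / sqrt(tau)
}
"
```

The no lag model is identical except that $\mu_{ij} = \beta_{0i}$ (i.e., `beta1` fixed at zero).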
The results I am getting indicate that:
- The lag parameter is around .18, 95% CI [.14, .21]; i.e., it is non-zero
- Mean deviance and the DIC both increase by several hundred when the lag is included in the model
- Posterior predictive checks show that by including the lag effect, the model is better able to recover the auto-correlation in the data
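For reference, the posterior predictive check on the auto-correlation was along these lines (a minimal sketch; the names and shapes of `y`, a persons-by-time matrix of observed data, and `yrep`, an iterations-by-persons-by-time array of posterior predictive draws, are placeholders):

```r
# Minimal sketch of a posterior predictive check on lag-1 autocorrelation.
# Assumes y (persons x time, NA-padded) and yrep (iterations x persons x time).
lag1_stat <- function(ymat) {
  # average within-person lag-1 correlation
  mean(apply(ymat, 1, function(x) {
    cor(x[-length(x)], x[-1], use = "pairwise.complete.obs")
  }), na.rm = TRUE)
}

obs_stat <- lag1_stat(y)               # observed test statistic
rep_stat <- apply(yrep, 1, lag1_stat)  # one statistic per posterior draw

hist(rep_stat); abline(v = obs_stat, col = "red")  # compare observed vs replicated
mean(rep_stat >= obs_stat)                         # posterior predictive p-value
```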
So in summary, the non-zero lag parameter and the posterior predictive checks suggest the lag model is better; yet mean deviance and DIC suggest that the no lag model is better. This puzzles me.
My general experience is that if you add a useful parameter it should at least reduce the mean deviance (even if, after the complexity penalty, the DIC is not improved). Furthermore, a value of zero for the lag parameter would achieve the same deviance as the no lag model.
Question
Why might adding a lag effect increase mean deviance in a Bayesian hierarchical model, even when the lag parameter is non-zero and it improves posterior predictive checks?
Initial thoughts
- I've done a lot of convergence checks (e.g., looking at traceplots; examining variation in deviance results across chains and across runs) and both models seem to have converged on the posterior.
- I've done a code check where I forced the lag effect to be zero, and this did recover the no lag model deviances.
- I also looked at mean deviance minus the penalty (i.e., $\bar{D} - p_D = \hat{D}$, the deviance at the posterior expectations of the parameters), and this also made the lag model appear worse.
- Perhaps the lag effect reduces the effective number of observations per person, which reduces the certainty in estimating the person-level means ($\beta_{0i}$), which in turn increases deviance.
- Perhaps there is some issue with how I've estimated the implied time point before the first observation.
- Perhaps the lag effect is just weak in this data.
- I tried estimating the model by maximum likelihood using `lme` with `correlation = corAR1()` (see the sketch after this list). The estimate of the lag parameter was very similar. In this case the lag model had a larger log likelihood and a smaller AIC (by about 100) than the model without a lag (i.e., it suggested the lag model was better). This reinforced the idea that adding the lag should also lower the deviance in the Bayesian model.
- Perhaps there is something special about Bayesian residuals. The lag model uses the difference between the predicted and actual $y$ at the previous time point, and this quantity is uncertain; thus, the lag effect is effectively operating over a credible interval of such residual values.
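For completeness, the `lme` comparison was roughly as follows (a sketch assuming a long-format data frame `dat` with placeholder columns `y`, `id`, and `time`):

```r
library(nlme)

# Random-intercept model, without and with an AR(1) within-person correlation structure
fit_nolag <- lme(y ~ 1, random = ~ 1 | id, data = dat, method = "ML")
fit_lag   <- lme(y ~ 1, random = ~ 1 | id, data = dat, method = "ML",
                 correlation = corAR1(form = ~ time | id))

# Likelihood-based comparison: the AR(1) model had the larger log likelihood
# and an AIC smaller by about 100
anova(fit_nolag, fit_lag)
```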