Model selection for multilevel zero-inflated models and model assumption checks

Question

I am working on research that involves the frequency of events (summed over each hour) and the duration of events (calculated in seconds over each hour). Both outcomes contain 0s and are positively skewed. The observations (n = 14,000+) are collected hourly for up to 2 weeks from 500 participants. A simple version of the model is: outcome ~ explanatory + (1|participant/date).

For the frequency data, I have fit:

(Hurdle) Negative Binomial (since the variance exceeds the mean)
(Hurdle) Gamma (log link) (assuming it is okay to treat frequency data as continuous)

For the duration data, I have fit:

(Hurdle) Gamma (log link)
Zero-Inflated Beta (after rescaling duration as SECONDS/3600; I decided to try Beta because the hourly data are bounded at 3600 seconds)

N.B. When fitting the regular gamma/beta, I first transformed the data to remove 0s. For both the regular and zero-inflated beta, I also shrunk the 1s using EITHER the algorithm here: https://www.ncbi.nlm.nih.gov/pubmed/16594767 OR the inverse hyperbolic sine transformation (IHST). I had thought about using a zero-one-inflated beta (ZOIB), but I did not want to use brms or gamlss (since I am a beginner).
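The shrinking step for the beta models can be sketched as follows. This is a minimal sketch, assuming the linked algorithm is the commonly cited compression y' = (y(n - 1) + 0.5)/n, where n is the number of observations; the function name and the example durations are hypothetical, and Python is used purely for illustration.

```python
def squeeze_unit_interval(values):
    """Compress proportions from [0, 1] into the open interval (0, 1).

    Assumed formula: y' = (y * (n - 1) + 0.5) / n, with n the sample size.
    """
    values = [float(v) for v in values]
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError("expected proportions in [0, 1]")
    n = len(values)
    return [(v * (n - 1) + 0.5) / n for v in values]


# Hypothetical hourly durations in seconds, including a 0 and a full hour.
seconds = [0.0, 120.0, 1800.0, 3600.0]
proportions = [s / 3600.0 for s in seconds]
squeezed = squeeze_unit_interval(proportions)
```

Note that the amount of shrinkage depends on n, so the transformed values change if the sample is subset (for example, within cross-validation folds).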

I am also considering fitting a Tweedie distribution to both the frequency and duration data. (On a side note: does one transform the Tweedie regression coefficients for interpretation, e.g., exponentiate them and discuss them as rates?)
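For reference, the exponentiation mentioned in the side note corresponds to the following identity. It assumes a log link (the link for the Tweedie fit is not stated here) and a single predictor x with coefficient \beta_1, both used only for illustration:

$$
\log \mathbb{E}[y \mid x] = \beta_0 + \beta_1 x
\quad\Longrightarrow\quad
\frac{\mathbb{E}[y \mid x + 1]}{\mathbb{E}[y \mid x]} = e^{\beta_1}
$$

Under that assumption, e^{\beta_1} would be read as a multiplicative change in the expected outcome per unit increase in x, conditional on the random effects.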

I had thought about comparing AIC, R², etc. across the above models, but I understand this is suspect. Therefore, I thought about using k-fold cross-validation instead. However, the package I was recommended to use (cvms https://github.com/LudvigOlsen/cvms) says it supports lmer/glmer, but it doesn't appear to support glmmTMB, which is how I fit the above models.

I thought a potential solution might be to fit two separate models (a binomial model for the 0/1 data and a count/continuous model for the positive data, as suggested in this post: "assessing glmmTMB hurdle model fit using DHARMa scaled residual plot"). In this way, I could fit the models with lmer/glmer and perform cross-validation on the positive data using cvms. But I'm not sure whether it makes sense to partition out the 0 data and use only the positive data to validate the models. Alternatively, I suppose I could create my own folds and simulate my own data to check the predictive accuracy of the glmmTMB models?
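The idea of creating my own folds can be sketched as below. This is a minimal sketch under the assumption that folds should be assigned at the participant level, so that all hourly rows from one participant land in the same fold; the function name, the participant IDs, and the choice of k are hypothetical, and Python is used purely for illustration.

```python
import random


def assign_participant_folds(participant_ids, k=5, seed=1):
    """Return one fold index (0..k-1) per row, constant within participant."""
    unique_ids = sorted(set(participant_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    # Deal shuffled participants round-robin into k folds.
    fold_of = {pid: i % k for i, pid in enumerate(unique_ids)}
    return [fold_of[pid] for pid in participant_ids]


# Hypothetical participant column with repeated hourly rows.
rows = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p6", "p6"]
folds = assign_participant_folds(rows, k=3)
```

The resulting fold column could be attached to the data set, and each model refit on k - 1 folds and scored on the held-out fold, including the rows with 0s, so that no partitioning of the zero data is needed.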

Finally, I have used the performance package to check the model assumptions. Below are the results for the Hurdle Gamma model on duration (pictures 1 and 2 use different explanatory variables); they are pretty similar to the results for the other models. Given the size of my sample and the distributions, how concerned should I be about the QQ plots and the homogeneity of variance plots? We are not concerned with predicting new data, but with explaining the data that we have.

Hi, about your idea with "to fit two separates models (binomial for 0/1 data and count/continuous for positive data" - you misunderstand the post, the zi model fits this in one hierarchical process, so there is no partitioning of the data. — Florian Hartig, Feb 11 '21 at 21:27
