
I want to model a response variable (y) as a function of two explanatory variables (x and z). y consists of concentration measurements of a physiological parameter of the human body obtained with a worldwide accepted method (method A). x comes from a novel method (method B) that might be an alternative since it is much cheaper. However, method B is thought to be more or less accurate depending on the time (z) for which the required equipment is used. So, my goal is to test the significance of x and z for accurately measuring y. I have 6 individuals (ID: A, B, C, D, E, F). Below I show the relationship between x and y:

[Figure: relationship between x and y, with z shown as categories] *Note: here I categorized z for illustration purposes, but z, like x, is numerical, ranging from 1 (hr) to 7 (hr).

Given that my response variable has a sharply non-normal distribution, that the variance increases as x increases, and that I have several individuals but am not interested in differences among individuals, I thought of a GLMM with a Gamma distribution to test the effect of x and z in explaining y.
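As a sanity check of this choice, here is a minimal base-R sketch with simulated data (the coefficients and sample size are made up, not from my data): under a Gamma GLM with a log link, the variance grows with the square of the mean, which matches the "spread increases with x" pattern I see.

```r
# Simulated illustration: under a Gamma GLM with log link, the variance
# grows with the square of the mean, so spread increases with x.
set.seed(1)
n  <- 500
x  <- runif(n, 0.02, 2.5)                  # hypothetical predictor like X
mu <- exp(-2.5 + 0.8 * x)                  # log link: log(mu) = b0 + b1*x
y  <- rgamma(n, shape = 5, rate = 5 / mu)  # mean = mu, variance = mu^2 / 5

fit <- glm(y ~ x, family = Gamma(link = "log"))
coef(fit)  # slope estimate should be close to the true 0.8
```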


I ran GLMMs with a Gamma distribution and a log link.

## Setting random structure ####
library(lme4)

mod1 <- glmer(Y ~ 1 + (1|ID), data = df, family = Gamma(link = log))
mod2 <- glmer(Y ~ 1 + (X|ID), data = df, family = Gamma(link = log))
AIC(mod1, mod2)

## Setting fixed structure ####
library(MuMIn)  # for r.squaredGLMM() and model.sel()

mod1 <- glmer(Y ~ 1 + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
mod2 <- glmer(Y ~ X + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
mod3 <- glmer(Y ~ X + Z + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
mod4 <- glmer(Y ~ X + X:Z + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))

r.squaredGLMM(mod4)[1, c(1, 2)]

mod.list <- list(mod1, mod2, mod3, mod4)
model.sel <- model.sel(mod.list, rank = "AIC", extra = c(r2 = function(x) round(r.squaredGLMM(x)[1, c(1, 2)], 2)))

To test whether x and z are significant, I compared the models by AIC.

model.sel

Model selection table 
  (Intr) ns(VDB.V13,3)    n.V13 cnd((Int)) dsp((Int)) cnd(ns(VDB.V13,3)) cnd(n.V13:VDB.V13) r2.R2m r2.R2c    class   control ziformula dispformula     random df   logLik     AIC  delta weight
3 -2.567             + -0.04178                                                               0.66   0.71 glmerMod gC(Nl_Md)                          VD.V1|I  9 2017.580 -4017.2   0.00      1
2 -2.645             +                                                                        0.65   0.70 glmerMod gC(Nl_Md)                          VD.V1|I  8 2006.875 -3997.7  19.41      0
4                                   -2.661          +                  +                  +   0.66   0.76  glmmTMB                  ~0          ~1 c(VD.V1|I)  9 2001.622 -3985.2  31.92      0
1 -1.559                                                                                      0.00   0.36 glmerMod gC(Nl_Md)                          VD.V1|I  5 1682.428 -3354.9 662.31      0
Abbreviations:
control: gC(Nl_Md) = ‘glmerControl(Nelder_Mead)’
Models ranked by AIC(x) 
Random terms: 
VD.V1|I = ‘VeDBA.V13AP | ID’
c(VD.V1|I) = ‘cond(VeDBA.V13AP | ID)’
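As a side note, the delta and weight columns follow mechanically from the AIC column, so they can be reproduced in base R (using the printed AIC values, which are rounded):

```r
# Akaike weights recomputed from the AIC column of the table above
aic    <- c(-4017.2, -3997.7, -3985.2, -3354.9)
delta  <- aic - min(aic)
weight <- exp(-delta / 2) / sum(exp(-delta / 2))
round(delta, 2)   # matches the delta column up to rounding of the printed AICs
round(weight, 2)  # the top model carries essentially all the weight
```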

My problem arises when I look at the residuals vs. predicted values, since I think there are clear patterns. I don't show the distribution of the residuals here because, if I understand correctly, they are not required to be normally distributed.

[Figure: residuals vs. predicted values, showing clear patterns]
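For anyone wanting to reproduce this kind of check, a base-R sketch on simulated data (not my data): extract the deviance residuals and plot them against the fitted values; a well-specified model should show no trend and roughly constant spread. (The DHARMa package's simulateResiduals() is another option for GLMM diagnostics.)

```r
# Deviance residuals vs fitted values for a Gamma GLM on simulated data
set.seed(3)
x   <- runif(400, 0.02, 2.5)
mu  <- exp(-2.5 + 0.8 * x)
y   <- rgamma(400, shape = 5, rate = 5 / mu)
fit <- glm(y ~ x, family = Gamma(link = "log"))

res <- residuals(fit, type = "deviance")
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)  # residuals should scatter evenly around this line
```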

Does anyone know what I could do? Any proposal? I did not share the full data because it is too long (n = 2027).

Thanks in advance

Head of my dataframe:

   ID          Y          X Z
1   A 0.34136077 1.55682000 2
2   A 0.05124066 0.05766000 2
3   A 0.05901189 0.05125333 3
4   A 0.05213855 0.05766000 2
5   A 0.05437708 0.05125333 3
6   A 0.08433229 0.05766000 3
7   A 0.03618396 0.04484667 3
8   A 0.03622474 0.05766000 1
9   A 0.18244336 0.05125333 3
10  A 0.03625487 0.03844000 2
11  A 0.03840890 0.04484667 3
12  A 0.04235018 0.03844000 3
13  A 0.03862926 0.03844000 3
14  A 0.03749647 0.02883000 2
15  A 0.04395015 0.03844000 2
16  A 0.04040225 0.04805000 2
17  A 0.04419507 0.05766000 3
18  A 0.33186947 2.53704000 1
19  A 0.31986092 0.74958000 1
20  A 0.08127853 0.05766000 1

Procedure followed according to JTH's proposal:

library(splines)  # for ns()

mod1 <- glmer(Y ~ 1 + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
mod2 <- glmer(Y ~ ns(X, 4) + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
mod3 <- glmer(Y ~ ns(X, 4) + Z + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
mod4 <- glmer(Y ~ X:Z + ns(X, 4) + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
mod5 <- glmer(Y ~ ns(X, 4):Z + (X|ID), data = df, family = Gamma(link = log), control = glmerControl(optimizer = "Nelder_Mead"))
AIC(mod1, mod2, mod3, mod4, mod5)

The results are as follows:

r.squaredGLMM(mod4)[1, c(1, 2)]

mod.list <- list(mod1, mod2, mod3, mod4, mod5)
model.sel <- model.sel(mod.list, rank = "AIC", extra = c(r2 = function(x) round(r.squaredGLMM(x)[1, c(1, 2)], 2)))

model.sel

        Int ns(X)            Z       Z:X Z:ns(X)  r2m  r2c df   logLik       AIC     delta       weight
4 -2.827029     +              0.1871558         0.65 0.69 10 320.9342 -621.8684   0.00000 1.000000e+00
2 -2.660326     +                                0.48 0.54  9 289.8442 -561.6884  60.17996 8.552394e-14
3 -2.660204     + 0.0001000814                   0.48 0.54 10 289.8443 -559.6886  62.17976 3.146559e-14
5 -2.220731                                    + 0.04 0.72  9 259.3610 -500.7219 121.14643 4.936132e-27
1 -1.377842                                      0.00 0.27  5 246.9520 -483.9039 137.96446 1.100015e-30
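To judge whether the spline model actually tracks the data (rather than relying on AIC alone), predictions can be plotted against the observations. A self-contained base-R sketch with simulated data (not my data; splines ships with base R):

```r
library(splines)

# Simulated data with a mildly nonlinear mean on the log scale
set.seed(4)
x  <- runif(400, 0.02, 2.5)
mu <- exp(-2.5 + 0.8 * x - 0.15 * x^2)
y  <- rgamma(400, shape = 5, rate = 5 / mu)

fit <- glm(y ~ ns(x, 4), family = Gamma(link = "log"))

# Predictions on the response scale over a grid of x values
grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
grid$pred <- predict(fit, newdata = grid, type = "response")

plot(x, y, col = "grey60", pch = 16, cex = 0.6)
lines(grid$x, grid$pred, lwd = 2)  # fitted mean curve over the data
```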

The plot of my residuals vs. predicted values now looks like this:

[Figure: residuals vs. predicted values after adding the spline]

Although the patterns have been sharply reduced, I suspect there is still some heteroscedasticity that I can't remove.
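One avenue I have not tried yet (a sketch only, not run on my data): glmmTMB allows the Gamma dispersion itself to depend on a covariate via its dispformula argument, which directly targets residual spread that changes with X. Assuming the glmmTMB and splines packages and the same df as above:

```r
library(glmmTMB)
library(splines)

# Same mean structure as mod4, but let the dispersion vary with X
mod4_disp <- glmmTMB(Y ~ X:Z + ns(X, 4) + (X | ID),
                     dispformula = ~ X,
                     data = df, family = Gamma(link = "log"))
summary(mod4_disp)
```

If the dispersion coefficient for X is clearly non-zero, that would confirm the heteroscedasticity and model it at the same time.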

Dekike
  • Covariate `X` is a method, right? Is that coded as a factor? If so, why is it scaled? I recognize that you have a lot of data, but maybe you can paste the head and tail of your data frame? – JTH Sep 19 '20 at 17:32
  • Hi @JTH. No, `X` is numerical; it is the concentration measurement of a physiological parameter using method `B`. `Y` is the concentration of the same physiological parameter but using method `A`, which is the worldwide-validated method. I want to test how well `X` explains `Y` given that `X` might interact with `Z`, which is the time spent with method `B` for estimating `A` (you can spend more or less time, depending on the budget you have). So, I want to know whether `X` predicts `Y` considerably well and whether `Z` affects this prediction, and by how much. – Dekike Sep 19 '20 at 19:26
  • I thought of calculating the variance explained by the fixed factors (R2m) and the variance explained by the fixed factors plus the random ones (R2c) to quantify those things. However, my problem is that I get patterns in the residuals. – Dekike Sep 19 '20 at 19:27
  • You see values below 0 for `X` because it is scaled, since I have read that it is better to scale explanatory variables when you include interactions in the model. – Dekike Sep 19 '20 at 19:28
  • I also added a small part of my dataframe as you suggested. – Dekike Sep 19 '20 at 19:40
  • I would try to add a natural spline to the regression equation to stamp out the pattern in the residuals, possibly something like `glmer(y ~ x*z + ns(x, df=4) + (1 + x | ID), data = data, family = Gamma(link='log'))`. It may take some experimentation. – JTH Sep 19 '20 at 20:24
  • Where you wrote `df=4`, did you mean `knots=4`? Did you make a mistake? And does `4` mean the number of "corners" of my spline? One last thing, why did you propose `(1 + X|ID)` instead of `(X|ID)`? When you say that it may take some experimentation, what do you mean? What parameters do I play with? With `knots` only? Sorry for all the questions, but I want to do things well. Thanks for your help – Dekike Sep 20 '20 at 08:49
  • I have tried it @JTH, and it improves the residual plot considerably. I share in the post the updated code after your advice and the plot of my residuals vs predicted values. However, I have some doubts. You proposed `m4 – Dekike Sep 20 '20 at 10:34
  • Do you know which is more correct from a statistical point of view, `m4` or `m5`? In the post I show the GLMMs I compared using natural splines by AIC. I would like to know whether you think the way I built the models to compare them is correct and whether `m4` is the best model. If you help me with that and with the interpretation of the results in your answer, I will give you the bounty for sure. – Dekike Sep 20 '20 at 10:38
  • You can use AIC to select among these models. m4 looks much better than m5, not only because it has a lower AIC, but because I think the interaction `ns(x)*z` could be hard to interpret. As for the interpretation, I think that's tricky. I'm not sure off the top of my head how best to interpret the coefficients of a simple gamma glmer, and adding a spline to the mix makes this harder. Sorry I got `df` and knots mixed up. What I meant by experimentation is that you should try different values for `knots=` and possibly try different splines, e.g. `bs` instead of `ns`. I think `(1+X|ID)` = `(X|ID)`. – JTH Sep 20 '20 at 13:27
  • See https://stats.stackexchange.com/questions/96972/how-to-interpret-parameters-in-glm-with-family-gamma for information on coefficient interpretation. Because you have a mixed model, your interpretations are additionally conditional on the predicted values of the random effects. I wouldn't try to interpret the `ns` coefficients; rather, try to make plots of the model predictions (use the `predict` function). – JTH Sep 20 '20 at 13:41
  • In my case, what I want to know is whether `X` predicts `Y` well and whether `Z` is essential or not. I am not sure that I need to interpret the coefficients. I just want to show that method B (which yields `X`) is good enough and that `Z`, although significant, has a low weight. That is why I calculate R2m and R2c and compare those values among models. A colleague told me that he sees heteroskedasticity in the residuals... Do you agree? – Dekike Sep 20 '20 at 14:32
  • Yes, it does look heteroskedastic. I'm not an expert on deviance residuals, so perhaps it's okay. You might try `glmer(y ~ x*z + ns(x, 3) + (x + ns(x,3) | ID), ... )` to fit a different natural spline for each level of ID. As always, plot the data and predictions together to best check the model. – JTH Sep 20 '20 at 21:20
  • Scaling x and z was a bad idea. Collinearity is not that much of an issue (and can be solved differently), and interpretability is affected. Most importantly, now we don't know what the range of those variables was or which transformation to recommend to you. – carlo Sep 26 '20 at 07:14
  • I updated the post, changing all my variables back to their original values @carlo – Dekike Sep 26 '20 at 09:46
