
Bit stuck on how to choose between models. My goal is to evidence the direction of the regression slope (negative shows improvement in a metric, positive shows decline).

Models 4 and 1b have the lowest AICc scores, but in the summary output their slopes aren't significant.

The remaining models (2a, 3, 1a, 2b) have higher AICc scores but do have significant slopes in the summary output.

library(AICcmodavg)  # for aictab()

aictab(cand.set = list(wa_glmm_1a, wa_glmm_1b, wa_glmm_2a, wa_glmm_2b, wa_glmm_3, wa_glmm_4),
       modnames = c("wa_glmm_1a", "wa_glmm_1b", "wa_glmm_2a", "wa_glmm_2b", "wa_glmm_3", "wa_glmm_4"),
       nobs = nrow(facs_3mth))

Model selection based on AICc:

            K     AICc Delta_AICc AICcWt Cum.Wt        LL
wa_glmm_4  11 32407.88       0.00      1      1 -16192.89
wa_glmm_1b  5 32473.03      65.15      0      1 -16231.50
wa_glmm_2a  5 40310.60    7902.72      0      1 -20150.29
wa_glmm_3   8 40316.30    7908.42      0      1 -20150.13
wa_glmm_1a  3 40386.08    7978.20      0      1 -20190.04
wa_glmm_2b  6 40386.75    7978.87      0      1 -20187.36
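
For reference, the two models discussed below were fitted along these lines (a sketch reconstructed from the formulas and settings in their summary output; my actual fitting calls aren't pasted here):

library(lme4)

# Reconstructed from the summary output: binomial GLMMs via glmer(), nAGQ = 0
wa_glmm_4 <- glmer(cbind(Wasted_N, TotalAdministrations) ~ month_id + factor(CareHomeSize) +
                     factor(Ratings) + (1|FacilityKey) + (1 + month_id|FacilityKey),
                   data = facs_3mth, family = binomial, nAGQ = 0)

wa_glmm_2a <- glmer(cbind(Wasted_N, TotalAdministrations) ~ month_id + factor(CareHomeSize) +
                      (1|FacilityKey),
                    data = facs_3mth, family = binomial, nAGQ = 0)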

Here's the summary output from the lowest-AICc model (4):

summary(wa_glmm_4) # 3 fixed effects + random intercept + random intercept-and-slope term | ~ month_id + factor(CareHomeSize) + factor(Ratings) + (1|FacilityKey) + (1+month_id|FacilityKey)
Generalized linear mixed model fit by maximum likelihood (Adaptive Gauss-Hermite Quadrature, nAGQ = 0) ['glmerMod']
 Family: binomial  ( logit )
Formula: cbind(Wasted_N, TotalAdministrations) ~ month_id + factor(CareHomeSize) +      factor(Ratings) + (1 | FacilityKey) + (1 + month_id | FacilityKey)
   Data: facs_3mth

     AIC      BIC   logLik deviance df.resid 
 32407.8  32470.1 -16192.9  32385.8     2120 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-9.2439 -1.6898 -0.3326  1.4233 15.2294 

Random effects:
 Groups        Name        Variance Std.Dev. Corr 
 FacilityKey   (Intercept) 1.32136  1.1495        
 FacilityKey.1 (Intercept) 0.85391  0.9241        
               month_id    0.01802  0.1342   -1.00
Number of obs: 2131, groups:  FacilityKey, 294

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -7.108831   0.677533 -10.492  < 2e-16 ***
month_id              -0.003654   0.008601  -0.425    0.671    
factor(CareHomeSize)2  1.116707   0.172013   6.492 8.47e-11 ***
factor(CareHomeSize)3  1.671183   0.194661   8.585  < 2e-16 ***
factor(Ratings)2       0.459942   0.705246   0.652    0.514    
factor(Ratings)3       0.428870   0.690388   0.621    0.534    
factor(Ratings)4       0.501436   0.719224   0.697    0.486    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) mnth_d f(CHS)2 f(CHS)3 fc(R)2 fc(R)3
month_id    -0.089                                     
fctr(CrHS)2  0.000  0.002                              
fctr(CrHS)3  0.000  0.004  0.600                       
fctr(Rtng)2 -0.953  0.003 -0.163  -0.179               
fctr(Rtng)3 -0.974  0.003 -0.168  -0.140   0.968       
fctr(Rtng)4 -0.935  0.002 -0.177  -0.162   0.934  0.950
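
Since all I actually care about from this model is the direction of the month_id slope, the piece I'm pulling out is something like this (sketch, using the fitted object above):

# Point estimate and a quick 95% Wald interval for the slope of interest
fixef(wa_glmm_4)["month_id"]                       # about -0.0037 per the summary above
confint(wa_glmm_4, method = "Wald")["month_id", ]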

And here's the summary output of the lowest-AICc model that has a significant slope (2a):

summary(wa_glmm_2a) # 2 fixed effects + 1 random intercept | ~ month_id + factor(CareHomeSize) + (1|FacilityKey)
Generalized linear mixed model fit by maximum likelihood (Adaptive Gauss-Hermite Quadrature, nAGQ = 0) ['glmerMod']
 Family: binomial  ( logit )
Formula: cbind(Wasted_N, TotalAdministrations) ~ month_id + factor(CareHomeSize) +      (1 | FacilityKey)
   Data: facs_3mth

     AIC      BIC   logLik deviance df.resid 
 40310.6  40338.9 -20150.3  40300.6     2126 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-12.5595  -2.1237  -0.4018   1.6615  19.9675 

Random effects:
 Groups      Name        Variance Std.Dev.
 FacilityKey (Intercept) 1.293    1.137   
Number of obs: 2131, groups:  FacilityKey, 294

Fixed effects:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -6.7085120  0.1365901 -49.114  < 2e-16 ***
month_id               0.0036121  0.0008711   4.147 3.37e-05 ***
factor(CareHomeSize)2  1.1396969  0.1668958   6.829 8.56e-12 ***
factor(CareHomeSize)3  1.7190618  0.1867524   9.205  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) mnth_d f(CHS)2
month_id    -0.046               
fctr(CrHS)2 -0.817 -0.001        
fctr(CrHS)3 -0.730  0.000  0.597 

So do I choose the model with the lowest AICc score regardless of significance (i.e. choose 4), or do I choose the one with the lowest AICc score that also has a statistically significant slope in the summary output (i.e. choose 2a)?

Or am I thinking about this completely wrong?
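
(For completeness, I realise 2a is nested in 4, so I could also compare the two directly with a likelihood-ratio test, along the lines of the sketch below, though that still doesn't settle the significance question.)

# Direct likelihood-ratio comparison of the nested pair (sketch)
anova(wa_glmm_2a, wa_glmm_4)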

B_Real
  • My goal is to evidence the direction of the regression slope (negative shows improvement in a metric, positive shows decline). This part is confusing. –  Jul 05 '19 at 16:42
  • And yes, you are completely wrong. –  Jul 05 '19 at 16:44
  • Perhaps readings on overfitting and underfitting, as well as likelihood ratios, BIC, etc., could be helpful to you. –  Jul 07 '19 at 00:44
  • Possible duplicate of [Conflicting approaches to variable selection: AIC, p-values or both?](https://stats.stackexchange.com/questions/265572/conflicting-approaches-to-variable-selection-aic-p-values-or-both). Also related: https://stats.stackexchange.com/q/35353/176202, https://stats.stackexchange.com/q/9171/176202 – Frans Rodenburg Jul 07 '19 at 10:50
  • @SubhashC.Davar Not sure what's confusing about it? My response is a measure of error (ratio of medication error to total medications given). If the ratio gets smaller over time, this will mean that the error metric is improving, and the direction of the regression slope will be negative. So I've got `month` as my main explanatory variable, and then I have another two explanatory variables (`CareHomeSize` and `Ratings`) that I want to control for. – B_Real Jul 08 '19 at 09:23
  • I do not know how month could influence your postulated dependent variable. –  Jul 08 '19 at 13:51
  • Moreover, your problem will be clear to you if you discriminate between GLMMs, i.e. linear models, and regression. Regression is completely different from linear models. –  Jul 08 '19 at 13:55
  • You can redesign your model for the problem you are trying to solve. Define the goal correctly. Users may be able to give valuable suggestions. –  Jul 08 '19 at 14:06
  • Month is effectively 'time using a system'. The idea is that by using the system properly, processes will improve and subsequently the error metrics will too. – B_Real Jul 09 '19 at 09:01
  • Regression is completely different from linear models? I've just gone down some rabbit holes on the internet regarding this and I can conclude that this does not appear to be a well understood concept! In fact, my brain has just melted. https://en.wikipedia.org/wiki/Linear_model "the term [linear model] is often taken as synonymous with linear regression model". So confused! – B_Real Jul 09 '19 at 09:56
  • I think in the context of my goal, I need to ignore AIC and p values and use the model that incorporates all the information I need, which is actually the third model `wa_glmm_3`. This model basically includes everything I need to control for. For this particular metric it just so happens that it's not statistically significant, which I guess is still useful to know (kinda means that using the system has no effect on the metric, with the information I have available). – B_Real Jul 09 '19 at 10:15

0 Answers