
For concreteness:

library( mgcv )
set.seed( 1 )
RawData <- data.frame( y = rbinom( 1000, 1, 0.5 ), x1 = rnorm( 1000 ),
                      x2 = as.factor( rbinom( 1000, 1, 0.5 ) ), x3 = rnorm( 1000 ),
                      x4 = as.factor( rbinom( 1000, 1, 0.5 ) ) )
fit <- gam( y ~ s( x1 ) + x2 + s( x3, by = x2 ) + x4, data = RawData, 
            family = nb( link = log ) )

How can I measure the importance of these four variables?

I understand that "variable importance" is not a well-defined concept, so I am looking for the most straightforward way, such as an explained variance approach.

The ANOVA table seems a natural choice; however, as explained in this answer, it does not work: for the smooth terms of a GAM, its entries do not have an explained-variance interpretation.
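
For reference, here is a minimal sketch of what *mgcv* reports out of the box for the model above. It refits the same simulated data, then prints the `anova()` table and the overall deviance explained from `summary()`. The object names `tab` and `dev_expl` are only illustrative. Note that the deviance explained is a single number for the whole model, not a per-variable decomposition.

```r
library(mgcv)

set.seed(1)
# Same simulated data and model as in the question
RawData <- data.frame(y = rbinom(1000, 1, 0.5), x1 = rnorm(1000),
                      x2 = as.factor(rbinom(1000, 1, 0.5)), x3 = rnorm(1000),
                      x4 = as.factor(rbinom(1000, 1, 0.5)))
fit <- gam(y ~ s(x1) + x2 + s(x3, by = x2) + x4, data = RawData,
           family = nb(link = log))

# Wald-type tests for the parametric terms, approximate tests for the smooths
tab <- anova(fit)
print(tab)

# Proportion of the null deviance explained by the model as a whole
dev_expl <- summary(fit)$dev.expl
print(dev_expl)
```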

What would be a sound approach, then?

Gavin Simpson
Tamas Ferenci
  • What is importance here? Statistical significance? – AdamO Nov 15 '17 at 22:53
  • @AdamO Definitely not. I understand how to obtain, e.g., _p_-values, but conceptually that is not what I want here. I need something analogous to the chi-squared values in the ANOVA table of a usual linear regression, i.e. a way to measure the contribution of each variable to the whole that is explained by the model. – Tamas Ferenci Nov 16 '17 at 13:06
  • Many of these techniques in statistical learning use other metrics for variable importance: change in MSE when included in a random forest model bag sample, R2, or GCV score in MARS models. As *mgcv* does not routinely output a clear statistic that measures this, it would help to think about how you might otherwise measure variable importance, even conceptually or at a high level. That might indicate how to compute something you can use in practice to answer your question. – Gavin Simpson Nov 17 '17 at 18:14
  • @GavinSimpson Sure, I'm happy to share the original problem that motivated this question. Let `x3` be the age of a patient, `x2` the sex, `x1` his/her blood pressure, `x4` whether he/she received a certain drug, and `y` the number of times a certain event happened to him/her. The question the doctors ask: "OK, I understand blood pressure is significant, but we have a very large sample size, so that doesn't mean a lot; I'd be more interested to see how it compares to the other predictors in explaining y". – Tamas Ferenci Nov 18 '17 at 14:59
  • Yeah, I get that, but for you, what would be a good way to measure "explaining y"? Would a loss function work, so that you could compare the Poisson loss of models with and without a particular covariate? You'd also need to decide whether to re-estimate the other smooths or keep them fixed at their smoothing parameter estimates from the full model. – Gavin Simpson Nov 18 '17 at 16:33
  • @Gavin Simpson: Very, very good questions. I'd really appreciate a review of the possibilities as, honestly, I have no idea what would work and which decisions are meaningful in the scenario I described. – Tamas Ferenci Nov 18 '17 at 22:52
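
The comparison suggested in the comments (loss of models with and without a particular covariate) could be sketched roughly as follows. This is only an illustration under several assumptions, not an established importance measure. The response is simulated here from an assumed Poisson model so that the terms have something to explain. A Poisson family is used instead of `nb()` so that deviances are comparable across refits. Every reduced model re-estimates its smoothing parameters. The names `forms`, `fits` and `drop_share` are hypothetical.

```r
library(mgcv)

set.seed(1)
n <- 1000
dat <- data.frame(x1 = rnorm(n),
                  x2 = as.factor(rbinom(n, 1, 0.5)),
                  x3 = rnorm(n),
                  x4 = as.factor(rbinom(n, 1, 0.5)))

# ASSUMPTION: an invented data-generating process, not taken from the question
eta <- 0.5 * sin(2 * dat$x1) + 0.3 * (dat$x2 == "1") +
  0.2 * dat$x3^2 * (dat$x2 == "1") + 0.4 * (dat$x4 == "1")
dat$y <- rpois(n, exp(eta))

# Full model plus one reduced model per variable; dropping x2 also
# removes it from the 'by' argument of the smooth of x3
forms <- list(
  full = y ~ s(x1) + x2 + s(x3, by = x2) + x4,
  x1   = y ~ x2 + s(x3, by = x2) + x4,
  x2   = y ~ s(x1) + s(x3) + x4,
  x3   = y ~ s(x1) + x2 + x4,
  x4   = y ~ s(x1) + x2 + s(x3, by = x2)
)

# Refit every model, re-estimating the smoothing parameters each time
fits <- lapply(forms, function(f) {
  gam(f, data = dat, family = poisson, method = "REML")
})

# Increase in deviance when a variable is dropped, as a share of the
# null deviance of the full model
full <- fits$full
drop_share <- sapply(fits[-1], function(m) {
  (m$deviance - full$deviance) / full$null.deviance
})
print(round(drop_share, 3))
```

To keep the remaining smooths fixed at their full-model smoothness instead, the relevant elements of `full$sp` could be passed to the reduced fits through the `sp` argument of `gam()`. The two choices can give different rankings, so whichever is used should be stated alongside the results.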
