
I have fitted a logistic regression with two independent variables (x1 and x2) and a binary dependent variable (y). The AUC (area under the ROC curve) is 0.7915.
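For reference, a minimal sketch of such a baseline fit. The data here is a simulated stand-in, since mydata3 isn't shown; the column names, sample size, and value ranges are assumptions. The AUC is computed in base R via the rank (Wilcoxon) formulation to avoid extra packages:

```r
## Stand-in for mydata3: the real data has columns x1, x2 and binary y
set.seed(1)
mydata3 <- data.frame(x1 = runif(500, 0, 30), x2 = runif(500, 0, 30))
mydata3$y <- rbinom(500, 1, plogis(-2 + 0.1 * mydata3$x1 + 0.05 * mydata3$x2))

## Plain logistic regression with two linear predictors
fit_glm <- glm(y ~ x1 + x2, family = binomial, data = mydata3)

## type = "response" returns probabilities rather than log odds
p_hat <- predict(fit_glm, type = "response")

## AUC as the probability that a random positive outranks a random negative
pos <- p_hat[mydata3$y == 1]
neg <- p_hat[mydata3$y == 0]
auc <- mean(outer(pos, neg, `>`) + 0.5 * outer(pos, neg, `==`))
auc
```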

The heatmap below shows the probabilities returned by the predict function. The logistic regression seems not flexible enough: for x1 = 10 the range of predicted probabilities p is 0.25 – 0.76, while in the raw data this range is 0.25 – 1.

[Heatmap: predicted probabilities from the logistic regression]

For that reason I have used the gam package to fit a GAM with smoothing splines:

mygam = gam(y ~ s(x1, df = 100) + s(x2, df = 100), family = binomial, data = mydata3)

This improves the AUC to 0.807. The heatmap below shows the probabilities returned by the GAM's predict function.

[Heatmap: predicted probabilities from the GAM]

However, for x1 between 0 and 10 there is still an area where the GAM shows large differences in comparison to a pivot table of my raw data (for every (x1, x2) cell: count(y == 1) / total count). The heatmap below shows the differences between the GAM and the pivot table:

[Heatmap: differences between the GAM predictions and the pivot table]
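The pivot-table comparison (the empirical fraction of y == 1 per (x1, x2) cell) can be sketched in base R with aggregate(). The data here is again a simulated stand-in with integer-gridded x1 and x2, as the heatmaps suggest; for continuous inputs you would bin them first:

```r
## Stand-in for mydata3: integer-gridded x1, x2 and binary y
set.seed(1)
mydata3 <- data.frame(x1 = sample(0:30, 2000, replace = TRUE),
                      x2 = sample(0:30, 2000, replace = TRUE))
mydata3$y <- rbinom(2000, 1, plogis(-3 + 0.15 * mydata3$x1 + 0.05 * mydata3$x2))

## count(y == 1) / total count for every observed (x1, x2) cell
pivot <- aggregate(y ~ x1 + x2, data = mydata3, FUN = mean)
head(pivot)  # one row per observed (x1, x2) combination
```

Subtracting the model's predicted probabilities for the same cells from pivot$y then gives the values plotted in the difference heatmap.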

Setting the degrees of freedom higher than 100 gives no improvement. I have also tried lo for local regression, with no improvement. I have also tried several settings with mgcv, but it seems impossible to make the GAM flexible enough to fit the range for x2 between 0 and 10.

Is there any way to reach this fit with a GAM, or is the data too steep?

Thanks a lot!

PS: only the labels in the first graph are swapped (not the variables).

Marcel
  • You lost me when you stated the range of y is 0.25 - 0.76 after describing y as a "binary variable." To me, as well as to a great many other readers, the latter means y can have only the values 0 and 1. In light of this, I'm having trouble understanding what any of your graphs are trying to represent. Could you edit this question to clarify what you are working with and to explain the figures? – whuber Jul 02 '19 at 21:12
  • You've tagged this with [tag:mgcv] but state you used the gam package. Can you clarify what you are using? – Gavin Simpson Jul 02 '19 at 21:38
  • @Whuber these graphs are based on the predict functions. The binary variable is predicted with the logistic regression in the first two graphs; for comparison, the third graph uses Group By, giving an average for every combination of x1 and x2 (a pivot table), and shows the difference between that pivot table and the GAM. – user2165379 Jul 02 '19 at 22:04
  • @user2165379 That means the graphs haven't been characterized correctly. Note, too, that the `predict` function has several kinds of output you can specify: probabilities or log odds. – whuber Jul 02 '19 at 22:06
  • @Gavin Simpson initially I used mgcv, as you reacted to in this post https://stats.stackexchange.com/a/415358/131164 After trying several options with the parameters I used the gam package. I will discard mgcv, although I thought they were closely related so I thought I would reach more people by adding it. – user2165379 Jul 02 '19 at 22:09
  • @ whuber you are right. I will characterize the graphs in more detail. I used probabilities as output. I will add this in the post too. – user2165379 Jul 02 '19 at 22:12

2 Answers


[I'm going to assume that you are using mgcv as indicated by the tag; if you aren't, consider switching to it as it provides a much more modern approach to estimating GAMs than the gam package.]

It seems like you want to model a bivariate smooth or smooth interaction between x1 and x2.

If you want to model a smooth interaction between two variables then you can do this via a tensor product smooth in mgcv using either the te() or t2() functions. You can do multivariate smooths with s() but those are isotropic and as such not generally suitable for modelling interactions where the covariates involved in the interaction will typically be measured in different units and on different scales.

With te() or t2() you can set the dimensions of the two marginal bases as required by passing a vector to the argument k. By the looks of the output you showed, whilst you are passing k = 100 for the marginal smooths, the penalised spline estimate is of a much smoother surface, so you probably don't want to pass k = c(100, 100): that would be a ridiculously complex basis expansion to work with.

mygam <- gam(y ~ te(x1, x2, k = c(20,20)),
             method = 'REML',
             family = binomial, data = mydata3)

This will estimate a smooth interaction (main effects plus interaction) between x1 and x2 with smoothness selection performed using REML, which is preferred over the default GCV. Each marginal basis will use 20 basis functions, for a total of 400 (20*20), although one will be removed due to identifiability constraints with the intercept.
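To visualise such a fit as a probability heatmap like the ones in the question, you can predict over a regular grid of (x1, x2) values. This sketch uses simulated stand-in data and a deliberately small k so it runs quickly; the real data, ranges, and k would differ:

```r
library(mgcv)  # ships with R as a recommended package

## Stand-in data, since the real mydata3 isn't available
set.seed(1)
mydata3 <- data.frame(x1 = runif(1000, 0, 30), x2 = runif(1000, 0, 30))
mydata3$y <- rbinom(1000, 1, plogis(-2 + 0.2 * mydata3$x1 - 0.1 * mydata3$x2))

## Tensor product smooth interaction; small k just for this sketch
mygam <- gam(y ~ te(x1, x2, k = c(5, 5)),
             method = "REML", family = binomial, data = mydata3)

## Predicted probabilities on a regular grid, ready for a heatmap
grid <- expand.grid(x1 = seq(0, 30, by = 1), x2 = seq(0, 30, by = 1))
grid$p <- predict(mygam, newdata = grid, type = "response")
## e.g. plot with: ggplot(grid, aes(x1, x2, fill = p)) + geom_tile()
```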

If you want to decompose the smooth interaction into main (marginal) effects and pure interaction then we can use the ti() smoother as follows:

mygam <- gam(y ~ s(x1, k = 20) + s(x2, k = 20) + ti(x1, x2, k = c(10,10)),
             method = 'REML',
             family = binomial, data = mydata3)

where the s() terms define the main effects of each variable, and the ti() term has the marginal effects of the full basis (that you would get with te()) removed.

The t2() version works like te() but uses a different parameterisation of the tensor product, which makes it easier to decompose into the two marginal effects and an interaction effect than te(). Its main practical advantage is that it can be used with gamm4::gamm4(), whereas te() can't.

All that said, the final graph you show suggests that the data exhibit some thresholds and cut points, which may not be well modelled with a smooth function.

Gavin Simpson
  • Thanks a lot for your extensive answer. I will try to implement this. – user2165379 Jul 02 '19 at 22:16
  • @Gavin Simpson this is exactly what I was looking for. The large differences between the GAM and the pivot table for x1 between 0 and 10 have disappeared with your settings. Moreover, with `te` or `t2` and `k = c(10,10)` the AUC improves to 0.8099. With `k = c(20,20)` the AUC improves further to 0.8101. The `s + ti` setting has an identical AUC of 0.8101. So now I can conclude that the method is flexible enough to fit my data. Would it be a logical step for me to add two loops around the code above with 20 x 20 combinations for k = c(a, b) and cross-validate the model to find the optimal a and b? – Marcel Jul 03 '19 at 13:10
  • No, don't do that. I would actually spend an hour or more learning about how penalised regression spline models in GAMs like this work. A colleague has produced this [free course](https://noamross.github.io/gams-in-r-course/) on GAMs which is good (FD: it's based on course materials I contributed to). The basic idea is that you use `gam.check()` to see if `k` was large enough; if not, double (say) `k` and refit. If the EDF of the estimated smooth doesn't change much and the test for `k'` is OK-ish (it is a heuristic) then you had a large enough `k`. – Gavin Simpson Jul 03 '19 at 15:42
  • Basically, all you need is for `k` to be large enough to contain the true function or an approximation to it, plus a little bit. The penalty will take care of the rest. – Gavin Simpson Jul 03 '19 at 15:42
  • @ Gavin Simpson interesting there is a course! I will go through the course in the coming days. If I still have the question regarding cross validation after the course I let you know. Thanks! – Marcel Jul 03 '19 at 18:32
  • @Gavin Simpson Hi Gavin, I did the complete course, which was very interesting. If I understand correctly, the bias-variance optimization is done internally by the penalty within the GAM. However, my data is grouped. For that reason I normally have to separate my data by groups during CV, so that a group cannot be in the train and test set simultaneously. Can this be a problem if I use the GAM, since I have no influence on this separation during the GAM optimization? – Marcel Jul 25 '19 at 11:36
  • @Marcel You would need to tell `gam()` about the grouping so the effect gets modelled. – Gavin Simpson Jul 26 '19 at 08:01
  • @ Gavin Thanks. Do you mean by adding a column in my dataset with the number of each group? I don't see anything specific regarding grouping in the helpfiles if I look with ?gam in R. – Marcel Jul 27 '19 at 15:06
  • @Marcel no; for each row in the data set you have a variable that records to which group each observation belongs. Say this group variable is called `group`, you would add `+ group` for fixed group intercepts, `+ s(group, bs = ‘fs’)` for random intercepts, or lots of other options if you separate smooths for the groups. This is a bit like HGAM, see: [10.7717/peerj.6876](https://doi.org/10.7717/peerj.6876) – Gavin Simpson Jul 29 '19 at 09:15
  • @ Gavin Simpson Thanks a lot Gavin. I will study this. – Marcel Jul 30 '19 at 10:06
  • I made a boo boo in my previous comment; `s(group, bs = ‘fs’)` should have been `s(group, bs = ‘re’)`, the `"fs"` basis comes up in the "lots of other options" things I referred to, and which we use a lot in the HGAM paper I linked to. – Gavin Simpson Aug 02 '19 at 16:27
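The k-checking workflow described in the comments above ("check, then double k and refit") can be sketched as follows; the data is a simulated stand-in, and the starting k is an assumption:

```r
library(mgcv)

## Stand-in data with a wiggly signal in x1
set.seed(1)
d <- data.frame(x1 = runif(1000, 0, 30), x2 = runif(1000, 0, 30))
d$y <- rbinom(1000, 1, plogis(sin(d$x1 / 3) + 0.1 * d$x2))

m <- gam(y ~ te(x1, x2, k = c(5, 5)),
         method = "REML", family = binomial, data = d)

## gam.check() reports k' against the effective degrees of freedom (edf)
## plus a heuristic p-value; edf near k' with a low p-value suggests
## the basis dimension should be increased
gam.check(m)

## If k looks too small, double it and refit; if the edf barely changes,
## the original basis was already large enough
m2 <- gam(y ~ te(x1, x2, k = c(10, 10)),
          method = "REML", family = binomial, data = d)
```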

Try something like:

mygam = gam(y ~ s(x1, x2), family = binomial, data = mydata3)

In your original code, I believe you're treating x1 and x2 distinctly rather than jointly. Combining the two into a single smooth may help. You might also want to investigate other smoothers besides s.

Of course, you are using smoothing and things can only be so flexible.

Wayne
  • @Wayne Thank you. If I replace the separate variables y ~ s(x1) + s(x2) by y ~ s(x1, x2), the AUC falls from 0.807 to 0.6506. If I add it as an interaction term in this way: y ~ s(x1) + s(x2) + s(x1, x2), the AUC is similar to that without the interaction term (0.807). I also tried `lo` instead of `s`, although this did not give improvement. Good to investigate, thanks for your suggestions. – Marcel Jul 03 '19 at 10:18