
It was suggested that my question was too broad. As I commented below, I have nearly a million data points and perhaps a hundred variables. This may be a very basic modeling question: how do I start a GAM with a large dataset? I tried the `bam` function on a much smaller dataset, and it did not work as I expected. I do have access to supercomputers, but it still seems impractical to tune a GAM on a dataset this big. It was suggested that I pick 8 to 10 variables and fit a GAM; even so, fitting a GAM on the complete dataset is slow. So my guess is that I need to reduce both the number of variables and the sample size to fit a GAM.

My original question: I have 61 bioclimatic variables that explain different or similar aspects of insect life cycles, and some of them are highly correlated. My study extent covers the North American continent at a spatial resolution of 10 km; the temporal resolution is yearly, over a range of 20 years. This means that my dataset is huge for GAMs. For prediction purposes I have built GLMs instead, but those models are complicated (e.g., a model matrix of 777,265 rows × 263 columns) and not easy to interpret. So, for interpretation purposes, I am trying to use GAMs to build small models that include fewer variables and some percentage of the samples.

I followed some questions on the mgcv package and found that most of the examples use a very small number of variables. Does that mean I need to hand-pick the variables? I used the selection approach documented in `?gam.selection` on a smaller dataset (828 × 54), and I can see that some variables are not significant as smooth terms. I also used the `concurvity` function to examine potential multicollinearity. Now I need some suggestions on variable selection: What is an appropriate number of variables for an explanatory GAM? Do I select the variables based on my knowledge, on the selection results where significant nonlinearity is detected, and on the `concurvity` results? Or what would be the most efficient variable-selection process? I appreciate your thoughts and timely help.
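For concreteness, here is a minimal sketch of the prototyping workflow described above, run on the smaller dataset. The response `y` and predictors `bio1`–`bio3` are placeholders for my actual variables, not the real names:

```r
## Minimal sketch: double-penalty selection plus a concurvity check,
## prototyped on the smaller (828 x 54) dataset. `y` and bio1..bio3
## are placeholder names.
library(mgcv)

m_small <- gam(y ~ s(bio1) + s(bio2) + s(bio3),
               data   = dat_small,
               method = "REML",   # REML smoothness selection
               select = TRUE)     # extra penalty lets whole terms shrink to 0

summary(m_small)     # p-values and effective degrees of freedom per smooth
concurvity(m_small)  # concurvity: the nonlinear analogue of collinearity
```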

  • You should probably explain what some of these numbers mean; is 263 just the number of covariates *before* turning them into smooths? For example... If you really don't know which smooth effects to retain, I would suggest turning on `select = TRUE` and using the `bam()` function in *mgcv*. – Gavin Simpson May 07 '19 at 19:26
  • I've written at least one answer on this general area: https://stats.stackexchange.com/a/405292/1390 – Gavin Simpson May 07 '19 at 19:36
  • Thanks for replying! The numbers above were mostly to show my data's properties; you can skip them as long as you get the idea that I have a very large sample size (nearly 1M) and a long list of variables. I have tried both the gam.selection approach and the bam function on a smaller dataset (surprisingly, bam is slower than gam in my case, but I may have used the wrong parameters). My questions were how to tune a GAM for my data at the start and what to do next after selection (for example, do I remove all the variables that are not significant after selection and tune the smooth effects of the remaining variables?) – Dongmei Chen May 07 '19 at 20:57
  • Have you tried `s(..., bs='cr')` within `gam` (not `bam`)? – usεr11852 May 07 '19 at 22:16
  • @usεr11852 That will save you time on constructing the basis, but computationally, with ~0.75 million observations, the bottleneck is going to be the data size. `bam()` with `s(..., bs = 'cr')` is probably as efficient as it is going to get with the sorts of models *mgcv* can fit. – Gavin Simpson May 07 '19 at 22:53
  • If you are using the double penalty approach, then you just fit the model and accept what it says, as long as you've thought about the covariates you are putting into the model. You don't want to tune `gam()`. You want to fit a model that shrinks all terms, and `select = TRUE` will do that. Don't remove variables that are not significant; that is a hard statement that the effect is exactly 0. Keep them in, and the `summary()` output will reflect the uncertainty over whether something is even in the model or not. – Gavin Simpson May 07 '19 at 22:56
  • @GavinSimpson Thanks! I will try this method (`bam()` with `s(..., bs = 'cr')`) and see how it works. Forgive me that I am not very familiar with GAM. Do you mean that I can add the covariates in the model without examining their collinearity beforehand? – Dongmei Chen May 08 '19 at 01:23
  • Collinearity won't really tell you much, as you need to be concerned with concurvity, which is the nonlinear counterpart to collinearity. One would assume you had some idea a priori as to which variables are hypothesised to be important (from previous studies), or you knew which were correlated and might choose not to include all possible variables but a curated set. It seems difficult to imagine a situation where you know enough to expect smooth effects of some variables but don't know which variables from the set to include initially. – Gavin Simpson May 08 '19 at 15:16
  • That said, using `method = 'REML', select = TRUE` or `method = 'ML', select = TRUE` will give you the best chance of not succumbing to concurvity issues, and you can always check afterwards with the `concurvity()` function that *mgcv* provides. – Gavin Simpson May 08 '19 at 15:17
  • Given that Simon Wood and colleagues have used `bam()` for fits on the order of millions to tens of millions of observations, I would venture that your problem is not so huge as to be impractical. What I would also venture is that you are probably going about this the wrong way if you are throwing in all 61 bioclimatic variables. Surely you can choose from among them to limit the model space first - i.e. you don't need to get it down to 8-10, but you might not want to include min, mean, and max temp for all months as variables in the model... – Gavin Simpson May 08 '19 at 15:22
  • Make sure you have read `?bam` and understood how to get the best from it; you'll want `method = 'fREML', select = TRUE, discrete = TRUE` and you'll want to be able to use multiple cores (setting `nthreads` and making sure you have multithreading capabilities in your BLAS), and make sure you have a lot of RAM available. See the papers cited in `?bam` for indications of the problem size this function is designed for. Then be prepared for the fit to take a while. A sketch of such a call is given after these comments. – Gavin Simpson May 08 '19 at 15:23
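Pulling the comment advice together, here is a minimal sketch of the kind of `bam()` call being recommended. The data frame `dat`, the response `y`, the predictors `bio1`–`bio3`, and the thread count are illustrative placeholders, not values from the thread:

```r
## Sketch of the bam() settings recommended above, assuming a data
## frame `dat` with response `y`. Variable names and nthreads = 4
## are placeholders.
library(mgcv)

m <- bam(y ~ s(bio1, bs = "cr") + s(bio2, bs = "cr") + s(bio3, bs = "cr"),
         data     = dat,
         method   = "fREML",  # fast REML, intended for large datasets
         select   = TRUE,     # double penalty: whole terms can shrink to zero
         discrete = TRUE,     # discretise covariates for faster fitting
         nthreads = 4)        # parallel threads; needs a multithreaded BLAS

summary(m)      # shrunken terms show EDFs near 0 rather than being dropped
concurvity(m)   # check concurvity after fitting, as suggested above
```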
