I want to use LASSO to select predictors for a 4-level outcome variable. Some continuous variables are related to each other and should either all be in the final model or all be left out.
Context: The dataset consists of around 1700 rows of data ($n$) with around 30 predictors (a mix of continuous predictors, binary categorical and multilevel categorical ones). As stated, the outcome/dependent variable is a 4-level categorical one (not ordered). The frequency of the 4 outcome groups is around 600/250/250/600.
I reckon that some associations between the continuous predictors and the outcome are not linear, but I have no idea of the functional form of these possible non-linear associations. So I wanted to use (restricted cubic) splines to allow for non-linear associations for the continuous variables.
The problem is that creating a spline is actually a data transformation: a continuous variable $x$ is used to create 'spline variables' based on the spline knots. Simply put, when the knots are set at 30 and 60, the set of spline variables would consist of $x$, $x'$ and $x''$, where:
- $x$ is the original continuous variable;
- $x'$ is $0$ when $x < 30$, and a function $f(x-30)$ when $x \geq 30$;
- $x''$ is $0$ when $x < 60$, and a function $f(x-60)$ when $x \geq 60$.
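To make the transformation concrete, here is a minimal sketch in Python with NumPy (used only for illustration, not R). It assumes $f(u) = u^3$, i.e. a plain truncated power basis; a true restricted cubic spline additionally forces the tails to be linear, which this sketch does not do. The function name `truncated_power_basis` is hypothetical.

```python
import numpy as np

def truncated_power_basis(x, knots, power=3):
    """Return the columns x, x', x'', ... for the given knots.

    Assumption: f(u) = u**power. This is NOT a restricted cubic spline,
    only the simplified construction described in the list above.
    """
    x = np.asarray(x, dtype=float)
    cols = [x]  # the original continuous variable
    for k in knots:
        # 0 below the knot, f(x - k) at or above the knot
        cols.append(np.where(x >= k, (x - k) ** power, 0.0))
    return np.column_stack(cols)

x = np.array([10.0, 30.0, 45.0, 60.0, 75.0])
basis = truncated_power_basis(x, knots=[30, 60])
print(basis)
```

All three columns derive from the same $x$, which is why I want them treated as one unit during selection.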
My strategy now is to generate the restricted cubic spline variables by fitting a regular multinomial regression, extract the model matrix, and feed this into a LASSO function. As stated, I want each set of spline variables to be kept together or dropped completely (just like the dummies for a categorical variable would be). However, I have not been able to find a proper function (in R) that does grouped LASSO for multinomial regression. Moreover, the elastic net functions I've found do not support spline fitting within the function, do not support grouping of predictor variables, and/or do not support multinomial regression models.
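To illustrate what I mean by "kept together or dropped completely", here is a rough sketch of the block soft-thresholding step that a group-lasso penalty implies, again in Python with NumPy and not taken from any particular package. The assumptions are mine: coefficients are stored as a matrix with one row per model-matrix column and one column per outcome level, a hypothetical `groups` vector maps each row to its source variable, and each group's penalty is weighted by the square root of its number of coefficients.

```python
import numpy as np

def group_soft_threshold(B, groups, threshold):
    """Shrink each group of rows of B jointly; zero the whole group if it is weak.

    B         : (p, K) coefficient matrix, K = number of outcome levels
    groups    : length-p labels; rows sharing a label form one group
    threshold : penalty strength (assumed weight: sqrt of group size)
    """
    B = np.asarray(B, dtype=float)
    groups = np.asarray(groups)
    out = np.zeros_like(B)
    for g in np.unique(groups):
        rows = groups == g
        block = B[rows]
        norm = np.linalg.norm(block)           # Frobenius norm of the block
        cut = threshold * np.sqrt(block.size)  # group-size weighted cutoff
        if norm > cut:
            out[rows] = (1.0 - cut / norm) * block  # shrink, keep all rows
        # otherwise every row of the group stays exactly zero
    return out

# Rows 0-2: spline variables of one predictor (group 1); row 3: another predictor.
groups = np.array([1, 1, 1, 2])
B = np.array([[0.10, -0.10, 0.00,  0.00],
              [0.05,  0.00, 0.00, -0.05],
              [0.00,  0.00, 0.10, -0.10],
              [2.00, -1.00, -1.00, 0.00]])
shrunk = group_soft_threshold(B, groups, threshold=0.2)
print(shrunk)
```

In this toy example the three spline rows are removed as one block while the other predictor is only shrunk, which is the behaviour I am after for the multinomial model.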
For example, the 'glmnet' documentation does not mention grouping of variables, and 'gglasso' and 'grplasso' are for binary outcomes only (or so they seem).
In short: is it possible to perform grouped LASSO for a multinomial regression model? Or to fit splines within these functions? And if so, how? (Which software allows this? I'd prefer an R package.)
P.S. This question is related to this one, but looking at the comments and answers there I am definitely looking for something else. The 'type.multinomial = "grouped"' option in glmnet does not keep specific variables together; instead, it keeps a single variable in for all outcomes of a multinomial regression if it is retained for any one of the outcomes (i.e. even when using this option I still see certain spline variables dropped while related spline variables are retained). Further, as stated, the answer provided there (using gglasso) does not apply to multinomial regression.