I am trying to apply glmnet's lasso to a set of features that includes several categorical variables, each with multiple levels. My intention is to let the lasso shrink some of the feature coefficients to 0 so that those features can be thrown out. Some of my categorical predictors have as many as 50 levels.
The result is that glmnet is throwing out only some of the levels of the categorical predictors while keeping others. My understanding is that this is incorrect: the dummified levels are all part of the same predictor, so one needs to either throw out the entire predictor (with all its levels) or keep all of them. The data set is large, roughly 600,000 rows. I am trying to predict a binary outcome between two classes.
Here is an example of my code:
library(glmnet)
x <- model.matrix(project.status~., data = data_train)
y <- data_train$project.status
lasso.net <- cv.glmnet(x, y, alpha = 1, family = "binomial", nfolds = 5,
                       type.measure = "auc")
and the output:
> coef(lasso.net)
86 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -9.099069e-02
(Intercept) .
project.gradeGrades 6-8 .
project.gradeGrades 9-12 6.891742e-02
project.gradeGrades PreK-2 3.107257e-03
project.resourceBooks .
project.resourceClassroom Basics 8.106422e-01
project.resourceComputers & Tablets 5.849269e-01
project.resourceEducational Kits & Games 8.442034e-01
project.resourceFlexible Seating 5.031112e-01
project.resourceTrips .
project.resourceVisitors .
project.cost -5.286631e-04
school.metro.typesuburban 1.991501e-02
school.metro.typetown -8.060338e-02
school.metro.typeurban 2.380249e-01
school.percent.lunch 4.178175e-04
school.stateAlaska .
school.stateArizona -2.095588e-01
school.stateArkansas -1.652419e-01
school.stateCalifornia 1.209260e-03
school.stateColorado 6.299693e-02
school.stateConnecticut 1.186827e-01
school.stateDelaware 1.829217e-01
school.stateDistrict of Columbia 4.099672e-01
school.stateFlorida .
school.stateGeorgia -2.292140e-01
I've not included the entire output (it's long), but hopefully this presents my issue. The school.state predictor is a categorical predictor with the 50 states. I essentially want to see whether I can throw this predictor out, but instead of zeroing the entire predictor, the lasso is zeroing out only some of the states and keeping the others. The same happens with the project resource and project grade predictors. (It's essentially a charity project; I am trying to predict whether each project met its funding goal.)
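To make the grouping concrete, here is a minimal base-R sketch of how the dummy columns produced by model.matrix map back to their parent predictors, and how one might check whether a predictor was zeroed entirely or only partially. The toy data frame, its column names, and the coefficient vector `beta` are all made up for illustration; they are not taken from the real data or from the fit above.

```r
# Toy data standing in for data_train (hypothetical values and names).
toy <- data.frame(
  status = factor(c("yes", "no", "yes", "no", "yes", "no")),
  grade  = factor(c("3-5", "6-8", "9-12", "3-5", "6-8", "9-12")),
  metro  = factor(c("rural", "urban", "town", "urban", "rural", "town")),
  cost   = c(100, 250, 80, 400, 120, 300)
)

x <- model.matrix(status ~ ., data = toy)

# The "assign" attribute gives, for each column of x, the index of the
# formula term it was generated from (0 marks the intercept column).
term_labels <- attr(terms(status ~ ., data = toy), "term.labels")
groups <- attr(x, "assign")
keep <- groups > 0

# Dummy columns grouped by the predictor they belong to.
cols_by_predictor <- split(colnames(x)[keep], term_labels[groups[keep]])

# Made-up coefficients in the style of coef(lasso.net), zeros included.
beta <- c("(Intercept)" = -0.09, "grade6-8" = 0, "grade9-12" = 0.07,
          "metrotown" = 0, "metrourban" = 0, "cost" = -0.0005)

# Share of nonzero coefficients per predictor:
# 0 = dropped entirely, 1 = kept entirely, anything between = partial.
nonzero_share <- sapply(cols_by_predictor, function(cols) mean(beta[cols] != 0))
print(nonzero_share)
```

In this toy case `metro` is dropped entirely, `cost` is kept, and `grade` is only partially zeroed, which is the situation described above for school.state.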