Categorical variables in LASSO regression

Question

I just built a logistic regression model via Lasso Penalization.

Now I'm trying to interpret the coefficients. One of my variables is "days". When I fit a normal logistic regression with glm, I usually get 6 coefficients for it: if, let's say, Monday is my reference category, I get coefficients for Tuesday through Sunday. But after I built my model with LASSO, I only get one coefficient, which is "days".

My question is now, how do I interpret the coefficient "days" in Lasso Regression?

Are you simply saying you created a new logical variable weekday indicating if the day is, indeed, a weekday? I can't see this has anything to do with LASSO. — AdamO, Aug 26 '15 at 18:42
I'm sorry, I mean "day": my variable is "day" with values "mon", "tues", ..., "sunday". Usually when I do a logistic regression, I get coefficient estimates for every day of the week. But after I did the LASSO regression, I only get one coefficient, which is "day", and I don't know how to interpret it. — ching, Aug 26 '15 at 18:44
Hmm, first of all, thank you very much for your help. I really appreciate it! I'm using the cv.glmnet function in R, and before calling it I had to convert my data from a data.frame into a data.matrix. — ching, Aug 26 '15 at 19:09

score 7 · Answer 1 · edited Feb 03 '19 at 22:18

Factor variables in R and other software are automatically expanded into several binary indicator (dummy) variables when they enter a regression model. So for instance, if I create a variable

n <- 100
dayn <- sample(1:7, n, replace = TRUE)
dayf <- factor(dayn, levels=1:7, labels=c('Sun', 'Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat'))

and I analyze it in a linear regression model, the regression model automatically creates the binary variables, taking "Sunday" as the reference level. Each indicator compares one day of the week with Sunday. Sunday vs Sunday is redundant, so it is dropped.

For instance:

mm <- model.matrix(~dayf)
head(mm)

Gives me:

> head(mm)
  (Intercept) dayfMon dayfTues dayfWeds dayfThurs dayfFri dayfSat
1           1       1        0        0         0       0       0
2           1       0        1        0         0       0       0
3           1       0        0        0         0       1       0
4           1       0        0        0         0       1       0
5           1       0        0        1         0       0       0
6           1       1        0        0         0       0       0
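For contrast, here is a minimal sketch of what the `data.matrix` conversion mentioned in the comments does to a factor. The toy data and the names `df` and `dm` are assumptions for illustration. `data.matrix` replaces a factor by its internal integer codes, so the whole factor becomes a single numeric column, and any model fitted on that matrix can only report one slope for it. That may be why a single "days" coefficient shows up, although the thread does not confirm it.

```r
# Hypothetical toy data, mirroring the dayf factor defined earlier.
set.seed(1)
dayf <- factor(sample(1:7, 100, replace = TRUE), levels = 1:7,
               labels = c('Sun', 'Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat'))
df <- data.frame(dayf = dayf)

# data.matrix() swaps the factor for its internal integer codes (1..7),
# so the factor collapses into ONE numeric column.
dm <- data.matrix(df)
dim(dm)   # 100 rows, 1 column
head(dm)

# model.matrix() expands the same factor into an intercept plus 6 indicators.
mm <- model.matrix(~ dayf, data = df)
dim(mm)   # 100 rows, 7 columns
```

A slope on the one-column version would treat Sun = 1, ..., Sat = 7 as an evenly spaced numeric scale, which is rarely what is intended for a day-of-week variable.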

Suppose further I had an outcome variable which is Poisson distributed, yet I analyze it with a linear regression model because I can:

    # Labels must match the factor labels ('Mon', 'Tues'), otherwise %in% is always FALSE
    sickdays <- rpois(n, lambda = exp(1 + 2*(dayf %in% c('Mon','Tues'))))
    boxplot(sickdays ~ dayf)

Now if my hypothesis is "Does day of the week affect the number of people taking sick days?", an appropriate test may be a 6-degree-of-freedom test of whether there is any statistically significant difference in mean sick days among the days of the week. Note that I am not concerned with exactly which day is affected. The regression model gives me 6 separate coefficients:

library(lmtest)
big.model <- lm(sickdays ~ dayf)
summary(big.model)
null.model <- lm(sickdays ~ 1)
lrtest(big.model, null.model)
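If `lmtest` is not at hand, a similar joint test can be sketched in base R: `anova` on the two nested `lm` fits gives an F test of all 6 indicator coefficients at once. The seed and the balanced, shuffled construction of `dayf` (so that every day is guaranteed to appear) are assumptions of this sketch, not part of the code above.

```r
set.seed(42)
n <- 100
days <- c('Sun', 'Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat')
# Shuffle a balanced vector so every level appears at least once.
dayf <- factor(sample(rep(days, length.out = n)), levels = days)
sickdays <- rpois(n, lambda = exp(1 + 2 * (dayf %in% c('Mon', 'Tues'))))

null.model <- lm(sickdays ~ 1)
big.model  <- lm(sickdays ~ dayf)

# Joint F test: are all 6 day-vs-Sunday coefficients zero?
joint <- anova(null.model, big.model)
joint
joint$Df[2]   # numerator degrees of freedom of the joint test
```

The point is the single test with 6 degrees of freedom for the factor as a whole, rather than six separate looks at individual days.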

Depending on your seed, the likelihood ratio test may or may not be significant and the 6 separate Wald tests may or may not be significant. The problem with the 6 separate Wald tests is that they raise a multiple-testing issue.
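To make the multiple-testing point concrete, here is a rough sketch that pulls out the 6 Wald p-values and applies a Holm correction with `p.adjust`. Holm is just one common choice and is my assumption here, not something the answer prescribes. The simulated data are again hypothetical.

```r
set.seed(42)
n <- 100
days <- c('Sun', 'Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat')
dayf <- factor(sample(rep(days, length.out = n)), levels = days)
sickdays <- rpois(n, lambda = exp(1 + 2 * (dayf %in% c('Mon', 'Tues'))))

fit <- lm(sickdays ~ dayf)

# Raw Wald p-values for the 6 day-vs-Sunday coefficients (intercept dropped).
p.raw <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

# Holm adjustment controls the family-wise error rate across the 6 tests.
p.holm <- p.adjust(p.raw, method = "holm")
cbind(p.raw, p.holm)
```

The adjusted p-values are never smaller than the raw ones, which is the price paid for looking at six comparisons instead of one joint test.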

This relates to LASSO because, with a factor, we do not usually hypothesize that individual levels are predictive on their own. So we either include all of the factor's levels as one "feature" or none of them.

As a reminder, LASSO does feature selection. What is a feature? In a regression model, the particular comparison "Tuesday vs Sunday" or "Friday vs Sunday" is not a feature. The set of 6 indicators coming from dayf is considered a single feature. So for model selection, it is all or nothing. Either all 6 indicators are included, along with their penalization, or they are all excluded.

From a theoretical perspective this makes sense. If I kept "Tuesday vs Sunday" as an indicator and dropped all the others, it would no longer mean "Tuesday vs Sunday" but "Tuesday vs every other day". So there are significant practical differences in how that indicator is interpreted when the model is expanded to include (what usually is) Wednesday vs Sunday. In that case, the two indicators are Tuesday vs S/M/Th/F/Sa and Wednesday vs S/M/Th/F/Sa, and you cannot compare them with the coefficients from the smaller model.
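The change in meaning can be checked numerically. In this sketch (simulated data and model names are assumptions), the Tuesday coefficient in the full model equals the Tuesday mean minus the Sunday mean, while the coefficient in a Tuesday-only model equals the Tuesday mean minus the mean of all other days pooled together.

```r
set.seed(42)
n <- 100
days <- c('Sun', 'Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat')
dayf <- factor(sample(rep(days, length.out = n)), levels = days)
sickdays <- rpois(n, lambda = exp(1 + 2 * (dayf %in% c('Mon', 'Tues'))))

full      <- lm(sickdays ~ dayf)                  # all 6 indicators
only.tues <- lm(sickdays ~ I(dayf == 'Tues'))     # Tuesday indicator only

# Full model: Tuesday is compared with Sunday.
b.full    <- unname(coef(full)['dayfTues'])
d.vs.sun  <- mean(sickdays[dayf == 'Tues']) - mean(sickdays[dayf == 'Sun'])

# Tuesday-only model: Tuesday is compared with every other day pooled.
b.only    <- unname(coef(only.tues)[2])
d.vs.rest <- mean(sickdays[dayf == 'Tues']) - mean(sickdays[dayf != 'Tues'])

c(b.full = b.full, d.vs.sun = d.vs.sun, b.only = b.only, d.vs.rest = d.vs.rest)
```

The two "Tuesday" coefficients answer different questions, so they should not be read as estimates of the same quantity.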

Wow, that's a really great answer! Thank you very much, Adam! In my case I have a coefficient of -7.868839e-03 for my variable "days". So that means, since it's not 0, one of the days has an influence on my dependent variable, right? But how exactly do I interpret the value -7.868839e-03 now? — ching, Aug 26 '15 at 19:31
[1] "factor", but I had to convert my data from a data.frame into a data.matrix. — ching, Aug 26 '15 at 21:37
I don't believe this is true for glmnet. Basically the independent variables input is just a matrix, and feature selection means e.g. selecting or rejecting a single dummy variable corresponding to each level of the factor. Group lasso was designed, I believe, to do what you suggest: keeping the whole factor or dropping the whole factor. — seanv507, Feb 04 '19 at 00:05
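A small sketch of the behaviour described in the last comment, assuming the `glmnet` package is installed. The simulated binary outcome and the penalty values are hypothetical. When `glmnet` is given the indicator columns from `model.matrix`, each indicator gets its own row in the coefficient table and its own penalty, so individual days can be shrunk to zero separately.

```r
library(glmnet)  # assumed to be installed

set.seed(42)
n <- 200
days <- c('Sun', 'Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat')
dayf <- factor(sample(rep(days, length.out = n)), levels = days)
# Hypothetical binary outcome that is more likely on Mon and Tues.
y <- rbinom(n, 1, plogis(-1 + 2 * (dayf %in% c('Mon', 'Tues'))))

# 6 indicator columns; the intercept column is dropped because glmnet adds its own.
x <- model.matrix(~ dayf)[, -1]

fit <- glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at a few penalty values: one row per indicator plus the intercept.
b <- coef(fit, s = c(0.1, 0.05, 0.01))
b
```

Which rows end up exactly zero depends on the data and the penalty, so no particular pattern is claimed here. Keeping or dropping the factor as a whole would need a group-lasso type method, as the comment says.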
