
I saw a post that used the following data:

library(glmnet)

age     <- c(4, 8, 7, 12, 6, 9, 10, 14, 7) 
gender  <- as.factor(c(1, 0, 1, 1, 1, 0, 1, 0, 0))
bmi_p   <- c(0.86, 0.45, 0.99, 0.84, 0.85, 0.67, 0.91, 0.29, 0.88) 
m_edu   <- as.factor(c(0, 1, 1, 2, 2, 3, 2, 0, 1))
p_edu   <- as.factor(c(0, 2, 2, 2, 2, 3, 2, 0, 0))
f_color <- as.factor(c("blue", "blue", "yellow", "red", "red", "yellow", 
                       "yellow", "red", "yellow"))
asthma <- c(1, 1, 0, 1, 0, 0, 0, 1, 1)

xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[, -1]
x        <- as.matrix(data.frame(age, bmi_p, xfactors))

# alpha=1 gives the pure lasso penalty; alpha=0 gives ridge;
# values in between blend the two (elastic net).
glmmod <- glmnet(x, y=as.factor(asthma), alpha=1, family="binomial")

# Plot variable coefficients vs. shrinkage parameter lambda.
plot(glmmod, xvar="lambda")

It seems that they are doing lasso regression on a dichotomous outcome. Is this even valid? If so, how do they make sure the fitted values make sense for a binary response?

user321627

1 Answer


It is valid. Note the family="binomial" argument, which tells glmnet to fit a penalized logistic regression, appropriate for a classification problem. An ordinary lasso regression on a continuous response would use family="gaussian" instead.

In this setting, glmnet estimates the parameters of the binomial GLM by maximising the binomial log-likelihood while imposing the lasso penalty on the coefficients. A dichotomous response is perfectly fine here.
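Concretely, for alpha=1 and family="binomial" the objective being maximised is the penalized binomial log-likelihood (in the notation of the glmnet vignette, with response $y_i \in \{0, 1\}$ and predictors $x_i$):

$$\max_{\beta_0,\,\beta}\;\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\,(\beta_0 + x_i^\top\beta) - \log\!\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] \;-\; \lambda\sum_{j=1}^{p}|\beta_j|$$

The first term is the usual logistic-regression log-likelihood; the second is the $\ell_1$ penalty that shrinks coefficients and sets some exactly to zero as $\lambda$ grows.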

This is useful because the lasso penalty performs feature selection and shrinks the remaining coefficients, which helps avoid overfitting.
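If you want a single value of lambda rather than the whole path, a common approach (a sketch, not from the original post; with only nine observations it is purely illustrative) is cross-validation with cv.glmnet:

```r
library(glmnet)

# Reusing x and asthma from the question's code.
set.seed(1)

# Cross-validate over the lambda path (binomial deviance by default).
# nfolds is kept small only because the sample is tiny.
cvfit <- cv.glmnet(x, y = as.factor(asthma), alpha = 1,
                   family = "binomial", nfolds = 3)

# Coefficients at the lambda minimising CV deviance;
# some will typically be exactly zero (selected out).
coef(cvfit, s = "lambda.min")

# Predicted probabilities (not raw linear predictors) for the training data.
predict(cvfit, newx = x, s = "lambda.min", type = "response")
```

Because type = "response" is used, the predictions are probabilities in (0, 1), which is how the binary Y "makes sense" here.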

dcl
  • Is there a mathematical formula describing what is exactly going on? It is a bit confusing still – user321627 Apr 17 '18 at 04:06
  • Check out the 'Logistic Regression' section in the glmnet vignette: https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet_beta.pdf . You can see how the likelihood for an ordinary binomial GLM (logistic regression) is derived here: http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf – dcl Apr 17 '18 at 05:48