The question in the title has an easy answer: just take a data set with $n < p$, so that the logistic regression MLE does not exist; likewise, if perfect separation occurs, ridge or lasso will be better (than nothing). However, I want to generate a binary outcome data set where the MLE for logistic regression exists, but ridge and/or lasso still clearly outperforms it in terms of expected accuracy.
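Just to make the degenerate case concrete, here is a toy sketch of my own (not part of the simulation below): with perfectly separated data, glm warns that fitted probabilities of 0 or 1 occurred and its coefficients blow up, while a ridge fit from glmnet still returns finite estimates.
library(glmnet)
set.seed(1)
x_sep <- matrix(rnorm(40), ncol = 2) #20 observations, 2 predictors
y_sep <- as.numeric(x_sep[, 1] > 0) #label fully determined by the sign of x1: perfect separation
glm(y_sep ~ x_sep, family = "binomial") #warns, coefficient of x1 blows up
coef(glmnet(x_sep, y_sep, family = "binomial", alpha = 0, lambda = 0.1)) #finite ridge estimates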
From various sources I gathered that multicollinearity makes plain logistic regression (and OLS) perform much worse than its penalized counterparts. For continuous outcome variables there are also a few examples where penalized regression is clearly better than OLS (here, or Section 1.1 in Bishop's pattern recognition book). However, I could not find such examples for binary outcome data, and in the numerous scenarios I tried myself, plain logistic regression was only worse when its MLE did not exist.
Here is one of my attempts below. I used a correlated multivariate normal distribution to generate $X$, and then the class labels $y$ were drawn from a logistic model with true parameters $\beta$. I varied the training set size, the correlation matrix, the data-generating coefficients, and the optimization criterion of cv.glmnet (see the short sketch below for what I mean by the last one), but none of these helped the regularized methods. I would be grateful if someone could come up with a simple example where lasso or ridge clearly wins in terms of accuracy.
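By the optimization criterion of cv.glmnet I mean its type.measure argument, which can be switched from the default binomial deviance to misclassification error; a minimal sketch with placeholder data x_toy/y_toy:
library(glmnet)
set.seed(1)
x_toy <- matrix(rnorm(200 * 5), ncol = 5)
y_toy <- rbinom(200, 1, 1/(1 + exp(-x_toy[, 1])))
#default criterion: binomial deviance
cv_dev <- cv.glmnet(x_toy, y_toy, family = "binomial", alpha = 0)
#alternative criterion: misclassification error
cv_class <- cv.glmnet(x_toy, y_toy, family = "binomial", alpha = 0, type.measure = "class")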
library(glmnet)
library(MASS)
Accuracy <- function(y, prob) mean(y == (prob > .5)) #proportion correctly classified at the 0.5 threshold
#generating data from a logistic model
#beta is the vector of true data-generating coefficients
#cor_mat is the correlation matrix of the predictor variables
generate_data <- function(n, beta, cor_mat){
  n_pred <- length(beta) - 1 #number of predictors
  x <- mvrnorm(n, rep(0, n_pred), cor_mat)
  design_matrix <- cbind(1, x)
  #true probabilities from the logistic model
  p <- as.numeric(1/(1 + exp(-design_matrix %*% beta)))
  #drawing outcomes from the Bernoulli distribution
  y <- rbinom(n, 1, p)
  dat <- as.data.frame(x)
  dat$y <- y; dat$p <- p
  return(dat)
}
#autoregressive (AR(1)) correlation matrix: entry (i, j) is gamma^|i-j|
create_exponential_correl_mat <- function(n_pred, gamma){
  a <- c((n_pred-1):0, 1:(n_pred-1))
  exp_mat <- sapply(n_pred:1, function(x) a[x:(x+n_pred-1)]) #matrix of |i-j| exponents
  gamma^exp_mat
}
n_sim <- 50
n_t <- 100 #size of training set
n_pred <- 20 #number of predictors
beta <- rep(0, n_pred+1)
coefs <- c(2, 2, -2, -2) #only the first four predictors have nonzero true coefficients
beta[2:(length(coefs)+1)] <- coefs
beta #data generating coefficient vector
> 0 2 2 -2 -2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The true data-generating coefficient vector; the first element is the intercept.
correl_mat <- create_exponential_correl_mat(n_pred, 0.9)
correl_mat[1:5, 1:5]
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1.0000 0.900 0.81 0.729 0.6561
> [2,] 0.9000 1.000 0.90 0.810 0.7290
> [3,] 0.8100 0.900 1.00 0.900 0.8100
> [4,] 0.7290 0.810 0.90 1.000 0.9000
> [5,] 0.6561 0.729 0.81 0.900 1.0000
The upper left corner of the correlation matrix.
log_acc <- ridge_acc <- lasso_acc <- numeric(n_sim)
set.seed(1337)
for (i in 1:n_sim){
  dat <- generate_data(n_t, beta, correl_mat)
  #Fitting the models on the data (p, the true probability, is excluded from the predictors)
  logist_m <- glm(y ~ . - p, family = "binomial", data = dat)
  x <- model.matrix(y ~ . - p, dat)
  cv_ridge <- cv.glmnet(x, dat$y, family = "binomial", alpha = 0)
  cv_lasso <- cv.glmnet(x, dat$y, family = "binomial", alpha = 1)
  #Using a large external data set to evaluate the models
  external <- generate_data(10^5, beta, correl_mat)
  x <- model.matrix(y ~ . - p, external)
  lg_p <- predict(logist_m, external, type = "response")
  rg_p <- predict(cv_ridge, x, type = "response", s = "lambda.1se")
  ls_p <- predict(cv_lasso, x, type = "response", s = "lambda.1se")
  log_acc[i] <- Accuracy(external$y, lg_p)
  ridge_acc[i] <- Accuracy(external$y, rg_p)
  lasso_acc[i] <- Accuracy(external$y, ls_p)
}
result_df <- data.frame(log_acc, ridge_acc, lasso_acc)
result_df$diff_ridge <- result_df$ridge_acc - result_df$log_acc
result_df$diff_lasso <- result_df$lasso_acc - result_df$log_acc
apply(result_df, 2, mean)
> log_acc ridge_acc lasso_acc diff_ridge diff_lasso
> 0.7275000 0.6997832 0.7247368 -0.0277168 -0.0027632
So logistic regression was on average about 2.8 percentage points more accurate than the tuned ridge and 0.3 percentage points more accurate than the tuned lasso, even though 16 of the 20 predictors had zero true coefficients and strong collinearity was present (every predictor was correlated at 0.9 with at least one other predictor).
This question is very similar to this one, but here I use an external data set for evaluation and consider binary classification.