
The question in the title has an easy answer: just take a data set with $n < p$, where the logistic regression MLE does not exist; likewise, under perfect separation, ridge or lasso is better than nothing. However, I want to generate a binary outcome data set where the MLE of the logistic regression exists, but ridge and/or lasso still clearly outperforms it in terms of expected accuracy.
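
For completeness, here is a minimal sketch of that perfect-separation case (not part of my simulation; the data and the penalty value are made up): glm() warns and the slope estimate blows up, while a ridge fit still returns finite coefficients.

library(glmnet)
set.seed(1)
#x1 separates the two classes perfectly by construction
x1 <- c(seq(-2, -0.1, length.out = 20), seq(0.1, 2, length.out = 20))
x2 <- rnorm(40) #noise column, because glmnet needs at least two predictors
y <- rep(0:1, each = 20)
glm(y ~ x1, family = "binomial") #diverges: glm warns about fitted probabilities of 0 or 1
ridge_sep <- glmnet(cbind(x1, x2), y, family = "binomial",
                    alpha = 0, lambda = c(10, 1, 0.1))
coef(ridge_sep, s = 0.1) #finite coefficients at an arbitrarily chosen penalty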

From various sources I gathered that multicollinearity makes plain logistic regression (and OLS) perform much worse. Also, for continuous outcome variables there are several examples where penalized regression is clearly better than OLS (here, or Section 1.1 of Bishop's Pattern Recognition and Machine Learning). However, I have not found such examples for binary outcomes. I also tried numerous scenarios myself, and plain logistic regression only lost when its MLE did not exist.
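
To illustrate the kind of continuous-outcome example I mean, here is a quick sketch of my own (not the example from Bishop; the sizes and coefficients are arbitrary assumptions): with $n$ barely above $p$, strongly correlated predictors, and mostly zero true coefficients, ridge typically achieves a clearly lower test MSE than OLS.

library(MASS)
library(glmnet)
set.seed(1)
p <- 50; n <- 60
Sigma <- 0.9^abs(outer(1:p, 1:p, "-")) #AR(1)-type correlation
beta_lin <- c(rep(1, 5), rep(0, p - 5)) #only 5 non-zero coefficients
x_tr <- mvrnorm(n, rep(0, p), Sigma)
y_tr <- drop(x_tr %*% beta_lin + rnorm(n))
x_te <- mvrnorm(10^4, rep(0, p), Sigma)
y_te <- drop(x_te %*% beta_lin + rnorm(10^4))
ols <- lm(y_tr ~ x_tr)
cv_ridge_lin <- cv.glmnet(x_tr, y_tr, alpha = 0)
mse_ols <- mean((y_te - cbind(1, x_te) %*% coef(ols))^2)
mse_ridge <- mean((y_te - predict(cv_ridge_lin, x_te, s = "lambda.min"))^2)
c(ols = mse_ols, ridge = mse_ridge)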

Here is one of my attempts below. I used correlated multivariate normal distribution to generate $X$, and then the class labels $y$ were generated from an unknown logistic model with parameters $\beta$. I changed the training set size, correlation matrices, and coefficients for data-generating, the optimization criterion for cv.glmnet but none of these helped the regularized methods. I would be grateful if someone could come up with a simple example when lasso or ridge clearly wins in terms of accuracy.

library(glmnet)
library(MASS)

Accuracy <- function(y, prob) mean(y == (prob > .5)) 

#generating data from logistic model
#beta is the vector of true data-generating coefficients
#cor_mat is the correlation matrix of the predictor variables
generate_data <- function(n, beta, cor_mat){
  n_pred <- length(beta) - 1 #number of predictors
  x <- mvrnorm(n, rep(0, n_pred), cor_mat)
  design_matrix <- cbind(1, x)

  #true probabilities from logistic model
  p <- as.numeric(1/(1+exp(- design_matrix %*% beta))) 
  #drawing outcomes from Bernoulli distribution
  y <- rbinom(n, 1, p) 
  dat <- as.data.frame(x)
  dat$y <- y; dat$p <- p
  return(dat)
} 

#AR(1)-style correlation matrix: entry (i, j) equals gamma^|i - j|
create_exponential_correl_mat <- function(n_pred, gamma){
  a <- c((n_pred-1):0, 1:(n_pred-1))
  exp_mat <- sapply(n_pred:1, function(x) a[x:(x+n_pred-1)])
  gamma^exp_mat
}
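
The helper above just builds the matrix $\gamma^{|i-j|}$; a quick sanity check against an equivalent one-liner (my addition, not needed for the simulation):

#should return TRUE: the matrix is gamma^|i - j|
all.equal(create_exponential_correl_mat(5, 0.9),
          0.9^abs(outer(1:5, 1:5, "-")))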

n_sim <- 50

n_t <- 100 #size of training set
n_pred <- 20 #number of predictors
beta <- rep(0, n_pred+1) 
coefs <- c(2,2,-2,-2)
beta[2:(length(coefs)+1)] <- coefs
beta #data generating coefficient vector

> 0  2  2 -2 -2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

The true data-generating coefficient vector; the first element is the intercept.

correl_mat <- create_exponential_correl_mat(n_pred, 0.9)
correl_mat[1:5, 1:5]
>            [,1]  [,2] [,3]  [,4]   [,5]
>     [1,] 1.0000 0.900 0.81 0.729 0.6561
>     [2,] 0.9000 1.000 0.90 0.810 0.7290
>     [3,] 0.8100 0.900 1.00 0.900 0.8100
>     [4,] 0.7290 0.810 0.90 1.000 0.9000
>     [5,] 0.6561 0.729 0.81 0.900 1.0000

The upper left corner of the correlation matrix.

log_acc <- ridge_acc <- lasso_acc <- numeric(n_sim)

set.seed(1337)
for (i in 1:n_sim){
  dat <- generate_data(n_t, beta, correl_mat)

  #Fitting the models on the data
  logist_m <- glm(y ~. -p, family = "binomial", data = dat)
  x <- model.matrix(y ~ .-p, dat)
  cv_ridge <- cv.glmnet(x, dat$y, family = "binomial", alpha = 0)
  cv_lasso <- cv.glmnet(x, dat$y, family = "binomial", alpha = 1)
  
  #Using an external data set to evaluate the models
  external <- generate_data(10^5, beta, correl_mat)
  x <- model.matrix(y ~.-p, external)
  lg_p <- predict(logist_m, external, type="response")
  rg_p <- predict(cv_ridge, x, type="response", s = "lambda.1se")
  ls_p <- predict(cv_lasso, x, type="response", s = "lambda.1se")
  log_acc[i] <- Accuracy(external$y, lg_p)
  ridge_acc[i] <- Accuracy(external$y, rg_p)
  lasso_acc[i] <- Accuracy(external$y, ls_p)
}

result_df <- data.frame(log_acc, ridge_acc, lasso_acc)
result_df$diff_ridge <- result_df$ridge_acc - result_df$log_acc
result_df$diff_lasso <- result_df$lasso_acc - result_df$log_acc

apply(result_df, 2, mean)

>        log_acc  ridge_acc  lasso_acc diff_ridge diff_lasso 
>      0.7275000  0.6997832  0.7247368 -0.0277168 -0.0027632

So logistic regression was on average 2.77 percentage points more accurate than the cross-validated ridge and 0.28 percentage points more accurate than the cross-validated lasso (both at lambda.1se), even though 16 of the 20 predictors have zero true coefficients and strong collinearity is present (every predictor is correlated at 0.9 with at least one other predictor).
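
To check that these averages are not hiding frequent wins for the penalized models, one can also look at how often each of them beats plain logistic regression across the 50 simulations (continuing from result_df above):

#fraction of simulations in which the penalized model was more accurate
colMeans(result_df[, c("diff_ridge", "diff_lasso")] > 0)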

The question is very similar to this one, but here I evaluate on an external data set and consider binary classification.

    Let's take a training set with only one feature $X$ and the target $Y$. Let's say all the examples where $X<0$ are negative ($Y=\text{false}$) and whenever $X>=0$ then $Y=\text{true}$ and there are many $X$ values in the interval $[-1,1]$. Now you introduce a single example $X=10000000$ (so $Y$ should be true) but $Y=\text{false}$. Then the bias should move towards a very big value because the punishment for misclassification of this outlier is so big. Then regularization should prevent that from happening and L1 or L2 should outperform "normal" logistic regression... – Fabian Werner Jun 14 '21 at 07:55
  • Thanks for the idea! I tried your scenario (a rough sketch of the setup is below the comment thread), but it seems to me that none of the 3 models can handle an outlier like that: they all predict the same probability for the "normal" $X$ values. However, I think it's still a good idea to use something other than a normal distribution for data generation; maybe mixtures would be good – zerz Jun 14 '21 at 08:32
  • section 1.1 of Pattern recognition and machine learning by chris bishop has a worked example with polynomial fitting to a sinusoid + gaussian noise – seanv507 Jun 15 '21 at 08:13
  • Thank you, but as far as I can see he used real-valued output $y$ not binary. I saw some examples where OLS was clearly outperformed by regularized versions, but I am specifically interested in the binary output case. – zerz Jun 15 '21 at 08:19
  • I would simulate the data using the transformation given here: https://stats.stackexchange.com/a/46525/247274. Start out with a linear model that has regularized regression greatly outperform OLS; then use the linear output as the `z` in that answer to transform that value to a probability `pr = 1/(1+exp(-z))` and then the probability to a category `y = rbinom(1000,1,pr) `. – Dave Jun 15 '21 at 13:19
  • Thanks @Dave for your answer, I generate the data very similarly. The only difference is that my `generate_data` function also uses a correlation matrix which introduces correlation between my predictor variables. Also, I was trying to use a similar setting where the OLS would be outperformed by regularized versions. That's why I used strong correlation and lots of independent predictors. – zerz Jun 15 '21 at 13:39
  • Why would you expect lasso or ridge regression to outperform logistic regression (especially logistic regression with a ridge or LASSO regularisation term)? I would have thought that logistic regression, having a more appropriate loss function, would make better use of the available data. Asymptotically as the dataset grows larger, both will give the same solution as they both will estimate the conditional mean of the target distribution? – Dikran Marsupial Jun 15 '21 at 13:56
  • @DikranMarsupial Why do you say that logistic regression has a more appropriate loss function? – Dave Jun 15 '21 at 14:01
  • I would think they were introduced as better alternatives for logistic regression in some situations. Also, for OLS there are some clear examples when ridge or lasso outperforms them. Is there something fundamentally different with logistic regression? – zerz Jun 15 '21 at 14:01
  • @Dave for classification problems the labels are discrete, so adopting a loss function that is designed for discrete labels ought to be a better assumption. In practice it tends not to make much difference in terms of accuracy (at least for kernel logistic and kernel ridge regression). Logistic regression is really more important if you need more calibrated probabilities. – Dikran Marsupial Jun 15 '21 at 14:13
  • I should add Vapnik says that you should avoid solving a problem by solving a more general problem and throwing away some information, so he might prefer ridge regression over ridge-logistic regression if you are only interested in accuracy? – Dikran Marsupial Jun 15 '21 at 14:22
  • Logistic regression can include regularization penalty, even with the probabilistic predictions. – Dave Jun 15 '21 at 14:30
  • I rarely use logistic regression without it (but I usually have kernels as well) – Dikran Marsupial Jun 15 '21 at 15:21
  • @DikranMarsupial, I am not sure if I understand you well. You are saying that the penalized versions should not outperform logistic, but you always use them anyway. Also, what do you mean I should not solve a more general problem? I am actually interested in binary classifications, so how can I use simple ridge instead of ridge-logistic? – zerz Jun 16 '21 at 09:55
  • @zerz it depends whether you are interested in accuracy, in which case linear regression is fine (equivalent to Fisher's linear discriminant analysis) or whether you want calibrated probabilities, in which case use logistic regression. Just use -1/+1 labels and put the decision boundary at f(x) = 0. If you estimate the probabilities and then threshold at 0.5, the model is wasting resources modelling data that is not near the decision boundary, which is why e.g. the SVM may give better classification accuracy as it focusses just on the boundary. – Dikran Marsupial Jun 16 '21 at 13:39
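
For reference, a minimal sketch of the single-outlier scenario from the comments (the extreme value $X = 10^7$ and the flipped label are taken from Fabian Werner's comment; the sample size, the noise column, and the evaluation grid are my own assumptions):

library(glmnet)
set.seed(1)
x <- runif(200, -1, 1)
y <- as.numeric(x >= 0) #clean labels: positive iff x >= 0
x <- c(x, 10^7); y <- c(y, 0) #one mislabeled extreme point
xm <- cbind(x, noise = rnorm(201)) #glmnet needs at least two columns
glm_out <- glm(y ~ x, family = "binomial")
ridge_out <- cv.glmnet(xm, y, family = "binomial", alpha = 0)
lasso_out <- cv.glmnet(xm, y, family = "binomial", alpha = 1)
#compare predicted probabilities on a clean grid of x values
grid <- cbind(x = seq(-1, 1, by = 0.25), noise = 0)
cbind(glm = predict(glm_out, data.frame(x = grid[, "x"]), type = "response"),
      ridge = as.numeric(predict(ridge_out, grid, type = "response", s = "lambda.1se")),
      lasso = as.numeric(predict(lasso_out, grid, type = "response", s = "lambda.1se")))

As noted in my reply above, in this kind of setup all three models ended up predicting essentially the same probabilities for the "normal" $X$ values.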
