I am surprised that R's glm "breaks" (does not converge with the default settings) on the following toy example (binary classification with ~50k rows and ~10 features), while glmnet returns results in seconds.

Am I using glm incorrectly (for example, should I raise the maximum number of iterations?), or is R's glm just not suited to "big data" settings? Does adding regularization make the problem easier to solve?
d = ggplot2::diamonds
d$price_c = d$price > 2500           # binary label: is the price above 2500?
d = d[, !names(d) %in% c("price")]   # drop raw price so the label is not leaked
# Unpenalized logistic regression via IRLS
lg_glm_fit = glm(price_c ~ ., data = d, family = binomial())

library(glmnet)
x = model.matrix(price_c ~ ., d)     # expand factors into a design matrix
y = d$price_c
# Ridge-penalized (alpha = 0) logistic regression
lg_glmnet_fit = glmnet(x = x, y = y, family = "binomial", alpha = 0)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
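For reference, here is a minimal sketch of what raising the iteration cap would look like (glm.control is base R; lg_glm_fit2 is just a name I made up). Since the coefficients diverge under (quasi-)complete separation, more iterations alone typically do not remove these warnings:

# Raise the IRLS iteration cap from its default of 25
lg_glm_fit2 = glm(price_c ~ ., data = d, family = binomial(),
                  control = glm.control(maxit = 100))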
EDIT: Thanks to Matthew Drury's and Jake Westfall's answers. I understand the perfect-separation issue, which is already addressed in How to deal with perfect separation in logistic regression?
And in my original code, I do have the third line, which drops the column the label is derived from.
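For completeness, one remedy suggested in that linked question is Firth's bias-reduced logistic regression, which keeps the estimates finite under separation. A minimal sketch, assuming the logistf package (untested on this dataset, and it may be slow at ~50k rows):

library(logistf)
# Firth's penalty bounds the likelihood, so the estimates stay finite
# even under complete separation; the logical label is coerced to 0/1
lg_firth_fit = logistf(as.numeric(price_c) ~ ., data = d)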
The reason I mention "big data" is that in many "big data" / "machine learning" settings, people may not carefully check assumptions or know whether the data can be perfectly separated. But glm seems to break easily, with unfriendly messages, and there is no easy way to add regularization to fix it (see the sketch below).
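To make that last point concrete, this is roughly the extra ceremony needed to get a regularized, usable fit out of glmnet (cv.glmnet and the coef method come from the glmnet package; cv_fit is my own name):

# Cross-validate the ridge penalty (alpha = 0) on the same x and y
cv_fit = cv.glmnet(x = x, y = y, family = "binomial", alpha = 0)
coef(cv_fit, s = "lambda.min")   # coefficients at the CV-chosen lambda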