
I am surprised that R's glm will "break" (not converge with the default settings) on the following "toy" example (binary classification with ~50k rows and ~10 features), while glmnet returns results in seconds.

Am I using glm incorrectly (for example, should I set the maximum number of iterations?), or is R's glm just not suited to a big-data setting? Does adding regularization make the problem easier to solve?

d = ggplot2::diamonds
d$price_c = d$price > 2500                 # binary label derived from price
d = d[, !names(d) %in% c("price")]         # drop the column the label is derived from

lg_glm_fit = glm(price_c ~ ., data = d, family = binomial())

library(glmnet)
x = model.matrix(price_c ~ ., d)           # expand factors into a numeric design matrix
y = d$price_c
lg_glmnet_fit = glmnet(x = x, y = y, family = "binomial", alpha = 0)   # ridge-penalized logistic regression

Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
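
(For reference, the iteration limit can be raised as in the sketch below; with separated data, though, extra iterations just push some coefficients further toward $\pm\infty$ rather than fixing anything.)

glm(price_c ~ ., data = d, family = binomial(),
    control = glm.control(maxit = 100))    # raise the IRLS iteration cap (default maxit is 25)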

EDIT: Thanks to Matthew Drury and Jake Westfall for their answers. I understand the perfect separation issue, which is already addressed in How to deal with perfect separation in logistic regression?

And in my original code, I do have the third line, which drops the column the label is derived from.

The reason I mention "big data" is that in many "big data" / "machine learning" settings, people may not carefully check assumptions or know whether the data can be perfectly separated. But glm seems to be easily broken, with unfriendly messages, and there is no easy way to add regularization to fix it.
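
(For reference, a minimal sketch of pulling coefficients out of the regularized fit above; the penalty value 0.01 is arbitrary here, and cv.glmnet can choose it by cross-validation instead.)

coef(lg_glmnet_fit, s = 0.01)                                      # coefficients at an arbitrarily chosen lambda

cv_fit = cv.glmnet(x = x, y = y, family = "binomial", alpha = 0)   # choose lambda by cross-validation
coef(cv_fit, s = "lambda.min")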

Haitao Du
  • In general R won't work in a big data setting since it loads all your data into RAM, so for terabytes of data you would need even more terabytes of RAM... – Tim Oct 11 '16 at 10:49
  • Regarding "Does adding regularization make a problem easy to solve": yes, it can. See http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf: "Inclusion of $\lambda$ makes problem non-singular even if $Z^T Z$ is not invertible" (a short sketch of this point follows these comments). – Adrian Oct 11 '16 at 10:51
  • ...and your perfect separation / non-invertibility issue doesn't necessarily have anything to do with the size of your training data (i.e. whether you're dealing with "big data" or not) – Adrian Oct 11 '16 at 10:53
  • Personally, I think those messages are very friendly. If you have a 50,000 x 10 data set with perfect separation, and don't realize it, I don't think you've learned enough about your data. – Cliff AB Oct 11 '16 at 14:53
  • `glm` doesn't fit regularized models because it's not designed to do so. Regularization will induce convexity and hence finite parameter estimates; however, it is not uniformly true that regularization is the correct answer in all contexts. Even if convex models are desired, they can be constructed from alternative, non-regularized models, so it's a matter of choosing the correct tool for the task at hand. Please see Scortchi's answer here: http://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression – Sycorax Oct 11 '16 at 15:20
  • @CliffAB: in order to be friendly, the messages would want to say *"fitted probabilities numerically 0 or 1 occurred **for variable varA [,varB,...]**"* – smci Oct 11 '16 at 22:11
  • @smci: unfortunately, that's not really a viable option. For a variety of reasons, it's not so easy to pinpoint the "responsible variable" in this problem, nor is the idea of a responsible variable even well defined. They **can** tell you that your estimated probabilities are numerically equal to 0 or 1, which should indicate that there are potential problems with your model. – Cliff AB Oct 11 '16 at 22:27
  • This is an FAQ: http://stats.stackexchange.com/search?q=fitted+probabilities+numerically+0+or+1+occurred. But it can't be said often enough that nothing's "breaking": predicted probabilities of zero or one aren't inherently absurd, & it's only the Wald standard errors that'll be way out. – Scortchi - Reinstate Monica Oct 12 '16 at 14:48
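
To spell out the quoted point: in the ridge (penalized least-squares) version of the argument,

$$\hat{\beta}_{\text{ridge}} = (Z^T Z + \lambda I)^{-1} Z^T y,$$

and for any $\lambda > 0$ the matrix $Z^T Z + \lambda I$ is positive definite, hence invertible, even when $Z^T Z$ itself is singular. The penalized logistic case behaves analogously in that the penalty keeps the optimum finite and unique.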

2 Answers


This has nothing to do with glm; you simply created a problem with artificial perfect separation:

df <- data.frame(x = rnorm(100), y = rnorm(100))
df$y_c = df$y > 0                              # label derived directly from y

glm(y_c ~ ., data = df, family = binomial())   # y is still among the predictors, so it separates y_c perfectly

Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 

y is a perfect predictor of y_c.
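
Dropping the source variable removes the artificial separation; continuing the toy example above, the same kind of call then converges without complaint:

glm(y_c ~ x, data = df, family = binomial())   # x is unrelated to y_c, so nothing separates it and the fit converges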

Matthew Drury
  • While I think that perfect separation is still the basic issue, @hxd1011 excludes the price variable from d before fitting the model... see third line of code, though I am a bit unsure whether this works with operator precedence. – Erik Oct 11 '16 at 10:45
  • @Erik even without that variable, it's possible to have perfect separation. Even `lg_glm_fit=glm(price_c ~ carat, data=d, family = binomial())` starts to run into problems – Adrian Oct 11 '16 at 13:12
  • @Adrian Sure, that is why I said that the basic issue is correctly identified. However, note that this is not *artificial* perfect separation anymore. – Erik Oct 11 '16 at 13:33
  • I wouldn't get too hung up on Matthew's use of the word "artificial". Perfect separation is the issue. This isn't a problem with `glm` or with big data -- it's a problem with fitting an unregularized model when the solution is not unique (which can happen in big or small datasets). – Adrian Oct 11 '16 at 13:46
  • @Adrian True, which is why I did not downvote. Still, my comment and yours right now would improve the answer if they got included. – Erik Oct 11 '16 at 14:11
  • I meant "artificial" in that the poster created it by leaving `y` in the regression, which should not be there. Even if it did *not* fail so catastrophically, it would still be data leakage. – Matthew Drury Oct 11 '16 at 14:44
  • I must say @Erik, I unfortunately don't quite see your point on the price variable. But if you think it would be better to add some things to the answer, please feel free to edit. – Matthew Drury Oct 11 '16 at 14:45
  • @MatthewDrury y is taken out of the data (and the model) in the third line of the code. – Erik Oct 11 '16 at 14:52
  • Oh. So it is. Why is this getting so many upvotes? I feel a fraud. – Matthew Drury Oct 11 '16 at 15:13
  • @MatthewDrury I think you posted your answer before `d=d[,!names(d) %in% c("price")]` was added (i.e. the original question was edited), right? In any case, there's still a problem of perfect separation after that edit, but it's no longer "artificial" – Adrian Oct 11 '16 at 15:23
  • @Adrian my original question has that line... but both answers fail to see that. – Haitao Du Oct 11 '16 at 22:50

The unregularized model is suffering from complete separation because you are trying to predict the dichotomized variable price_c from the continuous variable price from which it is derived.

The regularized model avoids the problem of complete separation by imposing a penalty that keeps the coefficient for the price predictor from going off to $\infty$ or $-\infty$. So it manages to converge fine and work well.

You should remove the continuous price predictor from the design matrix in this toy example.

Edit: As @Erik points out, the continuous price predictor is already removed from the design matrix, which I somehow missed. So the complete separation arises from some other predictor or combination of predictors.

It's also worth adding that, of course, none of these issues have anything to do with the particular implementation of logistic regression in R's glm() function. It is simply about regularized vs. unregularized logistic regression.
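
To make the regularized vs. unregularized contrast concrete, here is a minimal, self-contained sketch (toy data, not the diamonds example): with a perfectly separating predictor, glm's estimate blows up with warnings, while the ridge-penalized glmnet estimate stays finite.

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- as.numeric(x1 > 0)                            # x1 separates y perfectly

coef(glm(y ~ x1 + x2, family = binomial()))         # huge coefficient on x1, plus the familiar warnings

library(glmnet)
fit <- glmnet(cbind(x1, x2), y, family = "binomial", alpha = 0)   # ridge-penalized logistic regression
coef(fit, s = 0.1)                                  # penalized coefficients at lambda = 0.1 stay finite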

Jake Westfall