
Is there a method for regression when n >> p and my response variable Y has a binomial distribution? I have a data set with 350,000 observations and 4 variables. I am using R (and also SAS).

Some help would be appreciated.

user44677
  • What's wrong with using regular logistic regression when you have a lot of data? I don't see any problem here. – gung - Reinstate Monica May 10 '17 at 16:20
  • Heck, with so many observations and only 4 variables, why not go completely nonparametric? – generic_user May 10 '17 at 16:42
  • Tell us a little bit more about your goals. Do you need an interpretable model? Then go for logistic regression, as suggested by @gung (actually, in this kind of problem I would always start with l.r., to have some kind of baseline). Do you need to maximise predictive accuracy? Nonparametric approaches such as neural networks, random forests, SVMs, gradient boosting, etc. may work. If you give us more details on the problem you can get a more specific suggestion. With only 4 features I don't think you'll ever get great results, though (unless it's a very simple problem). – DeltaIV May 10 '17 at 21:14
  • Thanks for the replies. Yes, I need an interpretable model, so I did logistic regression in R, but I got this **Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred**. Can we use a form of penalized logistic regression? – user44677 May 11 '17 at 07:31
  • My data set has 259,000 observations and the binary variable Y=1 has low occurrence. How can we deal with that? – user44677 May 11 '17 at 11:54
  • 1. Please use the @username tag to notify me of your replies, or I'll miss them. 2. Penalized logistic regression is super easy in R with the package `glmnet`. 3. There are a lot of questions here about using logistic regression on unbalanced data sets, for example https://stats.stackexchange.com/questions/91305/how-to-choose-the-cutoff-probability-for-a-rare-event-logistic-regression. Try searching the site for more. – DeltaIV May 12 '17 at 14:36
  • 4. Finally, a classic paper on learning with unbalanced data sets: http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf – DeltaIV May 12 '17 at 14:43
  • PS: note that often the message you're getting is due to a linearly separable problem. Logistic regression models can't be stably fitted to linearly separable problems (search the site or see https://stats.stackexchange.com/questions/279701/r-logistic-regression/279717#comment535962_279717). I'm actually surprised that with ~260k samples and just 4 variables you get a linearly separable problem, but it could happen. For example, the output could be just the sign of one of your inputs! Check for such trivial stuff. – DeltaIV May 15 '17 at 16:26
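DeltaIV's last point is easy to demonstrate. Here is a minimal base-R sketch (toy data, not the OP's): when the outcome is exactly the sign of one input, the data are perfectly separable and `glm` produces the very warning the OP saw.

```r
# Toy data where y is exactly the sign of x: perfectly separable.
set.seed(1)
x <- rnorm(1000)
y <- as.integer(x > 0)

# Maximum likelihood has no finite optimum here, so glm() warns
# "glm.fit: fitted probabilities numerically 0 or 1 occurred"
# and returns a huge slope coefficient.
fit <- glm(y ~ x, family = binomial)

summary(fit)$coefficients   # note the enormous estimate and standard error
range(fitted(fit))          # fitted probabilities pile up near 0 and 1
```

Running a check like this on each of the 4 predictors (e.g. `table(y, x1 > 0)`) is a quick way to spot such a trivial separation.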

1 Answer


Partially answered in comments:

What's wrong with using regular logistic regression when you have a lot of data? I don't see any problem here. – gung

Heck, with so many observations and only 4 variables, why not go completely nonparametric? – generic_user

Tell us a little bit more about your goals. Do you need an interpretable model? Then go for logistic regression, as suggested by @gung (actually, in this kind of problem I would always start with l.r., to have some kind of baseline). Do you need to maximize predictive accuracy? Nonparametric approaches such as neural networks, random forest, SVM, gradient boosting, etc. may work. If you give us more details on the problem you can get a more specific suggestion. With only 4 features I don't think you'll ever get great results, though (unless it's a very simple problem). – DeltaIV

The OP's comments show that the real problem is probably linear separation, which leads to fitted probabilities of exactly 0 or 1 (and those will not generalize well to new data). How to deal with (quasi-)separation is answered here: How to deal with perfect separation in logistic regression?
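If the OP wants to keep an interpretable linear model, the penalized logistic regression mentioned in the comments is one way to cope with (quasi-)separation: the penalty keeps the coefficients finite. A sketch with `glmnet` on toy separable data (assumes the `glmnet` package is installed; variable names are illustrative, not the OP's):

```r
library(glmnet)

# Toy separable data: y is the sign of the first of 4 predictors,
# matching the question's n >> p setup on a smaller scale.
set.seed(1)
X <- matrix(rnorm(1000 * 4), ncol = 4)
y <- as.integer(X[, 1] > 0)

# alpha = 0 gives ridge (L2), alpha = 1 gives lasso (L1);
# cv.glmnet chooses the penalty strength lambda by cross-validation.
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0)

coef(cvfit, s = "lambda.min")   # finite, shrunken coefficients
head(predict(cvfit, newx = X, s = "lambda.min", type = "response"))
```

Firth's bias-reduced logistic regression (e.g. the `logistf` package) is another standard remedy discussed in the linked thread.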

My data set has 259,000 observations and the binary variable Y=1 has low occurrence. How can we deal with that? – OP

Finally, a classic paper on learning with unbalanced data sets. – DeltaIV
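On the class-imbalance point: with a rare positive class the logistic model itself is usually fine, but the default 0.5 cutoff may classify everything as 0. One common remedy from the linked cutoff thread is to move the decision threshold, e.g. to the observed event rate. A base-R sketch on simulated rare-event data (not the OP's data):

```r
# Simulated rare-event data: only a few percent positives.
set.seed(2)
n <- 100000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-4 + x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)

mean(y)                             # the (low) event rate
sum(p > 0.5)                        # default cutoff: almost nothing predicted positive
# Cutting at the event rate instead yields a usable classifier:
table(pred = p > mean(y), truth = y)
```

The right cutoff ultimately depends on the costs of false positives versus false negatives, which is the main message of the linked thread.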

Sven Hohenstein
kjetil b halvorsen