I want to fit a model to a very large dataset, with a standard binary response variable and with 3 categorical predictor variables with 3000, 15 and 2 levels. Is there any inherent problem in this case, if I want to use logistic regression? Will it be computationally manageable using R?


asked by Macond
- Yes, it's doable if you have a large enough sample size (recall that each additional level essentially eats up a degree of freedom). Interpretability will probably be an issue, however. Can you give more information about your dataset? – Affine Sep 16 '14 at 13:32
- In particular, with respect to information about your dataset: what type of categorical variable has 3000 levels? – EdM Sep 16 '14 at 13:37
- Are you sure that all 3000 levels are distinct and that many cannot be combined? – JenSCDC Sep 16 '14 at 14:59
- I'm analysing genomic data, trying to figure out whether some types of mutations are more fit than others. So there is a response variable categorizing them as fit/unfit with some algorithm; the predictors with 2 and 12 levels correspond to factors with known mutation biases. My variable of interest (mutation type) can be formulated in ~3000 levels in the most comprehensive approach, and all of these levels are independent for sure. My sample size (number of mutations) is on the order of 10 million, by the way. – Macond Sep 16 '14 at 16:05
- See https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Apr 01 '21 at 00:33
1 Answer
Yes, you can do this. I would recommend glmnet, which takes a (sparse) matrix rather than a data frame as input. Creating the sparse model matrix can be somewhat memory-intensive; have a look at this blog post: creating sparse dummy variables matrix for glmnet.
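To illustrate the idea behind that recommendation, here is a minimal sketch of a penalized logistic regression fit on a sparse one-hot design matrix. It uses Python with scikit-learn and simulated data (with the 3000-level factor scaled down to 300) purely as an analogue of glmnet's sparse-matrix workflow, not as the answer's actual R code:

```python
# Sketch with simulated data: logistic regression on a sparse one-hot
# design matrix, analogous to glmnet's sparse workflow in R.
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 5000
# Three categorical predictors; the 3000-level factor is shrunk to 300 here.
X_cat = np.column_stack([
    rng.integers(0, 300, n),
    rng.integers(0, 15, n),
    rng.integers(0, 2, n),
]).astype(str)
y = rng.integers(0, 2, n)  # binary response

# One-hot encode into a scipy sparse matrix: one (mostly zero) column
# per factor level, so memory grows with nonzeros rather than n * levels.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_cat)
print(X.shape, sparse.issparse(X))  # roughly (5000, 317), True

# liblinear accepts sparse input; the L2 penalty also stabilizes
# coefficient estimates for rare levels.
clf = LogisticRegression(solver="liblinear", C=1.0)
clf.fit(X, y)
```

The same shape of workflow applies at full scale: with ~3000 + 15 + 2 levels and millions of rows, the dense design matrix would be prohibitively large, while the sparse one stores only one nonzero per predictor per row.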

answered by seanv507