I want to fit a model to a very large dataset, with a standard binary response variable and with 3 categorical predictor variables with 3000, 15 and 2 levels. Is there any inherent problem in this case, if I want to use logistic regression? Will it be computationally manageable using R?


asked by Macond
- Yes, it's doable if you have a large enough sample size (recall that each additional level essentially eats up a degree of freedom). Interpretability will probably be an issue, however. Can you give more information about your dataset? – Affine Sep 16 '14 at 13:32
- In particular, with respect to information about your dataset: what type of categorical variable has 3000 levels? – EdM Sep 16 '14 at 13:37
- Are you sure that all 3000 levels are distinct and that many cannot be combined? – JenSCDC Sep 16 '14 at 14:59
- I'm analysing genomic data, trying to figure out whether some types of mutations are more fit than others. So there is a response variable categorizing them as fit/unfit with some algorithm; the predictors with 2 and 12 levels correspond to factors with known mutation biases. My variable of interest (mutation type) can be formulated in ~3000 levels in the most comprehensive approach, and all of these levels are independent for sure. My sample size (number of mutations) is on the order of 10 million, by the way. – Macond Sep 16 '14 at 16:05
- See https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Apr 01 '21 at 00:33
1 Answer
Yes, you can do this. I would recommend glmnet, which takes a (sparse) matrix rather than a data frame as input. Creating the sparse model matrix can be somewhat memory-intensive; have a look at this blog post: creating sparse dummy variables matrix for glmnet.
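To illustrate the idea behind that recommendation, here is a minimal sketch of a penalized logistic regression fit on a sparse one-hot design matrix. It uses Python with scikit-learn and simulated data (with the 3000-level factor scaled down to 300) purely as an analogue of glmnet's sparse-matrix workflow, not as the answer's actual R code:

```python
# Sketch with simulated data: logistic regression on a sparse one-hot
# design matrix, analogous to glmnet's sparse workflow in R.
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 5000
# Three categorical predictors; the 3000-level factor is shrunk to 300 here.
X_cat = np.column_stack([
    rng.integers(0, 300, n),
    rng.integers(0, 15, n),
    rng.integers(0, 2, n),
]).astype(str)
y = rng.integers(0, 2, n)  # binary response

# One-hot encode into a scipy sparse matrix: one (mostly zero) column
# per factor level, so memory grows with nonzeros rather than n * levels.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_cat)
print(X.shape, sparse.issparse(X))  # roughly (5000, 317), True

# liblinear accepts sparse input; the L2 penalty also stabilizes
# coefficient estimates for rare levels.
clf = LogisticRegression(solver="liblinear", C=1.0)
clf.fit(X, y)
```

The same shape of workflow applies at full scale: with ~3000 + 15 + 2 levels and millions of rows, the dense design matrix would be prohibitively large, while the sparse one stores only one nonzero per predictor per row.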

answered by seanv507