Logistic Regression with Categorical Variables: 1: glm.fit: algorithm did not converge 2: glm.fit: fitted probabilities numerically 0 or 1 occurred

Question

In my model, the Response variable is binary (0s and 1s).

I have 15 categorical variables, some of which have 150+ levels. Should I potentially exclude them from my model?

When I run fullmodel <- glm(Response ~ Category1 + Category2 + ... + Category15 - 1, data = dataframe, family = "binomial") I get:

1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

Should I exclude categories with many levels? Ideally, I would decide this by running anova on the full model and a model with that category omitted:

anova(fullmodel, model_test, test="LRT")

Note that for model_test the GLM converges fine.

Problems occur when you don't have the information content to support estimation of the parameters. If you don't use penalized maximum likelihood estimation, you'll need roughly 15 observations times the number of categories (all predictors combined). At a minimum you need at least 15 observations (and ideally 96) in the least frequent category over all predictors. — Frank Harrell, Sep 17 '17 at 13:27
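A minimal sketch of how the 15-observation rule of thumb above could be checked in R. The toy data frame, its column names, and its values are made up for illustration and stand in for the real dataframe; only the threshold of 15 comes from the comment.

```r
# Toy data standing in for the real data frame (hypothetical names and values).
set.seed(1)
dataframe <- data.frame(
  Response  = rbinom(200, 1, 0.5),
  Category1 = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  # "rare" is placed exactly once so that one level has a single observation.
  Category2 = factor(c("rare", sample(c("x", "y"), 199, replace = TRUE)))
)

# Count of the least frequent level in each factor column.
is_cat <- vapply(dataframe, is.factor, logical(1))
min_counts <- vapply(dataframe[is_cat], function(f) min(table(f)), integer(1))
print(min_counts)

# Factors whose rarest level falls below the 15-observation rule of thumb.
sparse <- names(min_counts)[min_counts < 15]
print(sparse)
```

On the real data, table(dataframe$City) (or the equivalent for any other column) would show which individual levels are responsible.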
My data set has 160,000 observations, but I could still have 1 category, where a single level only has 1 observation... I'm guessing that's where the problem comes from. What do you suggest to do? — GRS, Sep 17 '17 at 13:32
Also, I have some blanks in the categorical variables. I don't know how R handles blanks, but I assumed it was okay — GRS, Sep 17 '17 at 13:33
@GRS from your explanation, it seems like a problem of sparsity. Perhaps you could collapse some groups. Blanks would be treated as NA (provided they were read in as missing values rather than as empty strings), and R applies listwise deletion to rows containing NA. — tatami, Sep 17 '17 at 13:33
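Whether blanks end up as NA depends on how the file was read. A small sketch, using a made-up three-row CSV (the City column and its values are hypothetical): with read.csv defaults, an empty field in a text column becomes the empty-string level "", and it only becomes NA when "" is listed in na.strings.

```r
# Hypothetical CSV with one blank City value in the second row.
csv_text <- "Response,City\n1,London\n0,\n1,Paris\n"

# Default behaviour: na.strings = "NA", so the blank stays an empty string.
default_read <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Explicitly declare empty strings as missing.
blank_as_na <- read.csv(text = csv_text, stringsAsFactors = TRUE,
                        na.strings = c("NA", ""))

print(levels(default_read$City))      # "" shows up as its own level
print(sum(is.na(default_read$City)))  # number of NA values with defaults
print(sum(is.na(blank_as_na$City)))   # number of NA values with "" as NA
```

In the first case glm would estimate a coefficient for the "" level; in the second, the default na.action (na.omit) would drop those rows before fitting.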
When I remove categories with a large number of levels, it seems to run fine... but then I'm missing 3 categories, e.g. City. I was thinking of trying to run K-means on the data to get some insight that way — GRS, Sep 17 '17 at 13:35
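As an alternative to dropping a high-cardinality variable such as City, rare levels could be pooled into a single level, in the spirit of the collapsing suggested in the comments. A rough base-R sketch: collapse_rare is a hypothetical helper, not a function from any package, and the cutoff of 15 simply reuses the rule of thumb quoted earlier.

```r
# Hypothetical helper: merge every level with fewer than min_n observations
# into one pooled level named by `other`.
collapse_rare <- function(f, min_n = 15, other = "Other") {
  f <- as.factor(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  # Assigning the same name to several levels merges them into one.
  levels(f)[levels(f) %in% rare] <- other
  f
}

# Made-up example: two common cities and three singletons.
city <- factor(c(rep("London", 20), rep("Paris", 18), "Leeds", "Lyon", "Bath"))
pooled <- collapse_rare(city)
print(table(pooled))
```

Pooling changes what the coefficient means (it becomes an average effect over the rare levels), so the choice of cutoff is a modelling decision rather than a purely technical fix.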
Link to dataset if interested: https://drive.google.com/file/d/0B4hDcTPeFihWSnhtamJaVDdMWGQyOGpRR3Q0OUE2RFBPU0Zv/view?usp=sharing — GRS, Sep 17 '17 at 13:48
You have several fundamental problems including proper multiple imputation of missing values that will require a lot of study. My [RMS notes](http://www.fharrell.com/p/blog-page.html) cover many issues related to your goals, and point you to many articles you will need to read to properly understand the issues. It would be a mistake to proceed at this point. — Frank Harrell, Sep 17 '17 at 16:23
Excluding categories is one of the options mentioned in https://stats.stackexchange.com/questions/45803/logistic-regression-in-r-resulted-in-perfect-separation-hauck-donner-phenomenon and https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression and https://stats.stackexchange.com/questions/5354/logistic-regression-model-does-not-converge — Sycorax, Aug 21 '18 at 23:29

0 Answers