Logit model for hundreds of items - can and should I use the items as a category variable?

Question

I am in the early phase of a new project looking at multiple factors that may influence the probability that an item fails quality inspection.

I am interested in seeing whether each operator, machine, product, and person responsible for the change-over significantly affects the failure rate.

The problem is that I know the items have different failure rates, even within aggregate groups such as product family.
So should I just continue with the project and make a categorical variable with hundreds of categories? What are the risks and downsides of this, and are there any obvious things I should do differently?

(I guess as a side-note I can mention that what I really wanted was an N-way ANOVA but that was not possible as I am looking at binary data.)

Does that categorical variable with hundreds of levels code the different items? You could do that, and build a logistic regression using regularization (probably fused lasso), see the answers [here](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels). If some machines (or operators, ...) are irrelevant for some items, see [here](https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258). Alternatively, you could try separate models for each item, but a joint model is probably better. - kjetil b halvorsen, Oct 10 '19 at 20:54

kjetil b halvorsen · Accepted Answer · 2019-10-13T22:38:36.857

I am assuming that the categorical variable with hundreds of levels codes the different items. You can certainly do that and build a logistic regression model, probably with some form of regularization, maybe the fused lasso; see [Principled way of collapsing categorical variables with many levels?](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels). That gives a model with a separate intercept for each item, but the slopes for the other covariates are the same across items. If that seems unreasonable, you can try to include interactions, but be careful: the number of parameters could explode. If some machines (or operators, ...) are irrelevant for some items, see [How do you deal with "nested" variables in a regression model?](https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258).
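
Written out, with notation introduced here purely for illustration, the model with a separate intercept per item and shared slopes is

$$
\operatorname{logit} P(y_{ij} = 1) = \alpha_i + x_{ij}^\top \beta, \qquad i = 1, \dots, I,
$$

where $y_{ij}$ indicates whether unit $j$ of item $i$ fails inspection, $\alpha_i$ is the intercept for item $i$, and $\beta$ holds the $p$ coefficients of the dummy-coded operator, machine, and other covariates. This model has $I + p$ parameters. Interacting every covariate with the item replaces $\beta$ by an item-specific $\beta_i$:

$$
\operatorname{logit} P(y_{ij} = 1) = \alpha_i + x_{ij}^\top \beta_i,
$$

which has $I(1 + p)$ parameters, the same number as fitting a separate model for each item. As a hypothetical example, $I = 300$ and $p = 20$ give $320$ parameters without interactions and $300 \times 21 = 6300$ with them.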

Alternatively, you could build a separate model for each item, but that would require an enormous sample size, because every item would need enough data to estimate its own full set of coefficients. So the first option is probably preferable.
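
The first option (one joint model with a separate intercept per item and shared slopes) might be sketched roughly as follows. This is only an illustration under several assumptions: the data are simulated, there is a single binary covariate `x` standing in for something like "machine B instead of machine A", the penalty is a plain ridge (L2) penalty on the item intercepts rather than the fused lasso suggested above, and the fit uses simple pure-Python coordinate-wise updates instead of a statistics package. The function name `fit_item_logit` and all parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_item_logit(items, x, y, n_items, lam=1.0, n_sweeps=300):
    """Ridge-penalized logit: logit(p) = mu + a[item] + b * x.

    Minimizes the negative log-likelihood plus (lam / 2) * sum(a_i ** 2).
    Only the item deviations a_i are penalized, so items with little data
    are shrunk toward the common intercept mu.
    """
    mu, b = 0.0, 0.0
    a = [0.0] * n_items
    rows = [[] for _ in range(n_items)]  # observation indices per item
    for k, i in enumerate(items):
        rows[i].append(k)
    n = len(y)
    sxx = sum(v * v for v in x)  # assumes x is not all zeros

    def resid(k):
        # Fitted probability minus observed outcome for observation k.
        return sigmoid(mu + a[items[k]] + b * x[k]) - y[k]

    for _ in range(n_sweeps):
        # Each update divides the gradient by an upper bound on the
        # curvature (p * (1 - p) <= 1/4), so the objective never increases.
        mu -= sum(resid(k) for k in range(n)) / (n / 4.0)
        b -= sum(resid(k) * x[k] for k in range(n)) / (sxx / 4.0)
        for i in range(n_items):
            grad = sum(resid(k) for k in rows[i]) + lam * a[i]
            a[i] -= grad / (len(rows[i]) / 4.0 + lam)
    return mu, a, b


# Simulated (hypothetical) data: 30 items, 40 inspections each.
random.seed(0)
n_items, n_per_item = 30, 40
true_a = [random.gauss(-2.0, 0.7) for _ in range(n_items)]  # item log-odds
true_b = 0.8  # shared effect of the binary covariate
items, x, y = [], [], []
for i in range(n_items):
    for _ in range(n_per_item):
        xi = random.randint(0, 1)
        p = sigmoid(true_a[i] + true_b * xi)
        items.append(i)
        x.append(xi)
        y.append(1 if random.random() < p else 0)

mu, a, b = fit_item_logit(items, x, y, n_items, lam=1.0)
print(f"common intercept: {mu:.2f}, shared slope: {b:.2f}")
print(f"largest item deviation: {max(abs(v) for v in a):.2f}")
```

Larger values of `lam` pull the item intercepts closer together; a very large value effectively ignores the item, while `lam` near zero approaches one free intercept per item. In practice the penalty strength would be chosen by cross-validation, and an established package would replace the hand-written fitting loop.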
