
I am in the early phase of a new project looking at multiple factors that may influence the probability that an item fails quality inspection.

I am interested in whether the operator, the machine, the product, and the person responsible for the change-over each significantly affect the failure rate.

The problem is that I know the items have different failure rates, even within aggregate groups such as product family.
So should I just continue with the project and create a categorical variable with hundreds of categories? What are the risks and downsides of this, and are there any obvious things I should do differently?

(I guess as a side-note I can mention that what I really wanted was an N-way ANOVA but that was not possible as I am looking at binary data.)

kjetil b halvorsen
Patrick
  • Does that categorical variable with hundreds of levels code the different items? You could do that and build a logistic regression using regularization (probably the fused lasso); see the answers [here](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels). If some machines (or operators, ...) are irrelevant for some items, see [here](https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258). Alternatively, you could try separate models for each item, but a joint model is probably better. – kjetil b halvorsen Oct 10 '19 at 20:54

1 Answer


I am assuming that the categorical variable with hundreds of levels codes the different items. You can certainly do that and fit a logistic regression model, probably with some form of regularization, maybe the fused lasso; see [Principled way of collapsing categorical variables with many levels?](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels). That gives a model with a separate intercept for each item, while the effects of the other covariates (operator, machine, ...) are shared across all items. If that seems unreasonable, you can include interactions, but be careful: the number of parameters can explode. If some machines (or operators, ...) are irrelevant for some items, see [How do you deal with "nested" variables in a regression model?](https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258).
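To make the idea of one intercept per item plus shared effects concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the fused lasso itself: it swaps in a simpler ridge (L2) penalty, fits by plain gradient descent with a fixed number of steps, and runs on simulated data. The function name `fit_penalized_logit`, the sizes (30 items, 3 machines), and all effect values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_penalized_logit(rows, y, n_features, lam=1.0, lr=0.5, n_iter=400):
    """Ridge-penalized logistic regression via plain gradient descent.

    rows[i] holds the indices of the dummy columns equal to 1 for
    observation i (one index per factor). The intercept is not penalized.
    Runs a fixed number of steps; there is no convergence check.
    """
    n = len(y)
    intercept = 0.0
    coefs = [0.0] * n_features
    for _ in range(n_iter):
        grad0 = 0.0
        grad = [lam * c / n for c in coefs]  # gradient of the L2 penalty
        for active, yi in zip(rows, y):
            p = sigmoid(intercept + sum(coefs[j] for j in active))
            err = (p - yi) / n
            grad0 += err
            for j in active:
                grad[j] += err
        intercept -= lr * grad0
        coefs = [c - lr * g for c, g in zip(coefs, grad)]
    return intercept, coefs


# Simulated stand-in data (hypothetical): item and machine effects on the logit scale.
random.seed(0)
n_items, n_machines, n_obs = 30, 3, 1500
item_effect = [random.gauss(0.0, 1.0) for _ in range(n_items)]
machine_effect = [0.0, 0.5, -0.5]

rows, y = [], []
for _ in range(n_obs):
    item = random.randrange(n_items)
    machine = random.randrange(n_machines)
    p_fail = sigmoid(-2.0 + item_effect[item] + machine_effect[machine])
    rows.append((item, n_items + machine))  # dummy-column indices
    y.append(1 if random.random() < p_fail else 0)

n_features = n_items + n_machines
b0_weak, coef_weak = fit_penalized_logit(rows, y, n_features, lam=0.1)
b0_strong, coef_strong = fit_penalized_logit(rows, y, n_features, lam=200.0)


def spread(coefs):
    """Sum of squared item effects (the first n_items coefficients)."""
    return sum(c * c for c in coefs[:n_items])


print("item-effect spread, weak penalty:  ", round(spread(coef_weak), 3))
print("item-effect spread, strong penalty:", round(spread(coef_strong), 3))
```

The stronger penalty pulls the item-specific effects toward a common intercept, which is the sense in which regularization keeps hundreds of levels manageable. In practice you would use a statistics package and choose the penalty by cross-validation rather than hand-rolled gradient descent.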

Alternatively, you could build separate models for each item, but with hundreds of items that would require an enormous sample size, so the first option is probably preferable.

kjetil b halvorsen