
I am building a logistic regression model to predict whether a product meets the standard. The data look like this:

[Image of the example data; not available]

Each Production Batch contains several different Products.

There are always some Production Batches in which the whole batch fails to meet the standard. On the other hand, there are always some batches in which every product meets it.

Some Production Batches fail entirely (every product in the batch fails to meet the standard, like 113144), and some pass entirely (every product in the batch meets the standard, like 345118). Would it be better to exclude these batches when building the model?

Thank you.

Mark K

1 Answer


Excluding them seems like a bad idea, as presumably they carry information about whatever variables you are interested in. The thing to beware of is the possibility that your model may suffer from separation, which occurs when, for some value or combination of values of the covariates, all units have the same outcome (all zero or all one). There are many posts on this site about separation which you can examine if it happens. You will know it has happened because the affected coefficients in your logistic regression will go off to infinity (or minus infinity); in practice the estimates become very large, with enormous standard errors. What you do about it depends on your scientific question.
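To make the separation issue concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the rows are made up (only the batch IDs 113144 and 345118 come from the question; 200001 is hypothetical), and the helper names `constant_outcome_batches` and `fit_slope` are invented for this example. The first part flags batches whose products all share one outcome; the second shows that, on perfectly separated toy data, the fitted coefficient keeps growing the longer the optimiser runs instead of settling on a finite value.

```python
import math
from collections import defaultdict


def constant_outcome_batches(rows):
    """Return {batch_id: outcome} for batches whose products all share one outcome."""
    outcomes = defaultdict(set)
    for batch_id, met_standard in rows:
        outcomes[batch_id].add(met_standard)
    return {b: next(iter(s)) for b, s in outcomes.items() if len(s) == 1}


def fit_slope(xs, ys, steps, lr=0.5):
    """Gradient ascent on the log-likelihood of a one-coefficient logistic model
    (no intercept). Returns the coefficient after `steps` updates."""
    beta = 0.0
    for _ in range(steps):
        grad = sum(
            (y - 1.0 / (1.0 + math.exp(-beta * x))) * x for x, y in zip(xs, ys)
        )
        beta += lr * grad
    return beta


# Hypothetical rows: (batch id, 1 if the product met the standard, else 0).
rows = [
    (113144, 0), (113144, 0),   # whole batch fails
    (345118, 1), (345118, 1),   # whole batch passes
    (200001, 0), (200001, 1),   # mixed batch (made-up ID)
]
flagged = constant_outcome_batches(rows)
print(flagged)  # {113144: 0, 345118: 1}

# Perfectly separated toy data: every x < 0 has y = 0, every x > 0 has y = 1.
xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
betas = [fit_slope(xs, ys, steps) for steps in (10, 100, 1000)]
print(betas)  # each value is larger than the one before; there is no finite maximum
```

Flagged batches only cause separation if the batch itself (or something that determines it) enters the model as a covariate, so the check is a diagnostic rather than a reason to drop rows.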

The question "How to deal with perfect separation in logistic regression?" contains much valuable advice, especially in the answer by scortchi.

mdewey