Suppose we have a categorical field in our data set with two classes. The field contains 5% missing values. The remaining 95% split into 47.4% class A and 47.6% class B (both as shares of the total). Normally, we would use the mode and impute the missing values with the most frequent class. But since the two classes have almost the same frequency here, would it make sense to impute all 5% of the values with class B? What would be the best approach to handle such a situation?

A Suresh
  • But why does that matter? What if the two classes have completely different distributions, why would you impute using only data from B? – user2974951 Dec 13 '21 at 13:12
  • The field is categorical. As per the question, the mode is class B (accounting for 47.6% of the total data). Distributions are used for continuous variables. – A Suresh Dec 13 '21 at 13:23
  • My bad, I understood that you wanted to impute values from other variables based on the group variable. Distributions are not defined only for continuous variables; they are defined for discrete variables as well. As for your original question, the best would be to not impute anything at all, if possible. Your two classes are practically equal in size, so you would likely be doing yourself a disservice by imputing them with "B". – user2974951 Dec 13 '21 at 13:30
  • Thanks for answering and also for pointing out that categorical variables also have distributions. I know about Bernoulli's distribution, but it somehow didn't occur to me before. I guess you could write a separate answer elaborating your point... – A Suresh Dec 14 '21 at 09:41

1 Answer

In my personal opinion, imputation very often crosses the boundary into "simulating knowledge we do not have" territory. This sounds like it may well be such a case.

My first impulse would be to use three possible values in your categorical field: the two you have and "missing". Then re-train your model. An advantage is that then you can also apply your model to cases where this field is missing "in production".
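
As a rough illustration of the three-value idea, here is a minimal sketch in plain Python. The toy data (1,000 rows in the question's proportions), the `"missing"` label, and the helper name `add_missing_category` are all assumptions made up for this example, not anything from the original data set.

```python
from collections import Counter

# Hypothetical label for the third category; any value that cannot
# collide with a real class name would do.
MISSING = "missing"


def add_missing_category(values):
    """Replace None entries with an explicit 'missing' category."""
    return [MISSING if v is None else v for v in values]


# Toy data mirroring the question: 47.4% A, 47.6% B, 5% missing.
values = ["A"] * 474 + ["B"] * 476 + [None] * 50

cleaned = add_missing_category(values)
counts = Counter(cleaned)
print(counts)
```

The field then has three levels, and nothing about the 5% has been guessed.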

Whichever way forward you decide on: do a sensitivity analysis. Fill all missing values with one category, then with the other. How much do the predictions change? If your method of addressing the missingness (which could be imputation or something else, per above) has a major impact on your evaluation metric, then this is a point where you may want to invest more resources - either in collecting the missing data, or in finding some workaround.
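
A minimal sketch of such a check, again in plain Python on made-up toy data in the question's proportions. No model or evaluation metric is specified here, so the class shares stand in for "the metric"; that substitution, and the helper names `fill_missing` and `class_shares`, are assumptions for illustration only.

```python
from collections import Counter


def fill_missing(values, fill):
    """Return a copy of values with every None replaced by `fill`."""
    return [fill if v is None else v for v in values]


def class_shares(values):
    """Share of each class among all rows."""
    counts = Counter(values)
    total = len(values)
    return {label: count / total for label, count in counts.items()}


# Toy data mirroring the question: 47.4% A, 47.6% B, 5% missing.
values = ["A"] * 474 + ["B"] * 476 + [None] * 50

# Two extreme scenarios: all missing values are A, or all are B.
shares_if_a = class_shares(fill_missing(values, "A"))
shares_if_b = class_shares(fill_missing(values, "B"))

# How far the share of class B moves between the two scenarios.
shift_in_b = shares_if_b["B"] - shares_if_a["B"]

print(shares_if_a)
print(shares_if_b)
print(round(shift_in_b, 3))
```

With these toy numbers the share of B moves by five percentage points between the two fills, and the majority class flips from A to B, so any conclusion that rests on which class is "more common" depends on the fill.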

Stephan Kolassa
  • I forgot to mention that I am not using the data for making predictions. I only want to clean the data for doing some analysis in MySQL. – A Suresh Dec 14 '21 at 09:45