Dummy variable method for missing data in ML/predictive models

Question

I'm looking for references on the use of zero-imputation with dummy-variable augmentation in the context of predictive models and MNAR (missing not at random) missingness. The idea is to impute zero for every missing datum and to add a column to the design matrix, a missingness indicator, for each variable that has been imputed this way. The average effect of the missingness mechanism is then picked up by the indicator, and no signal is transmitted by the zero-imputed value.
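To make the construction concrete, here is a minimal sketch in NumPy. The function name `zero_impute_with_indicators` and the toy matrix are illustrative assumptions, not taken from any reference; missing entries are assumed to be encoded as NaN.

```python
import numpy as np

def zero_impute_with_indicators(X):
    """Fill NaNs with 0 and append one 0/1 indicator column per column that had NaNs.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                      # boolean mask of missing entries
    cols_with_missing = missing.any(axis=0)    # which columns need an indicator
    X_filled = np.where(missing, 0.0, X)       # zero-imputation
    indicators = missing[:, cols_with_missing].astype(float)
    return np.hstack([X_filled, indicators]), np.flatnonzero(cols_with_missing)

# Toy data: columns 0 and 1 each have one missing value, column 2 is complete.
X = np.array([[1.0, np.nan, 5.0],
              [np.nan, 2.0, 6.0],
              [3.0, 4.0, 7.0]])

X_aug, imputed_cols = zero_impute_with_indicators(X)
print(X_aug)
print(imputed_cols)
```

The complete column gets no indicator, so the design matrix grows only by the number of variables that actually contain missing values. As far as I know, scikit-learn's `SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)` produces the same kind of augmented matrix.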

I'm curious how this works in tree-based methods (I imagine that it doesn't?), in penalized regression, and in neural nets. This method has the obvious appeal of being automatic and low-cost in the context of algorithms that are robust to large numbers of variables (if it works).

I'm aware that this creates biased coefficients in the context of inferential statistics.

It seems to me that it *should* work with tree-based methods, because the dummy variable allows the trees to be split on the "missing / not missing" variable. Maybe better with random forests than GBM, which doesn't build deep trees as a matter of design, and is therefore less likely to split on the missingness variables. — jbowman, Jan 29 '19 at 20:42
Yeah I go back and forth thinking about it and want to see if someone has looked into it rigorously. — generic_user, Jan 29 '19 at 20:56
Agreed it should work, but you can also just assign a dummy number in a tree, e.g. -999 (XGBoost's data interface lets you declare such a value as the marker for missing data): https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface — seanv507, Jan 29 '19 at 20:57
My reservation with the tree-based method is that there would be a default towards lumping the imputees with the true zeros (or true whatevers if you use -999 or something else). If the tree doesn't split on the missingness variable, then that'll bias the predictions. — generic_user, Jan 29 '19 at 21:00
@seanv507 - the problem with the dummy number in the tree approach is that there may be a better split in the variable than {-999, everything else} even though the -999 values are technically invalid, so the better split will be used in part *because* of the invalid values. However, this can't happen with the dummy variable, as there are only two values. Still, maybe that doesn't happen often. — jbowman, Jan 29 '19 at 21:15
For regularisation purposes, using the mean appears better, since the coefficient on the missingness indicator then encodes the difference from the mean, and we penalize those deviations. — seanv507, Jan 06 '20 at 07:17
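The point about the fill value can be checked numerically in the unpenalized case. The sketch below uses simulated data (the data-generating process, sample size, and all variable names are assumptions for illustration) and fits ordinary least squares with an intercept, the imputed variable, and its missingness indicator. With zero-fill and with mean-fill the fitted values are identical; only the indicator coefficient changes, shifting by the slope times the fill value. That shift is what a ridge or lasso penalty would act on.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data (assumed for illustration): x is far from zero, 30% of it is missing.
x = rng.normal(loc=5.0, scale=1.0, size=n)
m = (rng.random(n) < 0.3).astype(float)        # 1 where x is missing
y = 2.0 * x + 1.5 * m + rng.normal(scale=0.1, size=n)

x_obs_mean = x[m == 0].mean()
x_zero = np.where(m == 1, 0.0, x)              # zero-imputation
x_mean = np.where(m == 1, x_obs_mean, x)       # mean-imputation

def ols(x_filled):
    """Least-squares fit of y on [1, x_filled, m]; returns coefficients and fitted values."""
    design = np.column_stack([np.ones(n), x_filled, m])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

coef_zero, pred_zero = ols(x_zero)
coef_mean, pred_mean = ols(x_mean)

print("indicator coefficient, zero-fill:", coef_zero[2])
print("indicator coefficient, mean-fill:", coef_mean[2])
print("same fitted values:", np.allclose(pred_zero, pred_mean))
```

Because the mean-filled column equals the zero-filled column plus the mean times the indicator, the two designs span the same space; the choice of fill value only matters once the indicator coefficient is penalized or, in trees, once the filled value competes with real values for split points.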
