I have a very specific situation involving missing data in a regression tree (actually part of a random forest) which is not covered by the most popular related questions:
How do decision tree learning algorithms deal with missing values (under the hood)
Why doesn't Random Forest handle missing values in predictors?
Here is a contrived example where the response Y is the expenditure of a customer at their next transaction. The average_spend variable, which measures the average amount spent by the customer over their previous transactions, is missing whenever the customer has never shopped with us before. An example of the data would be:
   Y    prev_customer  average_spend  sale_method  gender
1  10   FALSE          NA             offline      ...
2  100  TRUE           100            online       ...
3  10   FALSE          NA             offline      ...
4  100  TRUE           100            online       ...
I would like a splitting rule for average_spend that allocates all missing values to one of the two child nodes. Intuitively, the tree should be able to handle the dichotomy between first-time and returning customers: if it does not split on prev_customer first, then separating the missing values of average_spend will be useful later. Given this, would it be OK to impute an arbitrary value for average_spend?
However, I have been unable to find such a rule in any of the references I have seen.
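To make the rule I have in mind concrete, here is a minimal Python sketch. It is my own illustration, not taken from any reference or library: the function names `sse` and `best_split_with_missing` are hypothetical, missing values are represented by `None`, and the split criterion is assumed to be the usual sum of squared errors for a regression tree. At each candidate threshold it tries sending all missing values to the left child and then to the right child, and it also considers a pure "missing vs. observed" split.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_with_missing(x, y):
    """Find the split on x that minimises the total SSE of the two children.

    Missing x values (None) are always sent, as a block, to one child.
    Returns (total_sse, threshold, missing_side); threshold is None when the
    best split is simply "missing vs. observed". Returns None if no split exists.
    """
    obs = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    miss_y = [yi for xi, yi in zip(x, y) if xi is None]

    # Candidate thresholds: midpoints between consecutive distinct observed values.
    values = sorted({xi for xi, _ in obs})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]

    candidates = []
    if miss_y and obs:
        # Pure "missing vs. observed" split.
        candidates.append((None, "left", miss_y, [yi for _, yi in obs]))
    for t in thresholds:
        left = [yi for xi, yi in obs if xi <= t]
        right = [yi for xi, yi in obs if xi > t]
        candidates.append((t, "left", left + miss_y, right))   # missing go left
        candidates.append((t, "right", left, right + miss_y))  # missing go right

    best = None
    for t, side, left, right in candidates:
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, t, side)
    return best


# The toy data from above: average_spend is missing for first-time customers.
average_spend = [None, 100, None, 100]
Y = [10, 100, 10, 100]
print(best_split_with_missing(average_spend, Y))  # (0.0, None, 'left')
```

On the toy data the only available split on average_spend is "missing vs. observed", which separates first-time from returning customers perfectly. This is the behaviour I would like the tree to be able to recover.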