
I have a very specific situation involving missing data in a regression tree (actually part of a random forest) which is not covered by the most popular related questions:

How do decision tree learning algorithms deal with missing values (under the hood)

Why doesn't Random Forest handle missing values in predictors?

Here is a contrived example where the response Y is the expenditure of a customer at their next transaction. The average_spend variable, which measures the average amount the customer spent over their previous transactions, is missing when the customer has never shopped with us before. An example of the data would be:

        Y prev_customer average_spend sale_method gender
    1  10         FALSE            NA     offline    ...
    2 100          TRUE           100      online    ...
    3  10         FALSE            NA     offline    ...
    4 100          TRUE           100      online    ...

I would like a splitting rule for average_spend that allocates all missing values to one of the two child nodes. Intuitively, I feel the tree should be able to handle the dichotomy between first-time and returning customers: if it doesn't split first on prev_customer, then a split that sends all the missing values one way will be useful later on. In this sense, would it be OK to impute an arbitrary value into average_spend?

I have been unable to find such a rule in any of the references I have seen, though.
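
To make the question concrete, here is a minimal sketch of what I mean, with rpart standing in for a single tree of the forest and made-up numbers:

    library(rpart)

    # Toy data in the spirit of the table above: average_spend is NA exactly
    # when prev_customer is FALSE (the numbers are made up).
    set.seed(1)
    n <- 200
    prev_customer <- sample(c(TRUE, FALSE), n, replace = TRUE)
    average_spend <- ifelse(prev_customer, rnorm(n, mean = 100, sd = 10), NA)
    Y <- ifelse(prev_customer, average_spend + rnorm(n, sd = 5), 10 + rnorm(n, sd = 2))
    df <- data.frame(Y, prev_customer = factor(prev_customer), average_spend)

    # Impute an arbitrary constant and see what the tree does with it.
    df$average_spend[is.na(df$average_spend)] <- 0
    fit <- rpart(Y ~ prev_customer + average_spend, data = df)
    print(fit)
    # With a constant below all the real values, any split average_spend < c
    # (for 0 < c < the minimum real spend) isolates exactly the new customers,
    # so the imputed column can mimic a split on prev_customer.

The catch, as the comments below discuss, is what happens when the imputed constant falls inside the range of the real values.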

Alex
  • A suggestion in a different direction: why don't you use a package like xgboost? I think it handles NAs in the way you want (not 100% sure about the underlying implementation, but when one prints the tree there are NA bins explicitly mentioned; see the second sketch below this thread)... I was thinking about setting them to some weird value like -1, which is not a good idea because a split like avg_spend<50 will always include all of them. However, replacing the NAs by the 'average avg_spend' should do fine, no? If the tree decides that it is a good idea to split by prev_customer then it will automatically... – Fabian Werner May 08 '18 at 09:22
  • ...ignore average_spend on the new-customer side of the split, as average_spend is then constant and will never be used to split again. However, there is the possibility that the tree will decide to split on average_spend before it splits on prev_customer... – Fabian Werner May 08 '18 at 09:23
  • The situation you describe after imputation by the average is exactly what I fear will happen. I'll have a look and see whether I can work out the algorithm in xgboost; I assume you can fit unboosted trees. Essentially, you could also replace the NAs with +Inf or -Inf, except that you don't know which replacement is best until you build the tree. Maybe you could duplicate the variable, with +Inf in place of the NAs in one copy and -Inf in the other (see the first sketch below this thread). – Alex May 09 '18 at 01:54
  • I think both ways (+ and - Inf) do not resemble what you actually want. You want to handle NAs specially, while those solutions will make the tree include all of them in a normal split... Anyhow: if you are not satisfied with imputation by the mean, why don't you use some fancy EM stuff (the R package Amelia)? After all, I personally view NA imputation as a hyperparameter (as you said): one cannot know which method works best before one has tried them. However, since you have a clear idea about how you want to handle the NAs: shouldn't you build two separate models in the first place? – Fabian Werner May 09 '18 at 10:21
  • I.e. the future spending of a customer will be closely related to their prior spending behaviour. However, when you do not have any prior information, you should not attempt to treat them with the same model/features/... Maybe you should even exclude them from the whole problem: what article could you reasonably recommend if you do not know anything about the customer? – Fabian Werner May 09 '18 at 10:23
  • How would imputation by +Inf and -Inf not do what I want? I thought the tree would choose to split on either the +Inf variable or the -Inf variable, if either of them contains enough information to make a good split. The example here is contrived, but in principle the missingness might not be as clear-cut. For example, we might have previous spend but missing values for `Sex`: it may still be useful to split on Sex, without dropping the NA rows, and then split on previous spend. – Alex May 10 '18 at 02:21
  • I was thinking about it in a "philosophical" way. This also depends on how the library orders numbers, but I guess that -Inf < x for every real x makes sense. So when the tree decides to make some split, then if you impute by -Inf, one of the sides (average_spend < splitValue) will contain all the NA rows. The tree can't do anything about it and cannot separate the NA rows from the 'real' data rows, except by making the split at a level so low that it excludes all values except -Inf. However, I guess that the prediction will depend on different features when it is a new customer than when it is – Fabian Werner May 10 '18 at 09:34
  • a known one. That is why I would at least test training two different models: one for known customers that have a prior spending behaviour and one model for new customers. – Fabian Werner May 10 '18 at 09:35
  • > "So when the tree decides to do some split then if you impute by -Inf, one of the sides (average_spend < splitValue) will contain all NA rows." Correct, and if this is not the best split, it can use the +Inf variable to put NA rows whenever average_spend > splitvalue. I am not just trying to separate out NA's from no NA's. I want NA's to be included in one branch, instead of being discarded. – Alex May 10 '18 at 23:29

0 Answers