
I am using a tree-based method (specifically, a random forest) to model the quality of sunsets based on weather measurements. One available feature is the height of the clouds; when there are no clouds, the value is set to 99999. My impression is that keeping the values at 99999 (or recoding them to 0 or -999) will bias the predictions, since a tree will treat 99999 as a real physical height when those values should effectively be ignored. I've considered adding a dummy variable to indicate whether there are clouds or not, but if I also want to include cloud height, which I think could be relevant to the quality of sunsets, I feel I still need to do something with the 99999s. Is there an accepted way of handling this type of intentionally missing data with tree-based methods?
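For concreteness, here is a minimal sketch of the dummy-variable idea, assuming pandas and a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Hypothetical column name; 99999 is the "no clouds" sentinel.
df = pd.DataFrame({"cloud_height": [1200.0, 99999.0, 300.0, 99999.0]})

# Binary indicator for cloud presence.
df["clouds_present"] = (df["cloud_height"] != 99999).astype(int)

# This still leaves the question: what to do with the 99999s themselves?
df["cloud_height"] = df["cloud_height"].replace(99999, np.nan)
```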

I've found a few questions related to this issue, but none have a solution to my problem:

Dummy variable method for missing data in ML/predictive models

How to deal with intentionally missing data

How should I define missing values due to skip questions in SPSS?

  • How do you define *quality of sunsets*? – kjetil b halvorsen Nov 11 '20 at 13:44
  • I have two metrics: human classification and number of Instagram posts with sunset related keywords. – Matt Stevans Nov 11 '20 at 15:28
  • So it is a kind of *æsthetic* classification, subjective, but is it coded numerically? – kjetil b halvorsen Nov 11 '20 at 15:40
  • Yes. The user classification metric is a boolean (0 for poor and 1 for good). The numbers of Instagram posts are integers. – Matt Stevans Nov 11 '20 at 15:45
  • The way to do this is not to redefine the cloud height, but to create a binary variable that is true when the cloud height is 99999 and false otherwise. The random forest will then handle it appropriately (a minimal sketch follows these comments). – EngrStudent Nov 11 '20 at 16:09
  • @kjetilbhalvorsen In my experience, at least with kids, humans make great watches but they despise being recognized as such. Just because it is a nebulous aesthetic does not mean that it cannot be characterized and have some level of consistent predictability. Zen and the Art of Motorcycle Maintenance is an interesting read there. – EngrStudent Nov 11 '20 at 16:10
  • @EngrStudent I considered this as a possibility but was not 100% sure the random forest would handle it appropriately. I get that the binary variable will capture the difference, but I'm still concerned that any tree that uses the height and not the binary feature will be biased. Do any sources come to your mind that would explain this property of random forests in more detail? – Matt Stevans Nov 11 '20 at 16:51
  • @MattStevans - as long as you have a decent number of trees, and aren't using an RF version from 2005, this works. Each tree gets an incomplete set of rows and columns, and these aren't the same for each CART; some trees are made with both features, some with one, some with the other. Their outputs, from a diverse set of learners, make for more robust generalization. You can also down-select important columns using RF with Boruta, then use those in an alternate learner, which I have done to good effect several times. – EngrStudent Nov 11 '20 at 21:08
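The following is a minimal sketch of the indicator-variable approach from the comments, using scikit-learn and hypothetical synthetic data (the labels here are placeholders, not a real sunset model):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: cloud heights with a 99999 "no clouds" sentinel.
height = rng.uniform(0, 8000, size=500)
height[rng.random(500) < 0.3] = 99999.0
good_sunset = rng.integers(0, 2, size=500)  # placeholder labels

# Keep the raw height and add the binary presence indicator side by side.
clouds_present = (height != 99999).astype(float)
X = np.column_stack([height, clouds_present])

# With many trees and feature subsampling, some trees split on the raw
# height, some on the indicator, and the ensemble averages over both.
rf = RandomForestClassifier(n_estimators=500, max_features=1, random_state=0)
rf.fit(X, good_sunset)
```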

1 Answer


Many tree-model implementations treat missing values separately: they choose an optimal split among the non-missing values, and then decide which branch the missing values should follow. That gives the greatest flexibility, which may or may not be best, depending on the bias-variance tradeoff in the rest of your setup.
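Not every implementation supports this natively. As a sketch of what it looks like when one does (here scikit-learn's HistGradientBoostingClassifier, a gradient-boosted tree ensemble rather than a random forest, on hypothetical synthetic data):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: "no clouds" rows encoded as NaN instead of 99999.
height = rng.uniform(0, 8000, size=400)
height[rng.random(400) < 0.25] = np.nan
good_sunset = rng.integers(0, 2, size=400)  # placeholder labels

# At each learned split, NaN rows are routed down whichever branch improves
# the loss -- the "decide which path the missing values go" behavior above.
clf = HistGradientBoostingClassifier(random_state=0)
clf.fit(height.reshape(-1, 1), good_sunset)
clf.predict(np.array([[np.nan], [1500.0]]))
```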

Note too that tree models (except for extremely-randomized trees) don't take the scale of the variables into account at all; all that matters is that 99999 is larger than every other value of the feature. So using 99999 or -999 just forces those rows to be treated like other large or small (respectively) values, rather than being sent to either side depending on the node, as NAs would be. In your context, keeping 99999 might make sense: sufficiently high clouds aren't really in the way of a sunset?
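To see the scale-invariance concretely, here is a small sketch on hypothetical synthetic data: two different sentinels that both sit above every real cloud height induce the same partition of the training set, so the fitted trees agree exactly:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
height = rng.uniform(0, 8000, size=300)
no_clouds = rng.random(300) < 0.3
y = np.where(no_clouds, 0.0, np.sin(height / 1000)) + rng.normal(0, 0.05, 300)

# Same data, two different "no clouds" sentinels, both above the real max.
x_a = np.where(no_clouds, 99999.0, height).reshape(-1, 1)
x_b = np.where(no_clouds, 1e9, height).reshape(-1, 1)

t_a = DecisionTreeRegressor(random_state=0).fit(x_a, y)
t_b = DecisionTreeRegressor(random_state=0).fit(x_b, y)

# Only the ordering of feature values matters, so the two trees split the
# training samples identically and their fitted values coincide.
assert np.allclose(t_a.predict(x_a), t_b.predict(x_b))
```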

See also:
(DS.SE) What is the difference between filling missing values with 0 or any other constant term like -999?
How do decision tree learning algorithms deal with missing values (under the hood)?

Ben Reiniger
  • Thanks, I'll look these links over. Your comment got me thinking, and I found that 99999s for the height of the lowest cloud layer have a good-sunset rate of 0%, which is closer to the rate when the height is zero than to the rate at the max of ~8k. Maybe this suggests imputing the 99999s as 0 or -999 so they are split with the heights of zero. Unfortunately, I can't apply this imputation to the higher cloud-layer height features, which are also set to 99999 when those layers are not present: their good-sunset rate at 99999 is not zero, so imputing them to 0 doesn't help in the same way. – Matt Stevans Nov 11 '20 at 16:44