
I recently reviewed the XGBoost algorithm and noticed that it can handle missing data (without requiring imputation) in the training phase. Can XGBoost also handle missing data (without requiring imputation) when it is used to predict new observations, or is it necessary to impute the missing values first?

Thanks in advance.

p_sutherland
Ricardo UES

1 Answer


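xgboost decides at training time, for each split, whether missing values go to the left or the right child node, choosing whichever direction minimises the loss. The same learned direction is applied to missing values at prediction time, so no imputation is needed. If a feature has no missing values at training time, xgboost defaults to sending any new missing values to the right node.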
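To make the idea concrete, here is a minimal pure-Python sketch of a single split that learns a default direction for missing values. It is a simplification, not xgboost's actual implementation: it compares plain squared error rather than xgboost's gradient-based gain, and the function names (`choose_default_direction`, `route`) and the toy data are made up for illustration.

```python
import math


def squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def choose_default_direction(x, y, threshold):
    """Try sending missing rows left, then right; keep the lower-loss option."""
    left = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi < threshold]
    right = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi >= threshold]
    missing = [yi for xi, yi in zip(x, y) if math.isnan(xi)]

    loss_if_left = squared_error(left + missing) + squared_error(right)
    loss_if_right = squared_error(left) + squared_error(right + missing)

    # On a tie (including "no missing rows seen in training") fall back to
    # "right", mirroring the default described in the answer.
    return "left" if loss_if_left < loss_if_right else "right"


def route(xi, threshold, default):
    """Route one value at prediction time, using the learned default for NaN."""
    if math.isnan(xi):
        return default
    return "left" if xi < threshold else "right"


nan = float("nan")
x_train = [1.0, 2.0, nan, 8.0, 9.0, nan]
y_train = [1.0, 1.2, 1.1, 5.0, 5.2, 0.9]  # missing rows look like the low group

default = choose_default_direction(x_train, y_train, threshold=5.0)
new_row_direction = route(nan, threshold=5.0, default=default)
```

In this toy data the rows with a missing `x` have targets close to the low-`x` group, so the sketch learns to send missing values left, and a new observation with a missing value follows the same path without any imputation.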

If there is signal in the pattern of your missing values, then the model essentially fits it.

Be careful if the missing values in your scoring data are distributed differently from those in your training data. xgboost's missing-value handling is convenient, but it doesn't protect against masking.
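One simple way to check for this is to compare per-feature missing rates between the two datasets before scoring. The sketch below is only an illustration: the helper names, the dict-of-lists data layout, and the 0.1 tolerance are all assumptions, not anything xgboost provides.

```python
import math


def missing_rate(column):
    """Fraction of entries in a column that are None or NaN."""
    n_missing = sum(
        1 for v in column
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return n_missing / len(column)


def missing_rate_shift(train, score, tolerance=0.1):
    """Return {feature: (train_rate, score_rate)} for features whose
    missing rate differs by more than `tolerance` (an arbitrary cutoff)."""
    flagged = {}
    for name in train:
        train_rate = missing_rate(train[name])
        score_rate = missing_rate(score[name])
        if abs(train_rate - score_rate) > tolerance:
            flagged[name] = (train_rate, score_rate)
    return flagged


nan = float("nan")
train = {"age": [30.0, nan, 41.0, 25.0], "income": [1.0, 2.0, 3.0, 4.0]}
score = {"age": [nan, nan, 52.0, nan], "income": [5.0, 6.0, 7.0, 8.0]}

flagged = missing_rate_shift(train, score)
```

Here `age` is flagged because it is missing in 25% of training rows but 75% of scoring rows, which is the kind of shift where the learned default direction may no longer be appropriate.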

Source: this answer

Dex Groves