XGBoost can handle missing data in the forecasting phase

Question

I recently reviewed the XGBoost algorithm and noticed that it can handle missing data (without requiring imputation) in the training phase. Can XGBoost also handle missing data (without requiring imputation) when it is used to forecast new observations, or is it necessary to impute the missing values first?

Thanks in advance.

score 24 · Answer 1 · edited May 23 '17 at 12:39

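xgboost decides at training time, for each split, whether missing values go into the right or left node. It chooses whichever direction minimises the loss. At prediction time, a missing value simply follows the direction learned for that split, so no imputation is needed. If there are no missing values at training time, it defaults to sending any new missing values to the right node.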
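To make the routing concrete, here is a toy sketch in plain Python. It is not xgboost's actual implementation: it assumes a single split at a fixed threshold and squared-error loss, and the names `fit_stump` and `predict`, as well as the data, are made up for illustration. It only shows the idea of learning a default direction during training and reusing it for new observations.

```python
import math


def mean(values):
    """Arithmetic mean (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(xs, ys, threshold):
    """Fit one split at a fixed threshold and learn where missing values go."""
    left = [y for x, y in zip(xs, ys) if not math.isnan(x) and x < threshold]
    right = [y for x, y in zip(xs, ys) if not math.isnan(x) and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if math.isnan(x)]

    # Try both directions for the missing rows and keep the cheaper one.
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    # Ties (including "no missing values seen in training") go right,
    # mirroring the default described in the answer. This is an assumption
    # of the sketch, not a statement about xgboost internals.
    default_left = loss_if_left < loss_if_right

    if default_left:
        left = left + missing
    else:
        right = right + missing
    return {
        "threshold": threshold,
        "default_left": default_left,
        "left_value": mean(left),
        "right_value": mean(right),
    }


def predict(stump, x):
    """Route a new observation; a missing value follows the learned default."""
    if math.isnan(x):
        go_left = stump["default_left"]
    else:
        go_left = x < stump["threshold"]
    return stump["left_value"] if go_left else stump["right_value"]


nan = math.nan

# Training data where the rows with a missing feature look like the "low" group.
stump = fit_stump([1.0, 2.0, nan, 8.0, 9.0, nan],
                  [1.0, 1.2, 1.1, 5.0, 5.2, 0.9],
                  threshold=5.0)
pred_missing = predict(stump, nan)  # follows the learned direction (left)

# Training data without any missing values: the default falls to the right.
stump_no_missing = fit_stump([1.0, 2.0, 8.0, 9.0],
                             [1.0, 1.2, 5.0, 5.2],
                             threshold=5.0)
pred_missing_default = predict(stump_no_missing, nan)
```

In the first stump the missing rows resemble the low-valued group, so the learned default is "left" and a new missing value gets that group's prediction. In the second stump nothing was learned about missing values, so the prediction for a missing input depends entirely on the fallback direction.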

If there is signal in the distribution of your missing values, then the model essentially fits it.

Be careful if your scoring data has its missing values distributed differently from your training data. xgboost's missing-value handling is convenient but doesn't protect against masking.
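One way to act on this warning is to compare the per-feature missing rate in the training and scoring data before predicting. The sketch below is plain Python and purely illustrative: the helper names, the column-dictionary layout, the sample data, and the 0.1 tolerance are all assumptions, not anything xgboost provides.

```python
import math


def missing_rate(column):
    """Fraction of entries that are None or NaN."""
    n_missing = sum(
        1 for v in column
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return n_missing / len(column)


def missing_rate_shift(train, score, tolerance=0.1):
    """Return features whose missing rate differs by more than `tolerance`.

    `train` and `score` map feature names to lists of values.
    The tolerance is an arbitrary illustrative threshold.
    """
    flagged = {}
    for name in train:
        train_rate = missing_rate(train[name])
        score_rate = missing_rate(score[name])
        if abs(train_rate - score_rate) > tolerance:
            flagged[name] = (train_rate, score_rate)
    return flagged


nan = math.nan

# Hypothetical data: "age" is missing far more often at scoring time.
train = {"age": [30.0, nan, 45.0, 52.0], "income": [1.0, 2.0, 3.0, 4.0]}
score = {"age": [nan, nan, nan, 41.0], "income": [1.5, 2.5, 3.5, 4.5]}

flagged = missing_rate_shift(train, score)
```

A flagged feature does not mean the predictions are wrong, only that the default directions learned in training were fit on a different pattern of missingness than the one the model now sees.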

Source: this answer
