
I'm trying to use the LightGBM package in Python for a multi-class classification problem, and I'm baffled by its results.

For a minority of the population, LightGBM predicts a probability of 1 (absolute certainty) that the individual belongs to a specific class.

I am explicitly using a log-loss function, so if the algorithm is wrong with even one of these folks, my loss will be infinite.

I tried tweaking the parameters, changing the features, switching the boosting to random forest, etc. but it seems impossible to avoid this result.

Strangely enough, this issue appears specific to LightGBM: I tried other packages like XGBoost, CatBoost, H2O, etc., and they all produce probabilities that exclude 0 and 1.

Is there something I'm missing? Maybe a parameter I'm not setting right?

Or maybe it is a bug in LightGBM?

Example:

import lightgbm

# train_data is a lightgbm.Dataset built from the training features and labels
param = {'objective': 'multiclass', 'metric': 'multi_logloss', 'num_class': 21}
num_round = 20
model = lightgbm.train(param, train_data, num_boost_round=num_round)

# predict returns an (n_samples, num_class) array of class probabilities
preds = model.predict(X_test[features])

# count how many predicted probabilities are exactly 1
(preds == 1).sum()

Results: 70 individuals have one of their probabilities set to 1.

Guillaume F.
  • You might want to add an L1 or L2 penalty to avoid this behaviour. LightGBM has a couple of parameters to tune. I'd suggest choosing good values for the most important parameters using cross-validation. – Michael M Aug 24 '20 at 15:53
  • Can you output all the default parameters of lightgbm (vs xgboost)? – seanv507 Aug 24 '20 at 16:06
  • @MichaelM thank you! L1 & L2 regularization dramatically improves the performance of my model. – Guillaume F. Aug 26 '20 at 01:39

1 Answer


Floating point arithmetic has limited precision. The standard inverse logit (aka "sigmoid") function $\sigma(z)=\frac{1}{1+\exp(-z)}$ is used to compute probabilities from the real-valued scores $z$. For $z$ large and positive, $\exp(-z)$ is smaller than the spacing between representable numbers near 1, so $\sigma(z)$ evaluates to exactly 1. For $z$ large and negative, the result rounds to 0.
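A minimal NumPy sketch of that saturation; in double precision the rounding to 1 already kicks in around $z \approx 37$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(40.0))             # 1.0 exactly: exp(-40) ~ 4e-18 vanishes when added to 1 in float64
print(1.0 - sigmoid(40.0) == 0)  # True: the gap between the prediction and 1 has been rounded away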

My hypothesis that predicted probabilities of exactly 1 originate in floating-point roundoff is consistent with OP's remark in the comments that, with L1 or L2 regularization, the predicted probabilities are no longer exactly 1. L1 and L2 regularization shrink the leaf weights; whatever penalty OP used, the shrinkage is evidently enough to keep the scores $z$ from growing so large that numerical roundoff in $\sigma(z)$ yields 1.
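For concreteness, a sketch of adding those penalties, reusing the setup from the question; lambda_l1 and lambda_l2 are LightGBM's penalty parameters, and the values below are illustrative rather than the ones OP tuned:

import lightgbm

# train_data: the lightgbm.Dataset from the question
param = {
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 21,
    'lambda_l1': 1.0,  # L1 penalty on leaf weights (illustrative value, not tuned)
    'lambda_l2': 1.0,  # L2 penalty on leaf weights (illustrative value, not tuned)
}
model = lightgbm.train(param, train_data, num_boost_round=20)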

This loss of precision is only a real problem if you need to distinguish between $1 - \epsilon$ and $1 - \frac{\epsilon}{2}$, etc. If that is the case, you can work on the scale of the scores $z$ instead of probabilities, without losing precision.
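A sketch of what that could look like, reusing the model from the question: raw_score=True returns the untransformed per-class scores (which the multiclass objective maps to probabilities with a softmax), and a stable log-softmax such as scipy's keeps everything on the log scale. The integer test labels y_test are an assumption here, since they don't appear in the question.

import numpy as np
from scipy.special import log_softmax

# model, X_test, features: as in the question; y_test: integer class labels (assumed, not shown there)
raw = model.predict(X_test[features], raw_score=True)  # (n_samples, num_class) untransformed scores
log_probs = log_softmax(raw, axis=1)                    # log-probabilities computed directly from the scores

# Log-loss evaluated on the log scale stays finite as long as the scores are finite,
# even where the corresponding probability would round to 0 or 1.
loss = -log_probs[np.arange(len(y_test)), y_test].mean()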

Sycorax
  • Other machine learning algorithms don't have this problem, though. Why is it limited to LGBM? – Guillaume F. Aug 26 '20 at 01:39
  • One hypothesis is that LGBM is using a different precision. What precision is `preds`? Another is that the LGBM model is, somehow, different from the xgboost or CatBoost models; perhaps the differences in tree construction and weight computation accumulate into floating-point roundoff error. Are the ensembles identical? – Sycorax Aug 26 '20 at 01:47
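A quick way to check the points raised in this comment (preds is the probability array from the question):

# preds: the probability array from the question
print(preds.dtype)                      # LightGBM's predict returns float64 by default
print(preds.max(), (preds == 1).sum())  # largest probability and how many entries are exactly 1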