
I've added new attributes to a binary GLM model. AUC climbed to 98% and logistic loss decreased to 0.45. The training set has ~50 cases.

I can see that the predicted probabilities are extremely close to 0 and 1 (the max-F2 threshold from 8-fold cross-validation is very close to 1).

My model has 11 attributes, including g1. Below are the model predictions for varying values of g1 (g1.val) with the other attributes held fixed:

+---------+--------+---------------+----------+--------+
| predict | p.NORM |          p.C1 |   StdErr | g1.val |
+---------+--------+---------------+----------+--------+
| NORM    |      1 |  2.200799e-37 | 445.9396 |     19 |
| NORM    |      1 |  3.609197e-37 | 452.2013 |     20 |
| NORM    |      1 |  5.918897e-37 | 459.9089 |     21 |
+---------+--------+---------------+----------+--------+
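For intuition, probabilities pinned this hard to 0 and 1 (together with huge standard errors) are the classic signature of (quasi-)separation. Here is a minimal numpy sketch, on made-up perfectly separable data (not my model), showing that the unpenalized logistic-regression coefficient grows without bound while the fitted probabilities saturate toward 0 and 1:

```python
import numpy as np

# Toy perfectly separable data: every x < 0 is class 0, every x > 0 is class 1.
# The unpenalized maximum-likelihood estimate has no finite optimum here.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

beta = 0.0
lr = 0.5
snapshots = {}
for it in range(1, 20001):
    p = 1.0 / (1.0 + np.exp(-beta * x))   # fitted P(y = 1 | x)
    beta += lr * np.sum((y - p) * x)      # gradient ascent on the log-likelihood
    if it in (100, 1000, 20000):
        snapshots[it] = (beta, p.min(), p.max())

# beta keeps growing (roughly like log(iterations)) instead of converging,
# and the fitted probabilities are pushed arbitrarily close to 0 and 1.
```

The coefficient never settles: each extra iteration makes the separating slope steeper, which is exactly why a separated (or nearly separated) GLM reports probabilities like 1 and 2.2e-37.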

When I remove the 2 new attributes (t1, t2), the predictions (raw probabilities) of the modified model look better:

+---------+-----------+--------------+----------+--------+
| predict |  p.NORM   |         p.C1 |   StdErr | g1.val |
+---------+-----------+--------------+----------+--------+
| NORM    | 0.999481  | 0.0005190334 | 3.068864 |     19 |
| NORM    | 0.9991453 | 0.0008547492 | 2.96949  |     20 |
| NORM    | 0.9985927 | 0.001407304  | 2.901418 |     21 |
+---------+-----------+--------------+----------+--------+

The predictions of the modified model are no longer extremely close to 0 and 1, but AUC decreased to 88% and logistic loss increased slightly.

What is the reason for such predicted probabilities change?

I don't think t1 and t2 are much more important than the other attributes. Also, I can't find a description of the StdErr column of h2o.predict() results in the documentation or examples.
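For reference: one common definition of a per-row prediction standard error in GLMs is the delta-method standard error of the linear predictor, sqrt(x' Σ x), where Σ is the coefficient covariance matrix. Whether H2O's StdErr column is computed this way is an assumption on my part; the sketch below only illustrates the mechanics, with a made-up covariance matrix for a one-predictor model:

```python
import numpy as np

# Assumption: StdErr is the delta-method SE of the linear predictor x'beta.
# cov_beta would come from the fitted model; these numbers are invented.
cov_beta = np.array([[4.0, 0.5],
                     [0.5, 1.0]])      # covariance of (intercept, slope)
x_row = np.array([1.0, 19.0])          # design row: intercept term, g1.val = 19

se_eta = np.sqrt(x_row @ cov_beta @ x_row)   # sqrt(x' Sigma x)
```

Under (near-)separation the coefficient covariance entries blow up, so this quantity explodes with them, which would explain StdErr ~450 for the 11-attribute model versus ~3 for the reduced one.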

  • overfitting, or you could be close to separation: https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression – kjetil b halvorsen Feb 05 '20 at 15:17
  • If it's overfitted, is there a significant chance that the second (simpler) model is not (still having good c-v metrics)? – MLearner Feb 06 '20 at 14:40
  • Do you mean that you have a strong fit to the training data with 11 variables and a slightly weaker fit to the training data when we remove two of those 11 variables? – Dave Feb 06 '20 at 14:54
  • Yes, something like that (assuming training set is rather small). – MLearner Feb 06 '20 at 15:07
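The remedy discussed at the question linked in the first comment is penalization. On the same kind of toy separable data, an L2 (ridge) penalty keeps the coefficient finite and the fitted probabilities strictly inside (0, 1), mirroring what a regularized H2O GLM (alpha = 0, lambda > 0) would do; the data, penalty strength, and learning rate below are all made up for illustration:

```python
import numpy as np

# Toy perfectly separable data again: x < 0 is class 0, x > 0 is class 1.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

lam = 1.0      # L2 penalty strength (assumed value, for illustration)
beta = 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-beta * x))
    # gradient of the penalized log-likelihood: likelihood term minus lam * beta
    beta += 0.1 * (np.sum((y - p) * x) - lam * beta)

# beta now converges to a finite value and the probabilities no longer
# saturate at exactly 0 and 1, even though the data are perfectly separable.
```

The contrast with the unpenalized fit is the point: separation makes the unpenalized estimate diverge, while even a modest penalty restores finite coefficients, moderate probabilities, and finite standard errors.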

0 Answers