If evaluation set is the same as training set, why would the evaluation error be different from training error?

Question

I understand the use of an evaluation set for parameter tuning and for detecting over-fitting in general. The examples in the evaluation set should be unseen and different from those in the training set.

However, in the following toy CatBoost regression problem, where I deliberately set the evaluation set to be identical to the training set, I do not understand why the computed evaluation error deviates from the training error over the iterations.

from catboost import CatBoostRegressor, Pool

cat_features = [0,]

train_data = [["b", 1,],
              ["b", 5,],
              ["e", 3,],
              ["c", 3,],
              ["d", 4,]]

train_labels = [1.,3.51, 0.43,0.45,0.55]
train_data = Pool(train_data, train_labels, cat_features=cat_features)
model = CatBoostRegressor(iterations=100, use_best_model=True, random_seed=1234)


model.fit(train_data, eval_set=train_data, verbose=True, plot=True)

The plot of the evaluation error (solid line) and the training error (dotted line) over the iterations shows the two curves diverging.

On the other hand, if I change the only categorical feature into a numerical feature (replacing a=1, b=2, c=3 and so on) as follows, the issue disappears, i.e., the evaluation error matches the training error at every iteration.

cat_features = []

train_data = [[2, 1,],
              [2, 5,],
              [5, 3,],
              [3, 3,],
              [4, 4,]]

train_labels = [1.,3.51, 0.43,0.45,0.55]
train_data = Pool(train_data, train_labels, cat_features=cat_features)
model = CatBoostRegressor(iterations=100, use_best_model=True, random_seed=1234)


model.fit(train_data, eval_set=train_data, verbose=True, plot=True)

I do not understand. I suspect this may be due to some randomness in categorical encoding. Could someone please help?

What exactly differs? You didn’t give us an example of what you’re describing, so it’s unclear what you mean. — Tim, Sep 21 '21 at 18:06

score 1 · Answer 1 · answered Sep 21 '21 at 19:44

It is probably the categorical encoding, yes; but it is not randomness.

CatBoost encodes its categoricals using an ordered (sliding) target encoding. During training, the encoded value for each row is the average target among the rows above it that have the same level. (There may be some smoothing or noise, but that's the general idea.) At prediction time there is no target anymore, so the encodings need to be fixed: each level gets its global target encoding, computed from the whole training set.

Your training scores are computed with the special row-by-row CatBoost encoding, while the evaluation scores are computed with the global target encoding. The same rows therefore reach the trees with different feature values, and the two errors differ.
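To make the difference concrete, here is a minimal sketch in plain Python that applies both encodings to the categorical column and labels from the question. It is not CatBoost's actual implementation: the function names, the smoothing formula `(sum + prior * weight) / (count + weight)`, the choice of the global mean as prior, and the use of the original row order are all simplifying assumptions made for illustration.

```python
def ordered_target_encoding(categories, targets, prior, prior_weight=1.0):
    """Encode each row from the targets of EARLIER rows with the same category.

    Hypothetical helper: a simplified stand-in for a training-time encoding.
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of what has been seen so far (assumed formula).
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        # Only now add the current row's target to the running statistics.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


def global_target_encoding(categories, targets, prior, prior_weight=1.0):
    """Encode each row from ALL rows with the same category.

    Hypothetical helper: a simplified stand-in for a prediction-time encoding.
    """
    sums, counts = {}, {}
    for cat, y in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + y
        counts[cat] = counts.get(cat, 0) + 1
    return [
        (sums[cat] + prior * prior_weight) / (counts[cat] + prior_weight)
        for cat in categories
    ]


# Categorical column and labels from the question.
categories = ["b", "b", "e", "c", "d"]
targets = [1.0, 3.51, 0.43, 0.45, 0.55]
prior = sum(targets) / len(targets)  # assumed prior: the global target mean

train_time = ordered_target_encoding(categories, targets, prior)
predict_time = global_target_encoding(categories, targets, prior)

for cat, ordered, fixed in zip(categories, train_time, predict_time):
    print(f"{cat}: ordered={ordered:.3f}  global={fixed:.3f}")
```

In this sketch every level that occurs only once ("e", "c", "d") has no earlier rows to draw on, so during training it is encoded with the prior alone, whereas at prediction time each of those levels gets its own value. That is enough to make the "same" dataset look different to the model in the two passes. When the column is numerical, no target encoding is applied, which is consistent with the two curves coinciding in the second experiment.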
