4

I'm trying to optimize the hyperparameters of my model using the Bayesian approach with the hyperopt library. I have to code a loss to evaluate each iteration of the optimization, and a classic metric is usually chosen, such as

loss = 1 - accuracy

Now, since I want both a model that performs well on test data and a model that does not overfit, I have defined the loss as follows:

train_loss = 1 - train_f1_score
test_loss = 1 - test_f1_score
loss = test_loss * 10^{test_loss - train_loss}

where the test F1 is the mean F1 over a 3-fold cross-validation. The idea is that the loss becomes higher for an overfitted model, even if the test score is good.
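
For concreteness, here is a minimal sketch of how such an objective could be wired into hyperopt, assuming a scikit-learn classifier and a binary target; the `RandomForestClassifier`, the search space, and the data names are illustrative assumptions, not fixed parts of the actual setup:

```python
# Sketch of a hyperopt objective using the compound loss described above.
# The classifier, the search space, and the data names are illustrative.
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def objective(params, X, y):
    model = RandomForestClassifier(**params)
    # 3-fold CV; scoring="f1" assumes a binary target. return_train_score=True
    # gives both the in-fold (train) and held-out (test) F1 scores.
    scores = cross_validate(model, X, y, cv=3, scoring="f1",
                            return_train_score=True)
    train_loss = 1 - scores["train_score"].mean()
    test_loss = 1 - scores["test_score"].mean()
    loss = test_loss * 10 ** (test_loss - train_loss)
    return {"loss": loss, "status": STATUS_OK}

space = {
    "n_estimators": hp.choice("n_estimators", [100, 300, 500]),
    "max_depth": hp.choice("max_depth", [3, 5, 10, None]),
}

# best = fmin(lambda p: objective(p, X_train, y_train),
#             space=space, algo=tpe.suggest, max_evals=50)
```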

I have a doubt: am I missing some particular feature that a good evaluation metric needs to have?

  • It sounds like you are interested in the topic of [tag:scoring-rules], about which we have several questions. – Sycorax Aug 27 '19 at 15:26

2 Answers

3

Accuracy is a misleading KPI for predictive performance. Note that every criticism raised in that thread against accuracy applies equally to the $F_1$ (or, more generally, any $F_\beta$) score.

As Sycorax comments, use proper scoring rules on probabilistic predictions. That is also my recommendation in the linked thread.

Contrary to user2672299, I'm all for cross-validation. (Note that you can of course work with probabilistic predictions and scoring rules in cross-validation.) I would just recommend that you keep a validation sample that you evaluate your final model on, because, as user2672299 notes, you can overfit "to a cross-validation".

As to particular features a good metric should have: it should reward calibrated and sharp probabilistic predictions. Accuracy does not. The tag wiki for our scoring-rules tag contains pointers to literature explaining this point and why it makes sense.
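
A minimal sketch of what this looks like with scikit-learn (the classifier is illustrative, and `X`, `y` are assumed to be your features and labels): score the predicted probabilities with the log loss, both inside cross-validation and, once, on a held-out sample.

```python
# Sketch: evaluating probabilistic predictions with a proper scoring rule
# (the log loss), inside CV and on a final held-out sample. X, y are assumed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score, train_test_split

X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0)  # illustrative model

# Cross-validated log loss on the development data. scikit-learn negates it
# ("greater is better"), so flip the sign to recover the actual log loss.
cv_log_loss = -cross_val_score(clf, X_dev, y_dev, cv=3,
                               scoring="neg_log_loss").mean()

# The held-out sample is touched exactly once, for the final evaluation.
clf.fit(X_dev, y_dev)
holdout_log_loss = log_loss(y_holdout, clf.predict_proba(X_holdout))
```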

Stephan Kolassa
  • Thanks for clarifying my answer. I am not against cross-validation; I am against not evaluating on a hold-out sample. The "validation sample" you mention is technically the wrong term: what you are referring to is the test sample. The validation samples are the ones used in the cross-validation. The important take-away from my post (and yours) is that test data has to be treated as non-existent until you do your "final test" (e.g., before application, publication, or any other situation in which you require an unbiased estimate). – user2672299 Oct 29 '20 at 11:31
  • @user2672299: thanks. Note that [the meanings of "test" and "validation" sets are frequently flipped](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) (and to be honest, I find it more logical to use "validation" to refer to the unseen new data). It's just important to clarify what we mean by our terms. – Stephan Kolassa Oct 29 '20 at 12:31
  • I would recommend "development set" instead of "validation set" in case the terms confuse you (as suggested by Andrew Ng). – user2672299 Oct 30 '20 at 08:42
  • Thanks for the great answer! Just to understand: are you suggesting using something like the log loss as the performance metric for CV? Would it also be good for hyperparameter optimization via CV? Second question: if I replace the train/test loss with the log loss in the above formula, does the overall "loss" still have significance? – Matteo Felici Nov 04 '20 at 06:56
  • The log-loss is a proper scoring rule, so I'm very much in favor. You can use it for hyperparameter tuning as well - but of course, you can overfit a proper scoring rule just like any other loss, so be careful. Yes, I would simply recommend replacing accuracy with the log-loss. – Stephan Kolassa Nov 04 '20 at 06:59
  • Ok, so in order to manage the possible overfitting, could I use the above formula `loss = test_loss * 10^{test_loss - train_loss}` replacing train/test (or holdout) loss with log-loss? With this, the loss will increase if I have a big difference, even if the absolute `test_loss` is quite low. – Matteo Felici Nov 10 '20 at 13:58
  • That is a possibility. Of course, you can still overfit to this particular compound loss function - after all, you are again using your test sample. Nothing is completely proof against overfitting. – Stephan Kolassa Nov 10 '20 at 17:53
0

What you are missing is the purpose of the evaluation metric and of the "test" set (which is actually a validation set). You are not allowed to use the "test" loss in your loss function, because then your "test" set is no longer an independent sample.

If you use your "test_loss" in the training step, your "test" error is confounded (i.e., worthless).

Therefore, what you are doing is using "test" data as training data.
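
A minimal sketch of the separation this answer argues for, assuming scikit-learn (the model, the parameter grid, and `X`, `y` are illustrative): hyperparameters are tuned by cross-validation on the training data only, and the held-out test set is used exactly once at the end.

```python
# Sketch: the test set stays untouched during tuning; only the training
# portion is used for cross-validation. Model and grid are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# X, y: assumed feature matrix and (binary) labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hyperparameters are selected with 3-fold CV on the training data only.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"max_depth": [3, 5, 10]},
                      scoring="f1", cv=3)
search.fit(X_train, y_train)

# The test set is used exactly once, for the final unbiased estimate.
test_f1 = f1_score(y_test, search.predict(X_test))
```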

user2672299
  • My bad, I explained myself poorly. That "test" loss is based on cross-validation, so I'm using the average F1 on the held-out folds as the "test" score, but the data used is entirely from the training set. – Matteo Felici Aug 29 '19 at 08:25
  • That does not change much, because then your validation set is confounded. In this case you cannot choose a model based on independent validation data, and you use your validation data as training data. The model selected with your loss function will overfit to your validation data. – user2672299 Aug 29 '19 at 08:46
  • But I have a "real", unused test set to assess the model created with hyperparameter tuning. I have a training set, used in a 3-fold cross-validation, and based on the average of the held-out scores I choose the hyperparameters. Then I use the test set to evaluate the resulting model. What part am I missing? – Matteo Felici Aug 29 '19 at 09:07
  • The CV (or validation set) is used to select your best model. If you confound it you cannot choose the best model, because you overfit to your validation data. – user2672299 Aug 29 '19 at 09:16
  • With "best model" do you mean best hyperparameters for given ML algorithm? If that's the case, that's what I'm doing. I use CV (or validation) to evaluate models (hyperparameters) and choose the best ones. What do you mean with "confound"? – Matteo Felici Aug 29 '19 at 09:41
  • Basically you break the evaluation. – user2672299 Aug 29 '19 at 12:06
  • Because your test can then no longer report the best model without bias: it will always be biased toward the model that was tuned using the CV test_loss data. – user2672299 Aug 29 '19 at 12:07
  • Your question is like: why can I not use the training error to measure the performance of my model? I think you know the answer to that question. – user2672299 Aug 29 '19 at 12:08
  • If you use validation/test/CV_validation data for your training, it becomes training data. – user2672299 Aug 29 '19 at 12:09
  • So what is the correct way to use the CV to tune hyperparameters? – Matteo Felici Aug 29 '19 at 12:18
  • During the training step the validation/test/CV_validation data is not available (as if it had not been generated when you train the model). Then you use CV to tune hyperparameters correctly. – user2672299 Aug 29 '19 at 12:40
  • Sorry, but the CV data IS available during the training of the model, since with 3-fold CV we train 3 different models on 3 different subsets of the training set and evaluate them on the 3 different held-out subsets of the training set. The training and hyperparameter tuning processes go together. For CV I'm following a procedure similar to https://scikit-learn.org/stable/modules/cross_validation.html – Matteo Felici Aug 29 '19 at 12:51
  • I somewhat agree with this post, but I cannot upvote it because I think that, taken out of context, it will do more bad than good. Yes, we might overfit the cross-validation, but that's mostly a byproduct of a bad CV procedure. Can a 20x5-fold CV still overfit? Yes, but practically... not easily. Is it better than just keeping 20% of your data as a hold-out test set (given we do not have a very large dataset to begin with)? Maybe... but we will have no idea of the variability of our metric, plus it will lead to the temptation to "tweak things a bit", while the repeated CV stops us from doing that. – usεr11852 Oct 28 '20 at 18:00
  • Somewhat agree with you, but that is not what he is doing in his metric @usεr11852: he uses the CV validation set in his loss function and hence optimizes his ML model using this metric, which makes his CV validation set a training set. Everything that is a training set is not a validation set. Dunno how to make myself clearer. – user2672299 Oct 29 '20 at 11:45