> Goal is to find the top-predictors related to hospitalisation (this could be indicative of disease deterioration) --> These findings may guide new research on these predictors and are not directly intended as prognostic tool.
First, one hopes that the "top-predictors related to hospitalisation" won't differ much among these models. There's no need to choose just one model to reach that goal. Showing results of several models all pointing to the same predictors will strengthen your case.
If the models do differ substantially among top predictors, then there is a potentially serious problem. You would need to examine reasons why, and decide which of the models might best "guide new research on these predictors" based more on your understanding of the clinical situation.
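One way to make that comparison concrete is to put all the model types on a common importance scale, for example with permutation importance. Here is a minimal sketch, not your pipeline: the synthetic data, the model list, and the choice of scorer are all placeholders for your own.

```python
# Sketch (placeholder data and models): do different model types broadly
# agree on the top predictors? Permutation importance puts them on a
# common scale so the rankings can be compared directly.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, weights=[0.7, 0.3], random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])

models = {
    "logistic_l1": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
}

rankings = {}
for name, model in models.items():
    model.fit(X, y)
    imp = permutation_importance(model, X, y, scoring="neg_log_loss",
                                 n_repeats=20, random_state=0)
    rankings[name] = pd.Series(imp.importances_mean, index=X.columns)

# Rank predictors within each model; large disagreements near the top
# are the warning sign discussed above.
print(pd.DataFrame(rankings).rank(ascending=False).sort_values("random_forest"))
```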
The exception to the above would be the LASSO models. Predictors in clinical studies tend to be highly inter-correlated, and LASSO's choice of which of a set of correlated predictors to retain can be very sample dependent.
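You can see that sample dependence directly by refitting the LASSO on bootstrap resamples and counting how often each predictor is retained. A rough sketch, with placeholder data (the `n_redundant` features stand in for correlated clinical predictors) and an arbitrary penalty strength:

```python
# Sketch of LASSO's sample dependence with correlated predictors: refit an
# L1-penalized logistic model on bootstrap resamples and count how often
# each predictor is retained. Data and penalty C are placeholders.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           n_redundant=5, random_state=0)  # redundant = correlated predictors
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])

selection_counts = pd.Series(0, index=X.columns)
n_boot = 200
for b in range(n_boot):
    Xb, yb = resample(X, y, random_state=b)   # one bootstrap sample
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Xb, yb)
    selection_counts += (np.abs(lasso.coef_.ravel()) > 1e-8)

# Predictors kept in only a modest fraction of resamples are being chosen
# largely at the whim of the particular sample.
print((selection_counts / n_boot).sort_values(ascending=False))
```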
Second, with that goal in mind, the particular confusion matrix that appears to be best becomes less of an issue. The critical thing is to have well calibrated models. Calibration represents the quality of the model over the entire probability range, assessed by measures like log-loss or the Brier score. Then focus further work on the predictors that are both important in calibrated models and might be amenable to interventions to reduce avoidable hospitalizations.
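For example, a calibration check is only a few lines with scikit-learn; the synthetic data and random-forest model below are stand-ins for your own.

```python
# Sketch of checking calibration rather than a single confusion matrix:
# Brier score, log-loss, and a reliability (calibration) table.
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]

print("Brier score:", brier_score_loss(y_te, p))
print("Log-loss:   ", log_loss(y_te, p))

# Observed event fraction vs. mean predicted probability in bins; a well
# calibrated model tracks the diagonal over the whole probability range.
obs, pred = calibration_curve(y_te, p, n_bins=10)
print(pd.DataFrame({"mean_predicted": pred, "observed_fraction": obs}))
```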
Third, the differences in confusion matrices might be overstated in your examples. The XGBoost matrix is evidently based on a probability cutoff of 0.5, while a different cutoff might perform better. Also, and maybe more important, the matrices are based upon a single particular set-aside test set. There's a danger that a model appearing to be better with that test set might be worse with a different training/test split choice, and thus might not be the model that best represents the characteristics of the underlying population.
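It's easy to check how fragile a single confusion matrix is. A small sketch (placeholder data, and an sklearn gradient-boosting model standing in for XGBoost) that varies both the cutoff and the train/test split:

```python
# Sketch of how much a confusion matrix depends on the probability cutoff
# and on the particular train/test split. Data and model are stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=0)

for split_seed in range(3):                      # three different set-aside test sets
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=split_seed)
    model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    for cutoff in (0.3, 0.5, 0.7):               # the 0.5 default is not sacred
        print(f"split {split_seed}, cutoff {cutoff}:")
        print(confusion_matrix(y_te, (p >= cutoff).astype(int)))
```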
You can clarify these issues by resampling the data to validate the model-building process. You are already using resampling in cross-validation (within your training set) to choose hyperparameter values. Try using bootstrap resampling at an even higher level.
With data sets of less than several thousand cases, it's best to use the entire data set to build the model and then to test modeling performance via bootstrapping. Try the following for each of your model types and scoring functions, which shouldn't be too onerous with your automated approach and only a few hundred cases.
After building a model with the full data set,* take repeated bootstrap samples from your data. Perform your entire modeling process (including hyperparameter optimization) on each bootstrap sample as a training set, then use the full data sample as the test set. As bootstrapping mimics the process of taking your data sample from the underlying population, this process provides an estimate of how well your model-building process would work on multiple samples from that population.
Examine the performance of each of your model types and scoring functions over many bootstrap samples. Look at average performance over the bootstrap-based models as a measure of overall quality. Look at differences between training-set and test-set predictions to evaluate modeling bias. For models that do predictor selection or rank predictor importance, see how well those selections or rankings are maintained among the bootstrap-based models.
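Here is a minimal sketch of that outer bootstrap loop, assuming a scikit-learn-style pipeline. The synthetic data, the random-forest model, the grid, the Brier-score criterion, and the helper name `fit_whole_process` are all placeholders; the point is only that the *entire* process, tuning included, is repeated on each bootstrap sample and then scored on the full data.

```python
# Sketch of the bootstrap validation described above: repeat the whole
# model-building process on each bootstrap sample (as training set), then
# score each resulting model on the full data set (as test set).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

X, y = make_classification(n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=0)

def fit_whole_process(X_train, y_train):
    """The *entire* model-building process, including hyperparameter tuning."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid={"max_depth": [3, 5, None],
                                      "min_samples_leaf": [1, 5, 10]},
                          scoring="neg_brier_score", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_

full_model = fit_whole_process(X, y)               # the model you actually report

train_scores, test_scores = [], []
n_boot = 50                                        # use a few hundred in practice
for b in range(n_boot):
    idx = resample(np.arange(len(y)), random_state=b)   # bootstrap row indices
    model = fit_whole_process(X[idx], y[idx])
    train_scores.append(brier_score_loss(y[idx], model.predict_proba(X[idx])[:, 1]))
    test_scores.append(brier_score_loss(y, model.predict_proba(X)[:, 1]))

print("mean Brier on bootstrap training sets:", np.mean(train_scores))
print("mean Brier on full data (test):       ", np.mean(test_scores))
print("estimated optimism:                   ", np.mean(test_scores) - np.mean(train_scores))
```

The same loop can also collect each bootstrap model's selected predictors or importance rankings, to see how stable they are across resamples.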
It's possible, after you have gone through that process, that the random forest model based on the F1_weighted score will still perform best by your criteria. But then you will have documented its superiority in a robust way that should help allay any theoretical concerns.** As your goal is to identify important predictors rather than to develop a prognostic tool, you might worry less about how you got to a model and worry more about how well the model is pointing you to critical predictors.
I'd recommend looking carefully at Frank Harrell's course notes for guidance. Although written from the perspective of regression models, the principles apply very generally. Chapter 5 is particularly relevant to assessment and validation of models, and Section 4.10 touches on ways to compare models.
*Two thoughts on your modeling per se. For one, your class imbalance isn't that bad. Your attempts to adjust for class imbalance might be making problems worse with some types of models. For another, make sure that this event-probability/classification model is appropriate for your data. If you are examining something like 60-day hospital readmissions, when you have complete data on all cases for the entire time span, that's OK. If you're looking at a longer time span over which all you can say is that some people haven't been hospitalized yet, you should be taking that into account with a time-based survival model that handles that "right-censoring" of times to events. There are random-forest, gradient-boosting, and several standard survival-regression implementations available for survival models, at least in R.
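If your stack is Python rather than R, the scikit-survival package provides comparable models; here is a rough sketch with made-up follow-up times and censoring, just to show the shape of a right-censored analysis (the package choice and all the data are assumptions, not part of your setup).

```python
# Sketch of a right-censored alternative using the Python scikit-survival
# package, standing in for the R implementations mentioned above.
# Follow-up times, censoring, and features are synthetic placeholders.
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
time_to_event = rng.exponential(scale=180, size=300)   # days to hospitalisation
follow_up = rng.uniform(30, 365, size=300)             # administrative censoring
event = time_to_event <= follow_up                     # True only if hospitalisation observed
observed_time = np.minimum(time_to_event, follow_up)

y = Surv.from_arrays(event=event, time=observed_time)  # structured (event, time) outcome
model = RandomSurvivalForest(n_estimators=500, min_samples_leaf=10, random_state=0)
model.fit(X, y)

# score() returns Harrell's concordance index rather than a classification metric
print("concordance index:", model.score(X, y))
```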
**In terms of those theoretical arguments, even though F1_weighted isn't a proper scoring rule, it overcomes one limit of the standard F1 score (ignoring true negatives) by calculating separate F1 scores with each class treated as the positive class, then taking a weighted average. Log-loss puts a lot of weight on the extremes of the probability scale and might not always be the best choice among proper scoring rules. You could, for example, use the Brier score instead of log-loss as the criterion for model building to minimize problems at the probability extremes.