This question is related to, but not the same as, link. I have read a lot of posts here as well as a post from Frank Harrell.
It is very clear to me that accuracy is not a great metric, that a probability is usually where a statistician/data scientist should stop, and that a probability cutoff is not a hyperparameter, because it is tied to the actual decision rather than to the model itself (you don't have to retrain the model to change it).
Suppose we have a probabilistic prediction model and that we also know the decision maker's utility function. Should we grid-search (or use any other search) over the hyperparameters and choose the probability cutoff that maximizes the utility function (on the training set, of course, to avoid introducing additional bias)? This would mean selecting the hyperparameters and their corresponding probability cutoff at the same time.
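To make this first option concrete, here is roughly what I have in mind. This is a minimal sketch, assuming a binary problem, a made-up utility matrix, and scikit-learn's LogisticRegression standing in for any probabilistic model; the cutoff is evaluated on out-of-fold predictions within the training data, which is one way to avoid tuning it on in-sample fits.

```python
# Joint search over hyperparameters and cutoff, maximizing mean utility.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

# Hypothetical utilities U[decision, true class]: acting on a true positive
# pays 5, acting on a false positive costs 1, doing nothing pays 0.
U = np.array([[0.0, 0.0],    # decision = 0 (don't act)
              [-1.0, 5.0]])  # decision = 1 (act)

def mean_utility(y_true, p, cutoff):
    d = (p >= cutoff).astype(int)
    return U[d, y_true].mean()

cutoffs = np.linspace(0.05, 0.95, 19)
best = (-np.inf, None, None)
for C in [0.01, 0.1, 1.0, 10.0]:              # hyperparameter grid
    cv = StratifiedKFold(5, shuffle=True, random_state=0)
    p_oof = np.zeros(len(y))                  # out-of-fold probabilities
    for tr, va in cv.split(X, y):
        m = LogisticRegression(C=C, max_iter=1000).fit(X[tr], y[tr])
        p_oof[va] = m.predict_proba(X[va])[:, 1]
    for t in cutoffs:                         # cutoff tuned jointly with C
        u = mean_utility(y, p_oof, t)
        if u > best[0]:
            best = (u, C, t)

print("best mean utility %.3f at C=%s, cutoff=%.2f" % best)
```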
Or should we rather aim to produce a model (with specified hyperparameters) based on some other metric (proper scoring rules, AUROC, ...) and optimize the probability cutoff afterwards? Such a system would be more robust to changes in the utility function (yes, that actually happens), but it may lead to worse results under a given utility function, right?
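For contrast, a sketch of this second option under the same assumptions as above: select hyperparameters with a proper scoring rule (log loss here, but that is just one choice), and only afterwards derive the cutoff from the current utility matrix. If the probabilities are reasonably calibrated, the expected-utility-maximizing cutoff has a closed form, which is what makes this approach robust to a changing utility function.

```python
# Decoupled approach: fit/select by a proper scoring rule, then derive the
# cutoff from the utility matrix without touching the model again.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

# Stage 1: choose hyperparameters by log loss (a proper scoring rule).
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="neg_log_loss", cv=5).fit(X, y)
model = search.best_estimator_  # the model we would carry forward

# Stage 2: cutoff implied by the utility matrix U[decision, true class].
# Act iff p*U[1,1] + (1-p)*U[1,0] >= p*U[0,1] + (1-p)*U[0,0], i.e.
# p >= (U[0,0]-U[1,0]) / ((U[0,0]-U[1,0]) + (U[1,1]-U[0,1])).
U = np.array([[0.0, 0.0], [-1.0, 5.0]])
cutoff = (U[0, 0] - U[1, 0]) / ((U[0, 0] - U[1, 0]) + (U[1, 1] - U[0, 1]))

print("chosen hyperparameters:", search.best_params_,
      "utility-implied cutoff: %.3f" % cutoff)
```

If the utility function changes later, only Stage 2 needs to be redone, which is the robustness I was referring to.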
The former, however, would let us compare classifiers and probabilistic models directly (on the validation set, when we are picking the final model) without converting the classifiers into some sort of probabilistic model (e.g. Platt scaling for an SVM, the fraction of trees voting for each class in a RF, ...).
If one were to use the latter approach (or if the utility function isn't known in advance), is it considered best practice to come up with a metric to optimize (during validation) that combines general metrics such as AUROC or the Brier score with the assumed utility function (based on limited domain knowledge)?