Is decision threshold a hyperparameter in logistic regression?

Question

Predicted classes from (binary) logistic regression are determined by using a threshold on the class membership probabilities generated by the model. As I understand it, typically 0.5 is used by default.

But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter? If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C)?

"As I understand it, typically 0.5 is used by default." Depends on the meaning of the word "typical". In practice, no one should be doing this. — Matthew Drury, Jan 31 '19 at 17:27
Very much related: [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) — Stephan Kolassa, Jan 31 '19 at 18:17
Strictly you don't mean logistic regression, you mean using one logistic regressor with a threshold for binary classification (you could also train one regressor for each of the two classes, with a little seeded randomness or weighting to avoid them being linearly dependent). — smci, Jan 31 '19 at 19:53

Sycorax · Answer 1 · 2019-02-01T15:34:34.797

The decision threshold creates a trade-off between the number of positives that you predict and the number of negatives that you predict -- because, tautologically, increasing the decision threshold will decrease the number of positives that you predict and increase the number of negatives that you predict.

The decision threshold is not a hyper-parameter in the sense of model tuning because it doesn't change the flexibility of the model.

The way you're thinking about the word "tune" in the context of the decision threshold is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN. However, the model remains the same, because this doesn't change the coefficients. (The same is true for models which do not have coefficients, such as random forests: changing the threshold doesn't change anything about the trees.) So in a narrow sense, you're correct that finding the best trade-off among errors is "tuning," but you're wrong in thinking that changing the threshold is linked to other model hyper-parameters in a way that is optimized by GridSearchCV.
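The distinction can be illustrated with a minimal sketch. It assumes a hand-rolled one-feature logistic regression fitted by gradient descent as a stand-in for scikit-learn's `LogisticRegression`; the toy data and the names `fit_logistic` and `classify` are hypothetical. Refitting with a different $C$ yields different coefficients, whereas changing the threshold reuses the same coefficients and only relabels the observations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, C, steps=5000, lr=0.1):
    """Gradient descent on mean log-loss plus an L2 penalty w**2 / (2 * C * n).

    Hand-rolled stand-in for a regularised logistic regression: smaller C
    means stronger regularisation. The intercept b is not penalised.
    """
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + w / (C * n)
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def classify(xs, w, b, threshold):
    """Apply a decision threshold to the fitted probabilities."""
    return [1 if sigmoid(w * x + b) >= threshold else 0 for x in xs]

# Hypothetical, non-separable toy data.
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

# Changing C requires refitting and gives a different model.
w_strong, b_strong = fit_logistic(xs, ys, C=0.1)
w_weak, b_weak = fit_logistic(xs, ys, C=10.0)
print("C=0.1 :", w_strong, b_strong)
print("C=10  :", w_weak, b_weak)

# Changing the threshold reuses the same (w, b); only the labels move.
preds_low = classify(xs, w_weak, b_weak, threshold=0.3)
preds_high = classify(xs, w_weak, b_weak, threshold=0.7)
print("threshold 0.3:", preds_low)
print("threshold 0.7:", preds_high)
```

No refit happens between the two `classify` calls, which is the sense in which the threshold sits outside the model.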

Stated another way, choosing the decision threshold reflects a choice on your part about how many false positives and false negatives you are willing to accept. Suppose you set the decision threshold to a completely implausible value like -1. All probabilities are non-negative, so with this threshold you will predict "positive" for every observation. From a certain perspective, this is great, because your false negative rate is 0.0. However, your false positive rate is then at the extreme of 1.0, so in that sense a threshold of -1 is terrible.
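A small sketch of the -1 thought experiment, using made-up probabilities and labels (the data and the helper `rates` are hypothetical, for illustration only):

```python
def rates(probs, labels, threshold):
    """Return (false negative rate, false positive rate) at a given threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return fn / (fn + tp), fp / (fp + tn)

# Hypothetical predicted probabilities and true labels.
probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

# Threshold of -1: everything is predicted positive.
print(rates(probs, labels, -1.0))  # (0.0, 1.0)

# A threshold inside (0, 1) trades the two error rates against each other.
print(rates(probs, labels, 0.5))
```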

The ideal, of course, is to have a TPR of 1.0, an FPR of 0.0, and an FNR of 0.0. But this is usually impossible in real-world applications, so the question then becomes "how much FPR am I willing to accept for how much TPR?" This is the motivation for ROC curves.
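As a rough sketch of what a ROC curve tabulates, the following sweeps the threshold over every distinct score and records the resulting (FPR, TPR) pair. The scores, labels, and the function name `roc_points` are hypothetical; in practice a library routine such as scikit-learn's `roc_curve` would normally be used instead.

```python
def roc_points(probs, labels):
    """Return (threshold, FPR, TPR) for each candidate threshold, highest first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Start above every score (nothing predicted positive), then each distinct score.
    thresholds = [float("inf")] + sorted(set(probs), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

# Hypothetical predicted probabilities and true labels.
probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

points = roc_points(probs, labels)
for t, fpr, tpr in points:
    print(f"threshold={t:<5} FPR={fpr:.2f} TPR={tpr:.2f}")
```

Each row is one possible operating point of the same fitted model; picking a threshold amounts to picking a row.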

Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold? — Nick, Feb 01 '19 at 08:32
The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters *changes the model* (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside `GridSearchCV`. — Sycorax, Feb 01 '19 at 13:49
@Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case. — amoeba, Feb 01 '19 at 16:16
@amoeba This is a thought-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context. — Sycorax, Feb 01 '19 at 16:26
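
The equivalence raised in this exchange can be written out explicitly. The notation is assumed here, not taken from the thread: $\sigma$ is the logistic function, $\beta_0$ the intercept, $\beta$ the remaining coefficients, and $t \in (0, 1)$ the threshold.

$$
\sigma(\beta_0 + x^\top \beta) \ge t
\iff \beta_0 + x^\top \beta \ge \log\frac{t}{1-t}
\iff \sigma\!\left(\Big(\beta_0 - \log\frac{t}{1-t}\Big) + x^\top \beta\right) \ge 0.5
$$

So thresholding at $t$ gives the same labels as thresholding at 0.5 after shifting the intercept by $-\log\frac{t}{1-t}$. The shifted intercept is in general no longer the maximum-likelihood estimate.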

Matthew Drury · Answer 2 · 2019-02-01T16:52:57.640

> But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter?

Yup, it does, sorta. It's a hyperparameter of your decision rule, but not of the underlying regression.

> If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C)?

This is a design error in sklearn. The best practice for most classification scenarios is to fit the underlying model (which predicts probabilities) using some measure of the quality of these probabilities (like the log-loss in a logistic regression). Afterwards, a decision threshold on these probabilities should be tuned to optimize some business objective of your classification rule. The library should make it easy to optimize the decision threshold based on some measure of quality, but I don't believe it does that well.
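A minimal sketch of that second step, assuming the probabilities come from an already-fitted model evaluated on held-out data. The validation scores, the cost values, and the helper names `total_cost` and `best_threshold` are all hypothetical; "business objective" is modelled here as a simple per-error cost, which is only one possible choice.

```python
def total_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Total misclassification cost at a given threshold."""
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn

def best_threshold(probs, labels, cost_fp, cost_fn, grid):
    """Return the first threshold in the grid with the lowest total cost."""
    return min(grid, key=lambda t: total_cost(probs, labels, t, cost_fp, cost_fn))

# Hypothetical held-out probabilities from a fitted model, with true labels.
val_probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
val_labels = [0, 0, 1, 0, 1, 1]

# Candidate thresholds 0.00, 0.05, ..., 1.00.
grid = [i / 20 for i in range(21)]

# Assumed costs: missing a positive is 10x worse than a false alarm, and vice versa.
t_fn_heavy = best_threshold(val_probs, val_labels, cost_fp=1, cost_fn=10, grid=grid)
t_fp_heavy = best_threshold(val_probs, val_labels, cost_fp=10, cost_fn=1, grid=grid)
print("costly false negatives ->", t_fn_heavy)
print("costly false positives ->", t_fp_heavy)
```

The fitted probabilities are never touched; only the cut-off moves, and it moves in the direction that avoids the more expensive error.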

I think this is one of the places sklearn got it wrong. The library includes a method, `predict`, on all classification models that thresholds at 0.5. This method is useless, and I strongly advocate for not ever invoking it. It's unfortunate that sklearn is not encouraging a better workflow.

I also share your skepticism of the `predict` method's default choice of 0.5 as a cutoff, but `GridSearchCV` accepts `scorer` objects which can tune models with respect to out-of-sample cross-entropy loss. Am I missing your point? — Sycorax, Jan 31 '19 at 17:32
Right, agreed that is best practice, but it doesn't encourage users to tune decision thresholds. — Matthew Drury, Jan 31 '19 at 17:32
