
Let us assume we work on a 2-class classification problem. In my setting the sample is balanced; to be precise, it is a financial markets setting where up and down moves each occur with roughly 50:50 probability. The classifier produces results $$p_i = P[\text{class} = 1 \mid X_i].$$

We evaluate the model by log-loss on unseen/live data: $$ \text{logloss} = - \frac{1}{n}\sum_{i=1}^n \left(1_{o_i=1}\log(p_i) + (1-1_{o_i=1})\log(1-p_i) \right), $$ where $1_{o_i=1}$ denotes the indicator that observation $i$ equals 1. Confident but wrong predictions are heavily punished, and the values $p_i = 1/2$ can be seen as neutral predictions.
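For concreteness, a minimal sketch of how I compute this metric (Python with numpy; the `eps` clipping is only there to avoid evaluating $\log(0)$):

```python
import numpy as np

def logloss(y, p, eps=1e-15):
    # Average binary cross-entropy of predictions p against outcomes y in {0, 1}.
    p = np.clip(p, eps, 1 - eps)  # keep log() finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident but wrong prediction dominates the average:
# logloss(np.array([1]), np.array([0.99])) ~ 0.01
# logloss(np.array([0]), np.array([0.99])) ~ 4.61
# logloss(np.array([1]), np.array([0.50])) ~ 0.693  (the "neutral" value)
```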

Given that I might face data shifts out-of-sample, can I define a smoothed or shrunken version of my predictions $(p_i)_{i=1}^n$ in order to reduce the out-of-sample log-loss that overconfident wrong predictions would otherwise cause?

Is there literature on this? My first thought would be to cut off probabilities that deviate too much from 0.5, but I assume that there are better ways to do this. The model can be thought of as a regularized logistic regression or a neural net.
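As a sketch of that first thought (the band of $\pm 0.1$ around 0.5 is an arbitrary choice, just for illustration):

```python
import numpy as np

def clip_to_band(p, delta=0.1):
    # Cap predictions so they never deviate more than delta from the neutral 0.5.
    return np.clip(p, 0.5 - delta, 0.5 + delta)

# clip_to_band(np.array([0.05, 0.48, 0.90]))  ->  array([0.4 , 0.48, 0.6 ])
```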

Richi W
  • What do you mean by "taming" it? Sorry but it can mean anything. What do you mean by "a lot of uncertainty"? What do you want to accomplish? What exactly is the problem with your data and the classifier? Finally, what model do you use (loss function alone is not the model)? – Tim Nov 27 '17 at 09:44
  • @Tim I changed the wording a bit. What I mean is that there could be a data shift out-of-sample (a different regime). I could use a precise definition from the literature, but I am looking for an approach that works without a precise definition. The model: it should work for the usual ones (regularized logistic regression and neural nets). Should I edit the question further? The first thought is to cut off extremes, but maybe there is something better. – Richi W Nov 27 '17 at 09:51
  • What do you mean by data-shift? Why would you cut the extreme probabilities? How would you like to do this? From your question it is hard to see what you want to do and why. – Tim Nov 27 '17 at 10:01
  • @Tim by data shift I mean something like this: https://www.quora.com/In-machine-learning-what-is-dataset-shift I do not want to define it too strictly as I cannot be sure of the true model of the shift. If you look at logloss then it is bad to be wrong and confident: your logloss will be high. By confident I mean anything that exceeds 0.5 by too much. If I get a prediction of 0.6 then being wrong on this would be very bad, so I could cut it down to 0.52 ... or even closer to 0.5. Leaving the prediction at 0.6 would be risky. Do you think it would be enough to add this to the post? – Richi W Nov 27 '17 at 10:05
  • You seem to be trying to describe your idea for the solution of the problem and asking if it is correct without describing the problem -- such a question wouldn't be answerable... You really should try telling us what exactly changes about your data, how this affects your model, and why it is a problem. – Tim Nov 27 '17 at 10:18
  • @Tim isn't there always the problem that you train on some data and live data will be different? In some cases it can be "more different". The domain is financial markets and the procedure is: calibrate on some data, predict on live data, and do not risk $|p-1/2|$ being too large, because then logloss would be bad if you are wrong. – Richi W Nov 27 '17 at 11:17
  • Sure it is, but it can be different in many ways and there is no one-size-fits-all solution like shrinking everything by some arbitrary constant. If you ask about the solution, you first need to tell us what problem it solves. I can't see why, if your "new" dataset differs from the one that you used for training your model, "shrinking" probabilities could help here, or even give you valid results. – Tim Nov 27 '17 at 11:26
  • Moreover, if you want to use a loss that does not penalize the extreme probabilities too much, then simply do not use log-loss but something different. – Tim Nov 27 '17 at 11:28

1 Answer


You can try a simple smoothing: if your predicted probability is $p_i$ and you know the prior (0.5), then $$p_i \to (1-\alpha)\,p_i + \alpha \cdot \text{prior},$$

where $\alpha$ is a smoothing parameter between 0 and 1. This is similar to Laplace smoothing.
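A minimal sketch of this shrinkage and its effect on the log-loss (Python/numpy; the toy outcomes and the value alpha = 0.5 are only for illustration):

```python
import numpy as np

def shrink_to_prior(p, alpha, prior=0.5):
    # Linear shrinkage of predicted probabilities toward the prior.
    return (1 - alpha) * p + alpha * prior

def logloss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)  # keep log() finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])            # outcomes
p = np.array([0.9, 0.8, 0.6, 0.3])    # confident model, badly wrong on the second case
print(logloss(y, p))                              # raw predictions
print(logloss(y, shrink_to_prior(p, alpha=0.5)))  # shrunk toward 0.5, less punished for the mistake
```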

Dan