
I am reading about the model-fitting process for "Bayesian Knowledge Tracing" (BKT). Model details can be found here. In short, it is a modified Hidden Markov Model applied to an educational setting.

I have some questions about the code posted by the author here (from Columbia University), where the author appears to use squared loss on probabilities to check how good the fit is. In the attached documents the author says:

LikelihoodCorrect is calculated for each student action. Then LikelihoodCorrect is subtracted from StudentAction and squared to get the squared residual (SR), and the SRs are summed to get the SSR.

$$\text{likelihoodcorrect} = \text{prevL}\cdot(1 - \text{Slip}) + (1 - \text{prevL})\cdot\text{Guess}$$
$$\text{SSR} \mathrel{+}= (\text{StudentAction} - \text{likelihoodcorrect})^2$$

(In the data file the author provided, StudentAction is a binary variable, so this is 0 or 1 minus the predicted probability, then squared.)
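To make the computation concrete, here is a minimal Python sketch of how I read it. The standard BKT posterior-and-learning update, the parameter names, and the example values are my assumptions, not taken from the author's code:

```python
def bkt_ssr(actions, p_init, slip, guess, transit):
    """Sum of squared residuals over one student's binary responses,
    using the standard BKT equations (a sketch, not the author's code)."""
    prev_l = p_init  # P(skill is known) before the first opportunity
    ssr = 0.0
    for y in actions:  # y = 1 for a correct response, 0 for incorrect
        # likelihoodcorrect: predicted probability of a correct response
        p_correct = prev_l * (1.0 - slip) + (1.0 - prev_l) * guess
        ssr += (y - p_correct) ** 2
        # Bayesian update of P(known) given the observed response
        if y == 1:
            posterior = prev_l * (1.0 - slip) / p_correct
        else:
            posterior = prev_l * slip / (1.0 - p_correct)
        # Chance of learning the skill between opportunities
        prev_l = posterior + (1.0 - posterior) * transit
    return ssr

print(bkt_ssr([1, 0, 1, 1], p_init=0.3, slip=0.1, guess=0.2, transit=0.15))
```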

Should we use logistic loss instead? That is,

$$-\left[\,y\log(p) + (1-y)\log(1-p)\,\right]$$

instead of

$$(y-p)^2$$
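In code, the two candidate per-observation losses would look something like this (my own sketch, not from the author's code):

```python
import math

def logistic_loss(y, p):
    # Negative Bernoulli log-likelihood of outcome y under predicted probability p
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_loss(y, p):
    # Per-observation squared error (the Brier score, when averaged over observations)
    return (y - p) ** 2

print(logistic_loss(1, 0.8), squared_loss(1, 0.8))  # 0.223..., 0.040...
```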

Why do many publications use squared loss on a binary variable instead of logistic loss? For example, this paper by Carnegie Mellon University (page 7, end of Section 3):

All of the models were cross-validated using 10 randomly assigned user-stratified folds. For each of the cross-validation results we computed root mean squared error (RMSE) and accuracy (number of correctly predicted student successes and failures).
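For reference, both reported metrics are simple to compute from binary outcomes and predicted probabilities. This is just a sketch with made-up numbers, and the 0.5 classification threshold is my assumption, not necessarily the paper's:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])            # observed successes/failures
p = np.array([0.8, 0.6, 0.4, 0.9, 0.3])  # predicted probabilities of success

rmse = np.sqrt(np.mean((y - p) ** 2))            # root mean squared error
accuracy = np.mean((p >= 0.5).astype(int) == y)  # 0.5 threshold is my assumption

print(rmse, accuracy)  # accuracy here is 0.6: two of the five are misclassified
```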

Haitao Du
  • Why do you think squared loss is wrong? If the goal is to predict a binary value and you are outputting a probability, squared loss seems reasonable to me. Maybe I am misunderstanding something? – user3494047 Mar 01 '17 at 03:00
  • @user3494047 I have worked with classification losses for a long time and have never seen squared loss used this way. See this post: http://stats.stackexchange.com/questions/222585/what-are-the-impacts-of-choosing-different-loss-functions-in-classification-to-a – Haitao Du Mar 01 '17 at 03:17
  • I did not read it, but two things. (I am not too familiar with HMMs, but) for HMMs I believe you don't solve by gradient descent; that is, you never need to take the gradient of your objective or cost function, so you do not care about convexity. So yes, I agree with you that classification loss is probably what you care about most, BUT since you are outputting a probability (and not 0 or 1), it makes sense to check at every test instance how close your probability was to being correct. That is exactly what squared loss does. – user3494047 Mar 01 '17 at 15:02
  • But also remember: loss is used for two (almost independent) reasons. One is to define an objective function which a procedure then tries to minimize. The other is to report on the performance of a predictor; for this it is common to present a few different kinds of loss, because each loss says something a little different and may be more relevant to some users than others. In this case squared loss is still relevant in that it tells you how "close" your probabilities are to being correct, on average. – user3494047 Mar 01 '17 at 15:05
  • @user3494047 Thanks for your message; I understand your points. If we want to check how close the predictions are, why not use absolute deviation, for example? I still feel that using squared loss is strange in this case. In statistics there are many methods for comparing probabilities, and I have never seen squared loss used. – Haitao Du Mar 01 '17 at 15:18

1 Answer


Squared loss on binary outcomes is called the Brier score. It's valid in the sense of being a "proper scoring rule": you get the lowest expected squared error when you report the correct probability. In other words, logistic loss and squared loss are both minimized, in expectation, by the true probability.
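To see this concretely, here is a small numerical sketch (my own illustration, assuming a hypothetical true success probability of 0.7); the expected value of each loss is minimized at the true probability:

```python
import numpy as np

q = 0.7  # hypothetical true probability that y = 1
p = np.linspace(0.01, 0.99, 99)  # candidate predicted probabilities

# Expected squared loss: q*(1-p)^2 + (1-q)*p^2
exp_squared = q * (1 - p) ** 2 + (1 - q) * p ** 2
# Expected logistic loss: -(q*log(p) + (1-q)*log(1-p))
exp_logistic = -(q * np.log(p) + (1 - q) * np.log(1 - p))

print(p[np.argmin(exp_squared)])   # 0.70
print(p[np.argmin(exp_logistic)])  # 0.70
```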

This paper compares the properties of the Brier score ("square loss") to those of some other loss functions. The authors find that square loss / the Brier score converges more slowly than logistic loss.

Square loss has some advantages that might compensate in some cases:

  • It's always finite (unlike logistic loss, which can be infinite if $p=1$ and $y=0$ or vice versa)
  • Its penalty grows faster than linearly as the size of the error increases (so, compared to accuracy and absolute loss, it's less likely to let wildly inaccurate predictions slip through)
  • It's differentiable everywhere (unlike hinge loss and zero-one loss)
  • It's the most commonly implemented loss in software packages, so it might be the only option in some cases
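The first bullet point, for example, is easy to check numerically (a toy illustration; the particular numbers are arbitrary):

```python
import math

y, p = 0, 0.999999  # the event did not happen, but the model was almost certain it would

print((y - p) ** 2)      # ~1.0: squared loss is bounded above by 1
print(-math.log(1 - p))  # ~13.8: logistic loss for y = 0, diverging as p -> 1
```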
David J. Harris
  • Thanks for your reply. I understand that squared loss is a "valid" choice, but I feel another loss may be better, such as logistic or least absolute deviation, since we are not using the derivative of the squared loss, and squaring may have problems with outliers. – Haitao Du Mar 03 '17 at 15:32
  • Okay. I added some advantages of squared loss. Note that logistic loss can be even more sensitive to outliers (see the first bullet point). – David J. Harris Mar 03 '17 at 16:10
  • @hxd1011 In my opinion, the single most important difference between logistic loss and squared error is the first bullet point above. Should a predicted probability of 1 when the event fails to happen (or vice versa) be regarded as infinitely bad? You could argue either way. Choose logistic loss if you want it to be infinitely bad and squared loss if you want it to be only finitely bad. – Kodiologist Mar 03 '17 at 16:20
  • @Kodiologist absolutely agree. – David J. Harris Mar 03 '17 at 17:28
  • Is the Brier score just mean squared error? – Maths12 Nov 19 '20 at 17:47
  • Very much related: [Why is LogLoss preferred over other proper scoring rules?](https://stats.stackexchange.com/q/274088/1352) – Stephan Kolassa Nov 17 '21 at 19:54