
In several Kaggle competitions the scoring was based on "logloss". This relates to classification error.

Here is a technical answer, but I am looking for an intuitive one. I really liked the answers to this question about Mahalanobis distance, but PCA is not logloss.

I can use the value that my classification software puts out, but I don't really understand it. Why do we use it instead of true/false positive/negative rates? Can you help me so that I can explain this to my grandmother or a newbie in the field?

I also like and agree with the quote:

> You do not really understand something unless you can explain it to your grandmother.
> -- Albert Einstein

I tried answering this on my own before posting here.

Links that I did not find intuitive or really helpful include:

These are informative and accurate, but they are meant for a technical audience. They do not draw a simple picture or give simple, accessible examples. They are not written for my grandmother.

EngrStudent

1 Answer


Logloss is the negative logarithm of the product of the probabilities a forecaster assigned to what actually happened (usually divided by the number of predictions), so a bigger product means a smaller, better logloss. Suppose Alice predicted:

  • with probability 0.2, John will kill Jack
  • with probability 0.001, Mary will marry John
  • with probability 0.01, Bill is a murderer.

It turned out that Mary did not marry John, Bill is not a murderer, but John killed Jack. The product of the probabilities Alice assigned to what actually happened (using 1 minus her prediction for the events that did not occur) is 0.2*0.999*0.99=0.197802
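Written as a formula, for reference (this is the standard definition, not part of the original answer; $p_i$ is the probability the forecaster assigned to the outcome that actually occurred on prediction $i$, out of $N$ predictions):

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\log p_i$$

For Alice this is $-\tfrac{1}{3}\log(0.2\cdot 0.999\cdot 0.99)\approx 0.54$.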

Bob predicted:

  • with probability 0.5, John will kill Jack
  • with probability 0.5, Mary will marry John
  • with probability 0.5, Bill is a murderer.

The product is 0.5*0.5*0.5=0.125.

Alice is a better predictor than Bob: her product is larger, so her logloss is smaller.
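Here is a minimal sketch of the same arithmetic in Python (the variable names and the `logloss` helper are mine, not from the answer; it just takes the negative mean log of the probabilities above, which is how Kaggle-style logloss is computed):

```python
import math

# Probabilities each forecaster assigned to what actually happened:
# John killed Jack (p), Mary did NOT marry John (1 - p),
# Bill is NOT a murderer (1 - p).
alice = [0.2, 1 - 0.001, 1 - 0.01]
bob = [0.5, 1 - 0.5, 1 - 0.5]

def logloss(probs):
    """Negative mean log of the probabilities assigned to the true outcomes."""
    return -sum(math.log(p) for p in probs) / len(probs)

print(logloss(alice))  # ~0.540  (product 0.197802 -> smaller loss)
print(logloss(bob))    # ~0.693  (product 0.125    -> larger loss)
```

Lower is better, which is why a larger product of probabilities corresponds to a better score.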

user31264
  • Why does "product of all probabilities" work? This sounds like a relative of expectation maximization. – EngrStudent Apr 20 '16 at 21:23
  • Do you need a formal proof? It is in the "technical answer" mentioned by the topic starter. Do you need an informal "grandmother" reason why? You say: suppose this fellow gave correct predictions. What is the probability that everything happened as it really did? This is the product of probabilities. – user31264 Apr 20 '16 at 22:55
  • "Product of probabilities" isn't "grandma". Log of a product of probabilities is a sum of log-probabilities, which they use in expectation maximization and call "expectation". I think it is also encoded in K-L divergence. ... I think in grandma-talk you could say: "most likely" = highest overall probability of multiple events. There are two ways to get "highest": 1) maximize the combined probability or 2) minimize the negative combined probability. Most machine learning likes "gradient descent" or minimizing badness. Log-loss is the negative log-probability scaled by sample size, and it gets minimized. – EngrStudent Jun 30 '17 at 13:11
  • Here [link](https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class) they say "exp(-loss) is average probability of correct prediction." – EngrStudent Jun 30 '17 at 13:16
  • I liked the Bishop ref [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html). It is equation 4.108 and is the cross-entropy error function. – EngrStudent Dec 08 '17 at 16:22
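To make the exp(-loss) remark concrete (a small sketch using the numbers from the answer, not part of the original thread): exponentiating the negative logloss gives the geometric mean of the probabilities assigned to what actually happened, which is what the linked comment loosely calls the "average probability of correct prediction".

```python
import math

# Alice's logloss from the answer: -log(product of her probabilities) / 3
alice_logloss = -math.log(0.2 * 0.999 * 0.99) / 3
print(math.exp(-alice_logloss))          # ~0.583
print((0.2 * 0.999 * 0.99) ** (1 / 3))   # same value: the geometric mean
```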