There is a difference between probabilities and log probabilities. If the probability of an event is 0.36787944117, which happens to be $1/e$, then the log probability is -1.
Therefore, if you are given a bunch of unnormalized log probabilities and you want to recover the original probabilities, you first exponentiate all of your numbers, which gives you unnormalized probabilities, and then normalize them as usual. Mathematically, this is
$$p_j = \frac{e^{z_j}}{\sum_i e^{z_i}}$$
where $p_j$ is the probability of the $j$th class and the $z_i$ are the inputs to the softmax classifier.
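As a concrete illustration, here is a minimal NumPy sketch of this recipe. The `softmax` helper and the example inputs are my own, and subtracting the maximum before exponentiating is a standard overflow-avoidance trick rather than something the math requires:

```python
import numpy as np

def softmax(z):
    """Recover probabilities from unnormalized log probabilities z.

    Subtracting max(z) before exponentiating avoids overflow; the
    extra factor e^{-max(z)} cancels in the ratio, so the result
    is unchanged.
    """
    exp_z = np.exp(z - np.max(z))   # unnormalized probabilities
    return exp_z / np.sum(exp_z)    # normalize as usual

z = np.array([2.0, 1.0, -1.0])      # arbitrary unnormalized log probabilities
p = softmax(z)
print(p, p.sum())                   # probabilities that sum to 1
```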
The obvious question is why we bother with exponentiation at all. Why not use
$$p_j = \frac{z_j}{\sum_i z_i}$$
instead?
One reason is that the softmax plays nicely with the cross-entropy loss, which is $-E_q[\log p]$, where $q$ is the true distribution (the labels). Intuitively, the log cancels the exponent, which is very helpful for us.
It turns out that if you take the gradient of the cross-entropy loss with respect to the inputs to the classifier $\vec z$, you get
$$\vec p - \vec 1_j$$
when the ground truth label is class $j$ and $\vec 1_j$ is the corresponding one-hot vector. This is a very clean expression that leads to easy interpretation and optimization.
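As a sanity check, here is a short, self-contained sketch (with made-up inputs and my own helper names) that compares this analytic gradient against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def cross_entropy(z, j):
    """Cross-entropy loss -log p_j for a one-hot label at class j."""
    return -np.log(softmax(z)[j])

z = np.array([0.5, -1.0, 2.0])   # arbitrary classifier inputs
j = 2                            # ground-truth class

# analytic gradient: p - 1_j
analytic = softmax(z).copy()
analytic[j] -= 1.0

# central finite differences along each coordinate
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[k], j) -
     cross_entropy(z - eps * np.eye(3)[k], j)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```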
On the other hand, if you try to use unnormalized probabilities instead of unnormalized log probabilities, you end up with the gradient being
$$\frac{\vec 1}{\sum_i z_i} - \frac{1}{z_j}\vec 1_j$$
where $\vec 1$ is the all-ones vector. This expression is much less pleasant in terms of interpretability, and you can also see potential numerical problems when entries of $z$ are close to 0.
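To see the problem concretely, here is a hypothetical sketch of that gradient, assuming the entries of $z$ are positive so that $z_j / \sum_i z_i$ is a valid probability at all; note how the $j$th component blows up once $z_j$ gets small:

```python
import numpy as np

def naive_grad(z, j):
    """Gradient of -log(z_j / sum(z)) with respect to z.

    Every component gets 1 / sum(z); the j-th component additionally
    gets -1 / z_j, which explodes as z_j approaches 0.
    """
    g = np.full_like(z, 1.0 / np.sum(z))
    g[j] -= 1.0 / z[j]
    return g

print(naive_grad(np.array([1.0, 2.0, 3.0]), j=0))    # well-behaved
print(naive_grad(np.array([1e-8, 2.0, 3.0]), j=0))   # j-th entry is huge
```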
Another reason to use log probabilities can be seen from logistic regression, which is simply the two-class special case of softmax classification. The shape of the sigmoid works well because, intuitively, the probability of each class does not vary linearly with the inputs as you move across the feature space. The sharp bend in the sigmoid, which emphasizes the sharp boundary between the two classes, is really a result of the exponential term we apply to the inputs of the softmax.
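To make the correspondence concrete, here is a small sketch (with arbitrary scores of my own choosing): a two-class softmax with one score pinned at 0 reduces exactly to the sigmoid of the other score, since $e^s / (e^s + e^0) = 1 / (1 + e^{-s})$.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# Two-class softmax with scores [s, 0] matches the sigmoid of s.
for s in [-3.0, 0.0, 1.5]:
    print(softmax(np.array([s, 0.0]))[0], sigmoid(s))
```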