I was reading about knowledge distillation (in student-teacher networks, here) and it is stated that:
Advantages of Soft Targets:
- Soft targets contain valuable information on the rich similarity structure over the data, i.e. they say which 2s look like 3s and which look like 7s.
- They provide better generalization and lower variance in gradients between training examples.
- They allow the smaller student model to be trained on much less data than the original cumbersome model, and with a much higher learning rate.
The formula to calculate soft targets is
$$ q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$
When T = 1, this reduces to the standard softmax. This part confuses me: why don't we use soft targets in ordinary networks (not just in knowledge distillation), such as a typical CNN for classification? What is the advantage of using hard targets over soft targets in most of the networks we train?
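To make the temperature term concrete, here is a minimal sketch of the formula above in plain NumPy (the function name and the example logits are mine, not from the article I was reading):

```python
import numpy as np

def soft_targets(logits, T=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)"""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, 0.5])      # hypothetical teacher logits for 3 classes
print(soft_targets(logits, T=1.0))      # standard softmax:   ~[0.94, 0.05, 0.01]
print(soft_targets(logits, T=5.0))      # softer distribution: ~[0.51, 0.28, 0.21]
```

At T = 1 the output is the usual softmax, and as T grows the distribution over the non-target classes flattens, which is where the "which 2s look like 3s" similarity information becomes visible.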