I was reading about knowledge distillation (in student-teacher networks, here) and it is stated that:
Advantages of Soft Targets:
- Soft targets contain valuable information on the rich similarity structure over the data, i.e. they say which 2s look like 3s and which look like 7s.
- They provide better generalization and lower variance in gradients between training examples.
- They allow the smaller student model to be trained on much less data than the original cumbersome model, and with a much higher learning rate.
The formula to calculate soft targets is
$$ q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$
When T = 1, this reduces to the standard softmax. This part confuses me: why don't we use soft targets in ordinary networks (not just in knowledge distillation), such as a typical CNN for classification? What is the advantage of using hard targets over soft targets in most of the networks we train?
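To make the temperature term concrete, here is a minimal sketch of the formula above in plain NumPy (the function name and the example logits are mine, not from the article I was reading):

```python
import numpy as np

def soft_targets(logits, T=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)"""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, 0.5])      # hypothetical teacher logits for 3 classes
print(soft_targets(logits, T=1.0))      # standard softmax:   ~[0.94, 0.05, 0.01]
print(soft_targets(logits, T=5.0))      # softer distribution: ~[0.51, 0.28, 0.21]
```

At T = 1 the output is the usual softmax, and as T grows the distribution over the non-target classes flattens, which is where the "which 2s look like 3s" similarity information becomes visible.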