
I've recently been working on CNNs, and I want to know what the function of the temperature is in the softmax formula, and why we should use high temperatures to get a softer probability distribution.

The formula can be seen below:

$$\large P_i=\frac{e^{y_i/T}}{\sum_{k=1}^n e^{y_k/T}}$$

Sara

2 Answers


The temperature is a way to control the entropy of a distribution, while preserving the relative ranks of each event.


If two events $i$ and $j$ have probabilities $p_i$ and $p_j$ under your softmax, then adjusting the temperature preserves their ordering, as long as the temperature is finite and positive. Writing $p'_i$ and $p'_j$ for the probabilities at the new temperature:

$$p_i > p_j \Longleftrightarrow p'_i > p'_j$$


Heating a distribution increases the entropy, bringing it closer to a uniform distribution. (Try it for yourself: construct a simple distribution like $\mathbf{y}=(3, 4, 5)$, then divide all $y_i$ values by $T=1000000$ and see how the distribution changes.)
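The heating experiment above is easy to run yourself. Here is a minimal sketch in Python (the function name `softmax_with_temperature` is my own, not from the post):

```python
import numpy as np

def softmax_with_temperature(y, T=1.0):
    """Softmax of logits y, with each logit divided by temperature T."""
    z = np.asarray(y, dtype=float) / T
    z = z - z.max()  # shift for numerical stability; doesn't change the result
    e = np.exp(z)
    return e / e.sum()

y = [3, 4, 5]
print(softmax_with_temperature(y, T=1))          # peaked: most mass on the largest logit
print(softmax_with_temperature(y, T=1_000_000))  # nearly uniform: heating raises entropy
```

At $T=1$ the distribution is clearly peaked, while at $T=1{,}000{,}000$ all three probabilities are within rounding error of $1/3$, yet the ranking of the three events never changes.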

Cooling it decreases the entropy, accentuating the common events.

I’ll put that another way. It’s common to talk about the inverse temperature $\beta=1/T$. If $\beta = 0$ (infinite temperature), you obtain a uniform distribution. As $\beta \to \infty$ (temperature approaching zero), you approach a trivial distribution with all mass concentrated on the highest-probability class. This is why softmax is considered a soft relaxation of argmax.
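The cold limit is just as easy to check numerically. Below is my own sketch of a temperature-scaled softmax (not code from the answer): as $T \to 0$, i.e. $\beta \to \infty$, the output approaches a one-hot vector at the argmax.

```python
import numpy as np

def softmax(y, T=1.0):
    """Softmax of logits y at temperature T."""
    z = np.asarray(y, dtype=float) / T
    z = z - z.max()  # stabilize: exp of large positive values would overflow
    e = np.exp(z)
    return e / e.sum()

y = [3, 4, 5]
# Cooling (T -> 0, beta -> infinity): essentially all mass lands on the argmax.
print(softmax(y, T=0.01))
```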

Arya McCarthy

Temperature modifies how peaked the output distribution of the mapping is.

For example:

  • low-temperature softmax probabilities: [0.01, 0.01, 0.98]

  • high-temperature softmax probabilities: [0.2, 0.2, 0.6]

Temperature acts like a bias on the mapping, smoothing the output: the higher the temperature, the less the output resembles the distribution implied by the raw logits.

Think of it vaguely as "blurring" your output.
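To make the "blurring" concrete, here is a small sketch; the logits `[1.0, 1.0, 2.0]` are an arbitrary example of mine, chosen to roughly reproduce the flavor of the bullet points above:

```python
import numpy as np

def softmax(y, T=1.0):
    """Softmax of logits y at temperature T."""
    z = np.asarray(y, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [1.0, 1.0, 2.0]
print(np.round(softmax(logits, T=0.25), 2))  # low T: sharp, almost all mass on one class
print(np.round(softmax(logits, T=5.0), 2))   # high T: blurred toward uniform
```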

Conic