The softmax function, commonly used in neural networks to convert real numbers into probabilities, is the same function as the Boltzmann distribution: the probability distribution over the energy states of an ensemble of particles in thermal equilibrium at a given temperature T in thermodynamics.
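To make the claim concrete, here is how I read the correspondence (with the logits $x_i$ playing the role of negative energies and $k_B$ absorbed into $T$; this identification is my own reading, not something I've seen derived):

$$\mathrm{softmax}(x)_i = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}}, \qquad p_i^{\mathrm{Boltzmann}} = \frac{e^{-E_i/(k_B T)}}{\sum_j e^{-E_j/(k_B T)}}$$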
I can see some clear heuristic reasons why this is a practical choice:
- Even when input values are negative, softmax outputs positive values that sum to one.
- It's always differentiable, which is handy for backpropagation.
- It has a 'temperature' parameter controlling how lenient the network is toward small values: when T is very large, all outcomes become roughly equally likely; when T is very small, essentially all the probability goes to the input with the largest value (see the small sketch after this list).
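A minimal sketch of the temperature behaviour I mean (the function name and example logits are just mine, for illustration):

```python
import numpy as np

def softmax(x, T=1.0):
    """Softmax with temperature T; subtracting the max keeps exp() from overflowing."""
    z = (x - np.max(x)) / T
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])

print(softmax(logits, T=1.0))    # ordinary softmax
print(softmax(logits, T=100.0))  # very large T: nearly uniform, all outcomes about equally likely
print(softmax(logits, T=0.01))   # very small T: almost all mass on the largest logit
```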
Is the Boltzmann distribution used as softmax only for these practical reasons, or is there a deeper connection to thermodynamics/statistical physics?