19

The softmax function, commonly used in neural networks to convert real numbers into probabilities, is the same function as the Boltzmann distribution, the probability distribution over energy states for an ensemble of particles in thermal equilibrium at a given temperature T in thermodynamics.

I can see some clear heuristic reasons why this is practical:

  • Even if some of the input values are negative, softmax outputs positive values that sum to one.
  • It's always differentiable, which is handy for backpropagation.
  • It has a 'temperature' parameter controlling how lenient the network is toward small values: when T is very large, all outcomes become roughly equally likely; when T is very small, essentially only the outcome with the largest input is selected (see the sketch after this list).
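
For concreteness, here is a minimal NumPy sketch of a temperature-scaled softmax (the names softmax, logits and T are mine, not from any particular library). Writing the inputs as negative energies, z_i = -E_i, makes the correspondence with the Boltzmann weights exp(-E_i / T) explicit, and varying T shows the two limits described in the last point:

    import numpy as np

    def softmax(logits, T=1.0):
        """Temperature-scaled softmax; with z_i = -E_i this is the
        Boltzmann distribution p_i proportional to exp(-E_i / T)."""
        z = np.asarray(logits, dtype=float) / T
        z -= z.max()          # subtract the max for numerical stability
        w = np.exp(z)         # unnormalized 'Boltzmann weights'
        return w / w.sum()    # normalize (the 'partition function')

    logits = np.array([2.0, 1.0, 0.1])
    print(softmax(logits, T=1.0))    # ordinary softmax
    print(softmax(logits, T=100.0))  # T large: close to uniform
    print(softmax(logits, T=0.01))   # T small: nearly one-hot at the largest input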

Is the Boltzmann distribution only used as softmax for practical reasons, or is there a deeper connection to thermodynamics/statistical physics?

bjarkemoensted
  • I don't see why this is attracting close votes -- it's a perfectly reasonable question. – Matt Krause May 25 '18 at 16:59
  • +1 to @MattKrause -- NNs are surely on-topic, as is (I think) statistical physics. – Sean Easter May 25 '18 at 17:04
  • I can see how the question is more 'open' than most SO questions, in the sense that I'm not looking for a solution to a problem, but more general knowledge. However, I couldn't think of a better place to ask it or a more specific way to ask it. – bjarkemoensted May 25 '18 at 18:13
  • The thing that connects thermodynamics to pure statistics is information theory. I am certain that if you think carefully about the entropy of a system described by the Boltzmann distribution, and about exactly how the cross-entropy is minimized in ML classification problems that use a final softmax activation, there is a deeper connection having to do with the equivalence of thermodynamic entropy and information entropy. The area of energy-based out-of-distribution detection comes to mind too; there are certainly deeper connections here that require further thinking and exploration. – rajb245 Feb 16 '22 at 21:05

2 Answers

5

To my knowledge there is no deeper reason, apart from the fact that a lot of the people who took ANNs beyond the Perceptron stage were physicists.

Apart from the benefits you mention, this particular choice has further advantages. It has a single parameter, the temperature, that determines the output behaviour, and that parameter can in turn be optimized or tuned in its own right, as in the sketch below.
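
To make this concrete, here is a minimal sketch of what tuning that single parameter could look like (the functions and the toy logits/labels are my own, not from the answer): a crude grid search that picks the temperature minimizing the cross-entropy on a few held-out examples.

    import numpy as np

    def softmax(logits, T=1.0):
        z = np.asarray(logits, dtype=float) / T
        z = z - z.max(axis=-1, keepdims=True)   # numerical stability
        w = np.exp(z)
        return w / w.sum(axis=-1, keepdims=True)

    # Toy, made-up 'validation' data: rows of logits and the true class indices.
    logits = np.array([[4.0, 1.0, 0.0],
                       [0.5, 2.5, 1.0],
                       [3.0, 2.8, 0.2]])
    labels = np.array([0, 1, 1])   # the last example is deliberately a hard one

    def cross_entropy(T):
        p = softmax(logits, T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels]))

    # Crude grid search over the single temperature parameter.
    grid = np.linspace(0.1, 10.0, 200)
    best_T = min(grid, key=cross_entropy)
    print(best_T, cross_entropy(best_T), cross_entropy(1.0))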

In short, it is a very handy and well-known function that achieves a kind of 'regularization', in the sense that even the largest input values are mapped to bounded outputs.

Of course there are many other possible functions that fulfill the same requirements, but they are less well known in the world of physics, and most of the time they are harder to use.

cherub
3

The softmax function is also used in discrete choice modelling, where it is the same as the (multinomial) logit model: if you assume there is a utility function associated with each class, and that this utility equals the output of the neural network plus an error term following a Gumbel distribution, then the probability of belonging to a class equals the softmax function applied to the neural network's outputs. See: https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf
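
A quick way to see this numerically is a small simulation (a sketch of my own with made-up logits, not part of the original answer): give each class a utility equal to its logit plus independent standard Gumbel noise, let every simulated 'individual' pick the class with the highest utility, and the resulting choice frequencies match the softmax probabilities up to Monte Carlo error.

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.0, 0.1])   # made-up 'systematic utilities'

    # Discrete choice model: utility = logit + standard Gumbel error term,
    # and each simulated individual chooses the class with the highest utility.
    n = 200_000
    noise = rng.gumbel(size=(n, logits.size))
    choices = np.argmax(logits + noise, axis=1)
    empirical = np.bincount(choices, minlength=logits.size) / n

    # Multinomial logit / softmax probabilities for comparison.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    print(empirical)   # approximately equal to p, up to sampling noise
    print(p)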

There are alternatives to the logit model, such as the probit model, where the error term is assumed to follow a standard normal distribution, which is arguably a better assumption. However, the resulting choice probabilities have no closed form, so the likelihood is computationally expensive to evaluate, and the probit model is therefore not commonly used in neural networks.
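
For comparison, here is a rough sketch of the probit counterpart (same made-up logits, standard normal errors instead of Gumbel). The choice probabilities now have no closed form, so they have to be approximated, for example by brute-force simulation, which hints at why the probit likelihood is expensive.

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.0, 0.1])

    # Multinomial probit: utility = logit + standard normal error term.
    # The choice probabilities are integrals with no closed form, so
    # estimate them by simulation instead.
    n = 200_000
    noise = rng.standard_normal(size=(n, logits.size))
    choices = np.argmax(logits + noise, axis=1)
    print(np.bincount(choices, minlength=logits.size) / n)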

John