
I am trying to use the typical Multinomial model:

$y_{i} \sim Multinomial(N, \theta_{1}, ...,\theta_{p})$

where each theta is defined as:

$\theta_{i}=\frac{e^{V_{i} }}{\sum_{j} e^{V_{j}}}$

I have an issue with overflow of the exponentials. I know a solution is given in Softmax overflow, which basically suggests subtracting the maximum $V$ from the exponent in both the numerator and the denominator, but in my case that instead creates an underflow problem and many thetas become 0! Is there a better way to correctly calculate the probabilities (thetas) without running into overflow or underflow?

Here are my data and what happens when I use the softmax and logSumExp approaches:

library(dplyr)
library(matrixStats)

df <- data.frame(id         = seq(1,11,1),
                 real_probs = c(0.009589089, 0.007534532, 0.001860207, 0.05084681, 
                                0.1630076, 0.04244464, 0.007857292, 0.3566669, 
                                0.1793573, 0.1698827, 0.01095301),
                 V          = c(2653.254, 2364.478, 521.0566, 14615.73, 
                                46912.33, 12705.54, 1837.695, 102773.2, 
                                51674.49, 48950.77, 3129.556))

df_new <- df %>% 
  mutate(total_V = sum(V),
         pred_probs_without_softmax = V/total_V,      # naive normalization, no exponentiation
         exp_V = exp(V),                              # To show the Overflow problem
         exp_new_V = exp(V - max(V)),                 # To show the Underflow problem
         logsumex = logSumExp(V),                     # stable log(sum(exp(V)))
         pred_probs_with_softmax = exp(V - logsumex)) # softmax via the log-sum-exp trick
  • Normalize variables – Aksakal Nov 15 '21 at 20:05
  • or just sample using the logits $V_j$ directly, without the transformation – Sycorax Nov 15 '21 at 20:12
  • Wouldn’t V be negative? Normally $v=X\beta$, @Sycorax – Aksakal Nov 15 '21 at 20:14
  • @Aksakal Ah, I misremembered -- this method works directly on the *log probabilities*, not the logits. https://stackoverflow.com/questions/33738382/sampling-multinomial-from-small-log-probability-vectors-in-numpy-scipy – Sycorax Nov 15 '21 at 20:20
  • What problem specifically does underflow create? In many such applications all it does is limit your precision to whatever it was before--that is, it's not a problem at all. – whuber Nov 15 '21 at 21:19
  • You can do whatever you want, it's your model. The exponential makes softmax look like the well-known multinomial logit approach, which is similar to the Boltzmann distribution. So you'll have a lot of ground under your model. – Aksakal Nov 15 '21 at 21:58
  • @whuber it makes most of the Vs zero and thus it's not possible to retrieve the predicted probabilities (thetas) for each class. – Amin Shn Nov 15 '21 at 22:18
  • @whuber Added some code to show what the problem is. – Amin Shn Nov 15 '21 at 23:25
  • Is it really necessary to compute with probabilities that are smaller than $10^{-15}$? How exactly does your example fail? There shouldn't be any problem with the probabilities you list: they are all many, many orders of magnitude larger than anything that will underflow. – whuber Nov 15 '21 at 23:26
  • @whuber I have some significant probabilities around 16-17% and the biggest one being 35% and I can't retrieve them because of the overflow and underflows. Check the data again. – Amin Shn Nov 15 '21 at 23:33
  • No, it simply cannot be the case that underflow is the problem here. Overflows and underflows just don't happen like this. *You* need to show us what is wrong--please don't insist that we run your program in order to figure out what's going on. If that's necessary, then your question is on the wrong site and should be moved to [SO]. – whuber Nov 15 '21 at 23:34
  • @whuber I simply gave you the real probabilities and the fitted Vs by my model. Now how do you calculate the predicted probabilities for each class (id = 1,...,11) using the fitted Vs? – Amin Shn Nov 15 '21 at 23:42
  • Those values for the $V_i$ are ridiculously wrong. The underflow is working properly. The only probability that is practically distinguishable from zero is associated with the largest $V_i,$ which predicts a probability of $1.$ That clearly doesn't match the `real_probs` values. Thus, if those $V_i$ were output by some fitting procedure, your problem is a bug in the procedure, not overflow. – whuber Nov 16 '21 at 00:00
  • A closer look at your data suggests that your `V` gives the values of $\exp(V_i),$ because they are perfectly correlated with your `real_probs`. Thus, you shouldn't be exponentiating them at all. – whuber Nov 17 '21 at 19:30
  • Thanks @whuber, I guess that's possible because I am using a package that may have output the exponentiated results already, so there is no need to exponentiate again! I will check that out. – Amin Shn Nov 18 '21 at 10:19

1 Answer


This question seems to me to be built on a false premise about how precision in floating-point arithmetic works. If any $\theta_i$ value is positive but below the smallest number that can be represented in your floating-point arithmetic (at whatever bit-size you are using), then it is going to be rounded down to zero. That is an unavoidable aspect of applying a floating-point representation directly to the value of interest.

Now, by taking the softmax transformation, your goal is to compute the values of $\theta_i$ from the $V_i$ without overflow or underflow. You can do this by working with the values of $\log \theta_i$, using the fact that:

$$\log \theta_i = V_i - \text{logsumexp}(V_1,...,V_n).$$

Of course, even if you do this, any value of $\theta_i$ that is below the smallest floating-point number in your system is still going to get rounded to zero or one due to the limited precision. But who cares? If you can compute $\log \theta_i$ without overflow or underflow, that should give you what you need for any subsequent computations. And if your ultimate goal is to get a non-zero value for $\theta_i$ then you are going to have to decide how many bits you are willing to use to represent it.
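For illustration, here is a minimal sketch of staying in log space, using the `logSumExp` function from the `matrixStats` package already loaded in the question (the `V` vector is the one from the question's data):

# Compute log(theta_i) = V_i - logsumexp(V) without overflow
library(matrixStats)

V <- c(2653.254, 2364.478, 521.0566, 14615.73, 46912.33, 12705.54, 
       1837.695, 102773.2, 51674.49, 48950.77, 3129.556)

log_theta <- V - logSumExp(V)
# log_theta is finite and can be carried through later calculations;
# exponentiating it with exp(log_theta) will still underflow to zero for
# any theta_i below the smallest representable double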

Now, if you really want to get an accurate value for $\theta_i$, another way you can do this is to try to represent it as being "squeezed" between two rational numbers represented using arbitrary-precision arithmetic. Given sufficient computing power, that will allow you to compute the number down to an arbitrarily small interval with exactly known bounds, so you can avoid underflow.
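As a rough sketch of that idea, you could use an arbitrary-precision library; the `Rmpfr` package is one option (this uses high-precision floats rather than exact rational bounds, so treat it as an approximation of the squeezing approach, and the choice of 256 bits is arbitrary):

# Arbitrary-precision softmax using Rmpfr (assumed to be installed)
library(Rmpfr)

V_big     <- mpfr(df$V, precBits = 256)    # high-precision copies of the V values from the question
theta_big <- exp(V_big) / sum(exp(V_big))  # softmax without double-precision overflow/underflow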


Your data: In relation to your posted data, the values of `V` in your data frame `df` do not match the probability values you give, and it is not clear where they come from. If you want to find values of $V$ that correspond to these probabilities, you can use the `softmaxinv` function in the `utilities` package.

# Load the utilities package (provides softmaxinv)
library(utilities)

# softmaxinv returns 10 values for the 11 probabilities, since the softmax
# is unchanged by adding a constant to every V
softmaxinv(df$real_probs)
[1] -0.1329884 -0.3741176 -1.7729265  1.5352031  2.7001825  1.3545863 -0.3321723  3.4831880  2.7957656  2.7414939

As to the predicted probabilities you are getting with your `V` values, those are correct (to within a tiny margin of error that is far below the smallest representable floating-point value). Your eighth value of `V` is over $50,000$ larger than the nearest value, so its probability is over $\exp(50,000)$ times as big as the nearest one. Unsurprisingly, that large probability is rounded to one and all the other values to zero.
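You can check this gap directly from the data frame in the question:

# Gap between the largest V and the second largest
v_sorted <- sort(df$V, decreasing = TRUE)
v_sorted[1] - v_sorted[2]   # about 51099
# so the largest theta is roughly exp(51099) times the runner-up, and every
# other theta underflows to zero in double precision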

Ben
  • I can't calculate the thetas when almost all of the exp(V)s are Inf! I added the data and code to show it. I tried your logsumexp approach but that gives a single 1 and all the other predicted probabilities are 0, which is not desirable because I want to know by what margin I am overestimating/underestimating the probabilities. A simple V/sum(V) with no transformation gave very accurate results. – Amin Shn Nov 15 '21 at 23:23
  • See my update for a review of your code. – Ben Nov 15 '21 at 23:31
  • Vs are the utility functions (V = XB) fitted by my model and now I have to use them to calculate the predicted probabilities associated with each class (id). How do you calculate the predicted probabilities using the fitted values? – Amin Shn Nov 15 '21 at 23:40