
I am implementing multilayer perceptrons with the softmax activation function in Theano. In some extreme cases the inputs to the softmax are so large or so small that the resulting output distribution contains entries that are exactly zero.

When I take the logarithm of these distributions I get -inf, and the error propagates through the rest of the code.

My simple solution was to add a small constant to the distribution, like this:

self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) + 0.0000001

I already googled and found plenty of solutions that were more elegant (and exact) than mine, but the nature of Theano demands something different, since the log-likelihood will be symbolically differentiated to find the gradients for the algorithm.

I also find it odd that this problem is not commonly addressed for neural networks, logistic regression, and the like. Are these kinds of values so extreme that they actually indicate a problem elsewhere in my system? Am I doing something wrong here, or missing some point?


Update 1: Theano can give very different results depending on which mode tag you use. Here I was using mode = FAST_COMPILE, which apparently deactivates the numerical optimizations and stabilizations the compiler applies to the function graph. If you are doing this, try changing it to mode = FAST_RUN.
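
For anyone running into the same thing, this is roughly where the mode tag enters when compiling (a minimal sketch with made-up shapes and the usual negative log-likelihood, not my actual model):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')    # minibatch of inputs
y = T.ivector('y')   # integer class labels
W = theano.shared(np.zeros((5, 3), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(3, dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
nll = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

# FAST_COMPILE skips most graph rewrites; FAST_RUN applies them,
# including the numerical stabilizations.
f = theano.function([x, y], nll, mode='FAST_RUN')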


Update 2: This page lists some of the optimizations performed by Theano, including one specifically for softmax: local_log_softmax.
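
To check whether that rewrite actually fired you can print the optimized graph, and if for some reason it does not, the stable form can be written out by hand with the usual log-sum-exp trick (continuing the names from the sketch in Update 1; I haven't checked the exact op name that gets printed):

# With FAST_RUN the separate Log and Softmax ops should be fused into a
# single stabilized op in the printed graph.
theano.printing.debugprint(f)

# Hand-rolled stable log-softmax: subtracting the row-wise max keeps the
# exponentials bounded, and Theano differentiates it symbolically just fine.
logits = T.dot(x, W) + b
shifted = logits - logits.max(axis=1, keepdims=True)
log_p_y_given_x = shifted - T.log(T.exp(shifted).sum(axis=1, keepdims=True))
nll_stable = -T.mean(log_p_y_given_x[T.arange(y.shape[0]), y])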

gsmafra

2 Answers


Looks like you answered your own question. However, you should check how they implemented log-softmax. See my answer here for a numerically stable softmax function; the corresponding log-softmax is then:

import numpy as np

def log_softmax(q):
    # Shift by the max so the exponentials cannot overflow; the max is clamped
    # at 0 because this version includes an implicit extra component at logit 0.
    max_q = max(0.0, np.max(q))
    rebased_q = q - max_q
    return rebased_q - np.logaddexp(-max_q, np.logaddexp.reduce(rebased_q))

As long as your inputs are finite, I don't think this can ever be infinite.
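
As a quick sanity check (values picked only to force the overflow; this assumes the `import numpy as np` above):

q = np.array([1000.0, 0.0, -1000.0])

# Naive formula (with the same implicit extra component at logit 0):
# exp(1000) overflows to inf, so the result contains nan and -inf.
naive = np.log(np.exp(q) / (1.0 + np.exp(q).sum()))

# Stable version: finite, roughly [0., -1000., -2000.].
stable = log_softmax(q)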

Neil G
  • The implementation is in the module [nnet](https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/nnet.py), in the function `make_out_pattern` (no idea why that name). I don't know if it ends up being identical to your numpy version, but they are very similar – gsmafra Jul 26 '15 at 17:12
  • @gsmafra that function is pretty close (I elide the final component of the input and output). However, they should have used `logaddexp` rather than `log(sum(exp` as they did. This is what `numpy` and `scipy` do. – Neil G Jul 27 '15 at 17:59
  • Putting everything in one operation is probably done by the compiler; you can see [here](http://deeplearning.net/software/theano/tutorial/printing_drawing.html) some of the things Theano does to transform a raw "symbolic" function into real code – gsmafra Jul 27 '15 at 18:47
  • @gsmafra did you read the code? This: `tensor.log(tensor.exp(stabilized_X).sum(axis=1))` is unlikely to be transformed, and it's not numerically stable when any of the components are large. It's one thing to transform for computational optimization. It's another thing to transform for numerical stability, for which you would have to look at the whole computation. – Neil G Jul 27 '15 at 18:56
  • Is softmax the same as the sigmoid? I think the sigmoid also has a 1 in the denominator, whereas softmax does not. – Ciprian Tomoiagă Jan 27 '17 at 18:19
  • @CiprianTomoiaga “Sigmoid” just means s-shaped. Plenty of functions are s-shaped. You probably mean the logistic function. Softmax is the generalization of the logistic function to $n$ components. – Neil G Jan 27 '17 at 18:21
  • I did mean the _logistic function_, yes. Thank you for the clarification that softmax `==` multinomial logistic! Worth specifying it in the numerically stable one too – Ciprian Tomoiagă Jan 27 '17 at 18:31
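
(For reference: with logits $z_1, \dots, z_n$, $\mathrm{softmax}_i(z) = e^{z_i} / \sum_j e^{z_j}$; taking two logits $(x, 0)$ gives $e^x / (e^x + 1) = 1/(1 + e^{-x})$, i.e. the logistic function, which is where its extra 1 in the denominator comes from.)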

The input to the softmax (z) should not span a very broad range (say, on the order of [0, 100] or more). Is the softmax stacked on top of a network? If so, the input from the previous layer is bounded to [0, 1] (assuming sigmoid units), and in that case z = wx + b can only reach such a range if w and b take very large values. Check that your w and b do not grow too big. You can use L1/L2 regularization to reduce the variation in w. If the softmax takes raw data as input, consider normalizing it. In short, check your network and diagnose what makes the input to the softmax vary so widely.
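
For the regularization suggestion, the usual pattern in Theano is just to add the penalty terms to the cost before taking gradients (a sketch; `nll` and `W` stand for whatever cost and weight matrix your model uses, and the strengths are placeholders to tune):

l1_strength = 1e-4
l2_strength = 1e-3

# Penalizing large weights keeps z = wx + b, the input to the softmax,
# from drifting into an extreme range.
cost = nll + l1_strength * T.sum(T.abs_(W)) + l2_strength * T.sum(W ** 2)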

yasin.yazici
  • The problem isn't the input to softmax. The problem is that he is taking the log of its result, so that precision problems when softmax is close to zero yield negative infinity after the log. He needs to calculate log-softmax directly. – Neil G Jul 26 '15 at 16:47