Which output activation is recommended when predicting a variable with a lower but not an upper bound?

Question

I need to predict something using a neural network. The output values are bound to be non-negative, but there's not really an upper bound. I do know that the output is never going to be higher than a certain level in practice. Also, my expected output can should span all numbers between $0$ and the maximum.

So, which output activation function should I use? Sigmoid seems wrong, as the gradient would give too much importance to high value near the maximum. Unless I scaled my data so that the maximum value I ever encounter is around 0.6, so that this output behaves like a sigmoid near 0 and linearly in the rest of the image. Linear doesn't seem right as it allows negative outputs. ReLU by definition gives me an output in the correct range... but it's not really well behaved.

Any suggestion?

In what way is ReLU not well behaved? It seems like the obvious choice — Cameron Chandler, Oct 26 '20 at 09:38
In that it just "cuts" negative values, so its gradient is equally zero everywhere for negative values — user26067, Oct 26 '20 at 09:45
Right, but the bias addition before final activation should take care of that. The losses associated with outputting zero when the real value is non-zero will let the network learn to deal with this. — Cameron Chandler, Oct 26 '20 at 10:26

score 4 · Answer 1 · answered Nov 06 '20 at 16:39

Linear is actually quite reasonable for those reasons you mentioned (e.g. gradient doesn't get cut off at 0). It's not a big deal that you might get negatives, because at val/test time you can simply clip to 0.

This also depends on roughly how you expect the outputs to be distributed. For example, when predicting depth from an input image, common schemes include predicting 1/depth or log depth (which you can alternatively think of as using the activation 1/x or exp e, but directly transforming outputs is probably a better idea).

score 4 · Answer 2 · answered Nov 06 '20 at 17:07

You could try one of the smooth relaxations of ReLU, like softplus (albeit more often than not it's outperformed by ReLU, it's usually used for variance terms in VAEs, where zeros are not allowed):

$$\operatorname{softplus}(x) = \log(1+\exp x)$$

See, however, What are the benefits of using ReLU over softplus as activation functions? and ReLU outperforming Softplus

Which output activation is recommended when predicting a variable with a lower but not an upper bound?

2 Answers2