
All the discussion I have found online centers on the benefits of ReLU activations over SoftPlus. The general consensus seems to be that SoftPlus is discouraged because its gradients are more expensive to compute than ReLU's.

However, I have not found any discussion of the benefits of SoftPlus over ReLU, other than that SoftPlus is smoother: unlike ReLU, it is differentiable everywhere, including at x = 0.

I am using a novel loss function that contains gradients. Would SoftPlus therefore be a better option than ReLU for this use case?
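
For concreteness, here is a minimal sketch of the kind of setup I mean (PyTorch assumed; the layer sizes and the squared-derivative penalty below are placeholders, not my actual loss):

```python
import torch
import torch.nn as nn

# Small network with SoftPlus in the hidden layers and no activation
# on the final layer, as described above. Layer sizes are arbitrary.
net = nn.Sequential(
    nn.Linear(1, 32), nn.Softplus(),
    nn.Linear(32, 32), nn.Softplus(),
    nn.Linear(32, 1),
)

x = torch.linspace(-1.0, 1.0, 128).unsqueeze(1).requires_grad_(True)
y = net(x)

# First-order derivative of the network output w.r.t. the input.
# create_graph=True keeps this derivative differentiable so the loss
# term built from it can itself be backpropagated during training.
dy_dx, = torch.autograd.grad(
    y, x, grad_outputs=torch.ones_like(y), create_graph=True
)

# Placeholder gradient-based loss term (the real loss is not shown here).
loss = (dy_dx ** 2).mean()
loss.backward()  # gradients w.r.t. the network parameters
```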

  • Do you need higher-order derivatives? Because for ReLU they are all zero except at zero, where they are undefined. For softplus they exist everywhere. – Igor F. Jul 17 '21 at 10:24
  • I only need the first-order derivatives, but they are the derivatives of the network output (i.e., final layer) with respect to the inputs. I've used SoftPlus at all the intermediate layers, with no activation after the final layer. In this case, would SoftPlus being more differentiable than ReLU matter? – InternetUser0947 Jul 17 '21 at 10:41
  • Based on the information provided I see no benefit of softplus. – Igor F. Jul 17 '21 at 16:50
  • I agree w/ @Igor F. Also, even though ReLU is not differentiable at x = 0, in practice you can still take a subgradient as your descent direction. I don't know if this impacts your loss function since I don't know what it is. Furthermore, [this might be relevant](https://stats.stackexchange.com/questions/146057/what-are-the-benefits-of-using-relu-over-softplus-as-activation-functions). – tchainzzz Jul 17 '21 at 17:22
  • Cross-posted: https://cs.stackexchange.com/q/142359/755, https://stackoverflow.com/q/68385693/781723, https://stats.stackexchange.com/q/534908/2921. Please [do not post the same question on multiple sites](https://meta.stackexchange.com/q/64068). – D.W. Jul 18 '21 at 04:23
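
To illustrate the point raised in the comments about derivatives of the two activations, here is a small probe (PyTorch assumed; this is only an illustration, not an answer):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)

# SoftPlus is smooth everywhere: its first derivative is sigmoid(x),
# and higher-order derivatives exist as well, e.g. the second
# derivative is sigmoid(x) * (1 - sigmoid(x)).
sp = F.softplus(x)
g_sp, = torch.autograd.grad(sp.sum(), x, create_graph=True)
h_sp, = torch.autograd.grad(g_sp.sum(), x)
print(g_sp)  # ~ tensor([0.2689, 0.5000, 0.7311])
print(h_sp)  # ~ tensor([0.1966, 0.2500, 0.1966])

# ReLU is piecewise linear: its derivative is 0 for x < 0 and 1 for x > 0,
# and undefined at x = 0, where autograd frameworks pick a subgradient
# (PyTorch returns 0 there). Higher-order derivatives are 0 away from x = 0.
r = F.relu(x)
g_r, = torch.autograd.grad(r.sum(), x)
print(g_r)   # tensor([0., 0., 1.])
```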

0 Answers