
It is often mentioned that rectified linear units (ReLU) have superseded softplus units because they are piecewise linear and faster to compute.

Does softplus still have the advantage of inducing sparsity, or is that restricted to the ReLU?

The reason I ask is that I wonder about the negative consequences of the zero slope of the ReLU. Doesn't this property "trap" units at zero, where it might be beneficial to give them the possibility of reactivation?
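To make that concrete, here is a minimal NumPy sketch of the hard zero-gradient region, using the standard ReLU definition (purely illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Zero slope everywhere left of the origin: a unit whose
    # pre-activation stays negative receives no gradient signal.
    return (x > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 1. 1.]
```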

amoeba
brockl33

3 Answers


I found an answer to your question in Section 6.3.3 of the Deep Learning book (Goodfellow et al., 2016):

The use of softplus is generally discouraged. ... one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

As a reference to support this claim they cite the paper Deep Sparse Rectifier Neural Networks (Glorot et al., 2011).
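For reference, here is a minimal sketch of the two activations being compared, using their standard definitions (illustrative only, not code from the book):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # log(1 + exp(x)), written in a numerically stable form
    return np.logaddexp(0.0, x)

x = np.linspace(-4.0, 4.0, 9)
print(relu(x))      # exact zeros for x <= 0
print(softplus(x))  # small but strictly positive values for x < 0
# softplus'(x) = sigmoid(x): differentiable everywhere, and it approaches
# 0 or 1 only asymptotically (the "saturating less completely" point above).
```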

Alexander Shchur
    I'd be wary of such empirical claims about things like activation functions. Maybe they'll have aged well in the past ten years; maybe they haven't. ReLUs are generally fine in wide layers like you'd find in neural networks, where the contribution of individual neurons is generally quite non-critical and you can play a shotgun approach to them dying. If you are dealing with models that have only a handful of neurons per layer, you'd better stay away from any activation functions with vanishing gradients at all. – Eelco Hoogendoorn Apr 22 '21 at 18:36

ReLUs can indeed be permanently switched off, particularly under high learning rates. This is a motivation behind the leaky ReLU and ELU activations, both of which have a non-zero gradient almost everywhere.

Leaky ReLU is a piecewise linear function, just like ReLU, so it is quick to compute. ELU has the advantage over softplus and ReLU that its mean output is closer to zero, which improves learning.
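A quick sketch of these two variants under their usual textbook definitions (the alpha values below are common defaults, chosen here only for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small but non-zero slope for x < 0, so a "dead" unit can still recover.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smoothly saturates towards -alpha for very negative x; the mean output
    # sits closer to zero than ReLU's, which is the claimed learning benefit.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(z))  # [-0.03 -0.01  0.    1.    3.  ]
print(elu(z))         # [-0.95 -0.632 0.    1.    3.  ]
```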

Hugh Perkins
    "almost everywhere" is a technical term that means something like "except at a few infinitely small points". For example, leaky ReLU has no gradient defined at x=0. – Hugh Perkins Oct 21 '19 at 16:07

The main reason ReLU works better than softplus is that ReLU induces sparsity in the model. This means that some of the neurons output exactly zero and therefore have no effect on the next layers; the idea is something like dropout. Neurons in hidden layers learn hidden concepts, and if the input does not contain the corresponding concept, some neurons output zero and are not engaged in the calculations of the next layers. Softplus cannot do this, because its output is never exactly zero, unlike ReLU.
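To make the sparsity point concrete, here is a minimal sketch (assuming standard-normal pre-activations, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1000)             # hypothetical pre-activations of one layer

relu_out = np.maximum(0.0, z)
softplus_out = np.logaddexp(0.0, z)   # log(1 + exp(z))

print((relu_out == 0).mean())      # roughly 0.5: these units are truly inactive
print((softplus_out == 0).mean())  # 0.0: every unit still contributes a little
```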