
I have noticed that PyTorch models perform significantly better when ReLU is used instead of Softplus, with Adam as the optimiser.

How can it be that a non-differentiable function is easier to optimise than an analytic one? Does that mean the optimisation is gradient-based in name only, and some kind of combinatorics is used under the hood?
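
For reference, the two activations being compared are

$$\operatorname{ReLU}(x) = \max(0, x), \qquad \operatorname{Softplus}(x) = \log\left(1 + e^{x}\right),$$

with derivatives $\operatorname{ReLU}'(x) = \mathbf{1}[x > 0]$ (undefined at $x = 0$) and $\operatorname{Softplus}'(x) = \sigma(x) = 1/(1 + e^{-x})$.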

Mike Land

1 Answer


ReLU is generally known to outperform many smoother activation functions. It's easy to optimize because it's half-linear: for positive inputs it is just the identity, so the gradient there is exactly 1 and never saturates. Its advantage is usually speed of convergence, so it may well be that with more iterations, a different learning rate, batch size, or other hyperparameters, you'd get similar results with Softplus.

As for the second part of the question: no, there is no combinatorics under the hood. ReLU is non-differentiable only at the single point $x = 0$; frameworks simply use a subgradient there (0, in PyTorch's case), and training remains ordinary gradient-based optimization.
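
A minimal sketch of the kind of check this suggests, assuming a toy dataset and architecture (nothing below comes from the question itself): train the same small network with Adam, once with ReLU and once with Softplus, then rerun with more steps or a different learning rate to see whether the gap closes.

```python
# Illustrative comparison only: the data, architecture, and hyperparameters
# are assumptions, not taken from the original question.
import torch
import torch.nn as nn

def make_model(activation):
    # Same architecture for both runs; only the activation differs.
    return nn.Sequential(
        nn.Linear(20, 64), activation(),
        nn.Linear(64, 64), activation(),
        nn.Linear(64, 1),
    )

torch.manual_seed(0)
X = torch.randn(1024, 20)
y = (X[:, :5].sum(dim=1, keepdim=True) > 0).float()  # toy binary target

for act in (nn.ReLU, nn.Softplus):
    torch.manual_seed(0)  # identical initialisation for a fair comparison
    model = make_model(act)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for step in range(500):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    print(f"{act.__name__:9s} final training loss: {loss.item():.4f}")
```

If the gap shrinks or disappears when you increase the number of steps or retune the learning rate, that supports the speed-of-convergence explanation above.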

Tim