I have noticed that my PyTorch models perform significantly better when ReLU is used instead of Softplus, with Adam as the optimiser.
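A minimal sketch of the kind of comparison I mean (the toy data, architecture and hyperparameters here are just for illustration, not my actual setup):

```python
import torch
import torch.nn as nn

def train(activation, steps=500, seed=0):
    torch.manual_seed(seed)
    # toy regression data (hypothetical; any dataset shows the same pattern for me)
    x = torch.randn(1024, 10)
    y = (x ** 2).sum(dim=1, keepdim=True)

    model = nn.Sequential(
        nn.Linear(10, 64),
        activation,
        nn.Linear(64, 64),
        activation,
        nn.Linear(64, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # identical training loop; only the activation differs
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

print("ReLU final loss:    ", train(nn.ReLU()))
print("Softplus final loss:", train(nn.Softplus()))
```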
How can it be that a non-differentiable function is easier to optimise than an analytic one? Does this mean that gradient optimisation is gradient-based in name only, and that some kind of combinatorics is used under the hood?