
I am doing Andrew Ng's Deep Learning course and he says that ReLU is better than sigmoid, but it makes no sense to me at all.

The biggest advantage of an activation function is that it is non-linear: when we stack layers, the cascade is then non-linear too, so the network can learn non-linear functions.

But I can't understand why ReLU is any better, since it is still a straight line - just two straight lines. The function is discontinuous at zero too, from what I understand?

It just makes no sense to me why ReLU would be a good function.

Edit: Nothing about ReLU makes sense to me. It is two straight lines, made worse by one half being forced to zero, and I cannot understand why we'd want to force $g(z) = 0 \;\forall\; z < 0$. I can't see how this is any better than $g(z) = z$.

  • Hi @mr-johnny-doe! Do not hesitate to comment on our answers if you need more clarification... – meduz Mar 17 '21 at 11:51

2 Answers


As can be seen in the PyTorch documentation for the ReLU, this is not a straight line, and it is not discontinuous at zero - two important points not to miss:

[Plot of the ReLU function, from the PyTorch documentation]

First, if it were linear, then cascading it with linear operations (such as a convolution) would just give yet another linear operator, so you could collapse the whole cascade into a shallow one-layer transform. It is essential to understand that the goal is to create the most diverse "space" of functions, and non-linearities such as the ReLU are essential for this. (Figure 1 of this review may help you understand why non-linearities are essential.)
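Here is a minimal sketch of that first point (using PyTorch, purely as an illustration; the layer sizes are arbitrary): two stacked linear layers with no non-linearity in between are exactly one linear layer in disguise, whereas a ReLU in between breaks this equivalence.

```python
# Sketch: cascading linear maps collapses to one linear map; ReLU prevents this.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)

lin1 = nn.Linear(3, 5, bias=False)
lin2 = nn.Linear(5, 2, bias=False)

# Collapse the cascade into a single linear map: W = W2 @ W1
collapsed = nn.Linear(3, 2, bias=False)
with torch.no_grad():
    collapsed.weight.copy_(lin2.weight @ lin1.weight)

print(torch.allclose(lin2(lin1(x)), collapsed(x), atol=1e-6))   # True: still linear
print(torch.allclose(lin2(torch.relu(lin1(x))), collapsed(x)))  # False: ReLU made it non-linear
```

The first comparison prints True because the two weight matrices multiply out to a single one; the second prints False because the ReLU zeroes some intermediate values, which no single linear map can reproduce.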

Second, it is not discontinuous at zero - that would create problems for computing the gradients that are essential to perform backpropagation. Instead, it is continuous at zero, and its derivative (which is discontinuous at zero) is easily computed: zero for negative values and one for positive values.
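A quick check of this point (again a PyTorch sketch, not part of the original answer): the function itself has no jump at zero, while its gradient is 0 on the negative side and 1 on the positive side, with the value at exactly zero taken as 0 by convention.

```python
# Sketch: ReLU is continuous at zero; only its derivative jumps there.
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)
y = torch.relu(z)
y.sum().backward()

print(y)       # values: 0.0, 0.0, 0.0, 0.5, 2.0  -> no jump at zero
print(z.grad)  # values: 0.,  0.,  0.,  1.,  1.   -> the derivative jumps at zero
```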

Hope this helps!

Edit: Following other answers to similar questions, another reason the ReLU non-linearity is popular is that it helps overcome the vanishing gradient problem. Indeed, when using a saturating smooth function such as the sigmoid, the gradient computed in back-propagation (from the last layer to the first) can become very small for deep networks, slowing down learning in the first layers. The ReLU (among other mechanisms, such as batch normalization) can help overcome this problem due to its shape.
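A rough way to see this effect (a sketch assuming PyTorch; the depth, width, and default initialization are arbitrary choices): compare the gradient that reaches the first layer of a deep stack when the non-linearity is a sigmoid versus a ReLU.

```python
# Sketch: gradient magnitude at the first layer of a deep Linear+activation stack.
import torch
import torch.nn as nn

def first_layer_grad(make_activation, depth=20, width=16, seed=0):
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), make_activation()]
    net = nn.Sequential(*layers)
    net(torch.randn(8, width)).sum().backward()
    return net[0].weight.grad.abs().mean().item()

print("sigmoid:", first_layer_grad(nn.Sigmoid))  # typically many orders of magnitude smaller...
print("relu:   ", first_layer_grad(nn.ReLU))     # ...than with ReLU
```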

meduz
  • Because of "Second" I could write my post a little bit shorter ;D – Patrick Bormann Mar 10 '21 at 13:26
  • I stumbled on this question while writing another one ;-) – meduz Mar 10 '21 at 13:33
  • I get that it may not be a straight line. But I can't associate any intuition with ReLU. max(0,Z) just feels... wrong? Like something about it rubs me as being bad math. A good chunk of values remain untouched while another chunk of values are saturated to zero. It is very prone to exploding gradients too since the values don't saturate when Z>0. – Mr. Johnny Doe Mar 10 '21 at 13:35
  • The only good feature about ReLU from what I can see is that at least its slope is very easy to calculate. – Mr. Johnny Doe Mar 10 '21 at 13:36
  • @Mr.JohnnyDoe Don't forget about the bias: You don't necessarily throw away a lot at all. By changing the bias, the threshold where values are set to zero also changes. – Frans Rodenburg Mar 10 '21 at 13:39
  • that's one point - but the main one is that it *works very well* in many cases. it is always a good exercise to retrain a network with a different NL and compare results. – meduz Mar 10 '21 at 13:40
  • another point is that this function models well the rectifying behavior of *real* biological neurons when you consider the spiking firing rate which is either zero (for an inhibitory input current) or positive (for a positive synaptic input current) – meduz Mar 10 '21 at 13:44
  • So does that just mean I need to look at LeakyReLU and ReLU as something that makes no sense, but just works? It works well from what everyone says, but my brain refuses to accept it. – Mr. Johnny Doe Mar 10 '21 at 13:46
  • About the bias thing, isn't that bad? I mean, if the bias can change that much (I guess it could for sigmoid too), doesn't that just mean less dependence on the data and more on some arbitrary constant? – Mr. Johnny Doe Mar 10 '21 at 13:47
  • It seems we had this question once: https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks – Patrick Bormann Mar 10 '21 at 13:48
  • yes, and that the responses to it insisted more on the resolution to the vanishing gradient problem – meduz Mar 10 '21 at 13:53

You should not think of it as simply transforming your values with a ReLU function. You use ReLU to, well, create several bent lines that get "added together", so that the network can fit many different kinds of functions. And it is much faster to compute than the sigmoid. See the video below for an explanation of ReLU, and the small sketch after it:

ReLU explained by Josh Starmer on StatQuest: https://www.youtube.com/watch?v=68BZ5f7P94E
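To make the "lines added together" idea concrete, here is a small sketch (plain NumPy, with hand-picked weights chosen only for illustration): a weighted sum of shifted ReLUs is a piecewise-linear function, and with enough pieces it can fit many shapes.

```python
# Sketch: summing shifted, scaled ReLUs gives a bumpy piecewise-linear curve.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.linspace(-3, 3, 7)

# Each term adds one "bend"; together they form a non-linear curve.
approx = 1.0 * relu(x + 2) - 2.0 * relu(x) + 1.5 * relu(x - 1)

for xi, yi in zip(x, approx):
    print(f"x = {xi:+.1f} -> {yi:+.2f}")
```

With a sigmoid you would get smooth bumps instead of straight segments, but the "combine simple pieces" idea is the same; ReLU just makes each piece very cheap to compute.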

Patrick Bormann
  • I love Josh, but ReLU is the least intuitive function I have ever seen in math and machine learning. It treats values before zero and after zero very differently. Sigmoid makes sense, 1 and 0 relate to class labels. Tanh relates to how we normalise data. But ReLU just doesn't make sense to me. – Mr. Johnny Doe Mar 10 '21 at 13:34