33

Why are activation functions of rectified linear units (ReLU) considered non-linear?

$$ f(x) = \max(0,x)$$

They are linear when the input is positive, and from my understanding non-linear activations are a must to unlock the representational power of deep networks; otherwise the whole network could be represented by a single layer.
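To spell out my understanding: stacking two purely linear (affine) layers collapses into a single one,

$$ W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2), $$

so depth alone would add no representational power.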

Aly
  • There's a similar question asked before: https://stats.stackexchange.com/questions/275358/why-is-increasing-the-non-linearity-of-neural-networks-desired though it's probably not a duplicate – Aksakal Mar 21 '18 at 19:19

1 Answer

40

ReLUs are nonlinearities. To help your intuition, consider a very simple network with 1 input unit $x$, 2 hidden units $y_i$, and 1 output unit $z$. With this simple network we could implement an absolute value function,

$$z = \max(0, x) + \max(0, -x),$$

or something that looks similar to the commonly used sigmoid function,

$$z = \max(0, x + 1) - \max(0, x - 1).$$

By combining these building blocks into larger networks with more hidden units, we can approximate arbitrary functions.

[Figure: ReLU network function]
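Here is a minimal NumPy sketch of the two constructions above, written as a one-hidden-layer network with hand-picked weights (the function and variable names are my own, not from the figure):

```python
import numpy as np

def relu(a):
    # Elementwise ReLU: f(a) = max(0, a)
    return np.maximum(0.0, a)

def tiny_relu_net(x, W1, b1, w2, b2):
    # 1 input -> 2 hidden ReLU units -> 1 linear output:
    #   hidden_i = relu(W1_i * x + b1_i),   z = w2 . hidden + b2
    hidden = relu(np.outer(W1, x) + b1[:, None])  # shape (2, len(x))
    return w2 @ hidden + b2                       # shape (len(x),)

x = np.linspace(-3.0, 3.0, 7)

# Absolute value: z = max(0, x) + max(0, -x)
z_abs = tiny_relu_net(x, W1=np.array([1.0, -1.0]), b1=np.array([0.0, 0.0]),
                      w2=np.array([1.0, 1.0]), b2=0.0)

# Sigmoid-like ramp: z = max(0, x + 1) - max(0, x - 1)
z_ramp = tiny_relu_net(x, W1=np.array([1.0, 1.0]), b1=np.array([1.0, -1.0]),
                       w2=np.array([1.0, -1.0]), b2=0.0)

print(np.allclose(z_abs, np.abs(x)))                    # True
print(np.allclose(z_ramp, np.clip(x + 1.0, 0.0, 2.0)))  # True
```

The same template, $z = w_2 \cdot \mathrm{relu}(W_1 x + b_1) + b_2$, with more hidden units and learned rather than hand-picked weights, is what larger ReLU networks use to build piecewise-linear approximations of arbitrary functions.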

Lucas
  • Would these types of hand-constructed ReLUs be built a priori and hard-coded in as layers? If so, how would you know that your network required one of these specially built ReLUs in particular? – Monica Heddneck Sep 16 '16 at 07:53
  • 5
    @MonicaHeddneck You could specify your own non-linearities, yes. What makes one activation function better than another is an ongoing research topic. For example, we used to use sigmoids, $\sigma(x) = \frac{1}{1 + e^{-x}}$, but then, due to the vanishing gradient problem, ReLUs became more popular. So it's up to you which non-linear activation function to use. – Tarin Ziyaee Sep 19 '16 at 21:02
  • 1
    How would you approximate $e^x$ with ReLUs out of sample? – Aksakal Sep 12 '18 at 21:42
  • 1
    @Lucas, So basically, if we combine (add) more than one ReLU, we can approximate any function, but if we simply nest them, `reLu(reLu(....))`, will it always be linear? Also, here you change `x` to `x+1`; could that be thought of as `z = Wx + b`, where `W` and `b` change to give different variants of this kind, such as `x` and `x+1`? – Anu Mar 31 '19 at 00:12