5

Or, for example, is it good to use an activation function only for the last layer?

As far as I know, if there are no activation functions in a neural network, the feedforward pass is just matrix multiplication, but I don't understand why this is bad.

Dmytro Nalyvaiko

1 Answer

5

Consider a two-layer neural network. Let $x \in \mathbb{R}^n$ be the input vector. A single layer with weight matrix $A$ and bias $b$, and no activation function, computes $$Ax + b.$$ A second layer (again without an activation, with weight matrix $C$ and bias $d$) would then compute $$C(Ax + b) + d = CAx + Cb + d,$$ which is equivalent to a single-layer network with weight matrix $CA$ and bias vector $Cb + d$. It is well known that single-layer neural networks cannot solve some "simple" problems; for example, they cannot solve the XOR problem.
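
As a quick sanity check of the collapse, here is a minimal NumPy sketch (the shapes and random values are arbitrary, chosen only for illustration): the layer-by-layer pass and the collapsed single layer give identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "linear layers" with arbitrary shapes: R^4 -> R^3 -> R^2
A, b = rng.normal(size=(3, 4)), rng.normal(size=3)
C, d = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Layer-by-layer forward pass, no activations
two_layers = C @ (A @ x + b) + d

# Equivalent single layer: weight CA and bias Cb + d
one_layer = (C @ A) @ x + (C @ b + d)

print(np.allclose(two_layers, one_layer))  # True
```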

Suppose we have a single-layer neural network $Ax + b$ that can solve the XOR problem, classifying an input as $1$ when the output is positive and as $0$ otherwise. The matrix is of the form $A = (w_{1}, w_{2})$ since it takes two inputs and produces a single output. The four XOR cases then require

\begin{align}
0w_1 + 0w_2 + b \leq 0 &\iff b \leq 0 \\
0w_1 + 1w_2 + b > 0 &\iff b > -w_2 \\
1w_1 + 0w_2 + b > 0 &\iff b > -w_1 \\
1w_1 + 1w_2 + b \leq 0 &\iff b \leq -w_1 - w_2
\end{align}

Suppose all four conditions hold (which is required to solve the XOR problem). The first line gives $b \leq 0$. Adding the second and third lines gives $2b > -w_1 - w_2$. But the fourth line gives $b \leq -w_1 - w_2$, and since $b \leq 0$ we have $2b \leq b \leq -w_1 - w_2$, which contradicts $2b > -w_1 - w_2$. Hence the single-layer neural network cannot solve the XOR problem.
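
As a purely illustrative check of this, a small randomized search over $(w_1, w_2, b)$ (values and ranges chosen arbitrarily) never finds a combination satisfying all four inequalities:

```python
import numpy as np

rng = np.random.default_rng(0)

def solves_xor(w1, w2, b):
    # The four conditions from the inequalities above
    return (0*w1 + 0*w2 + b <= 0 and
            0*w1 + 1*w2 + b >  0 and
            1*w1 + 0*w2 + b >  0 and
            1*w1 + 1*w2 + b <= 0)

hits = sum(solves_xor(*rng.uniform(-10, 10, size=3)) for _ in range(100_000))
print(hits)  # 0 -- consistent with the contradiction above
```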

Less formally, $Ax + b$ defines a line that separates the plane so that points are classified according to the side of the line they lie on. Try drawing a straight line such that $(0,0)$ and $(1,1)$ are on one side and $(0,1)$ and $(1,0)$ are on the other; you will not be able to.

Introducing non-linear activation functions between the layers allows the network to solve a larger variety of problems. To be more precise, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.
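
For instance, here is one hand-picked set of weights (an illustrative sketch; the specific values and the step activation are just one possible choice) for a 2-2-1 network that does solve XOR once a non-linearity is added between the layers:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)  # non-linear activation

# Hidden layer: first unit fires for OR, second for AND
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([-0.5, -1.5])

# Output layer: OR and not AND == XOR
C = np.array([[1.0, -1.0]])
d = np.array([-0.5])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(A @ np.array(x, dtype=float) + b)
    y = step(C @ h + d)
    print(x, int(y[0]))   # prints 0, 1, 1, 0
```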

HBeel
  • This is a nice explanation, but I still don't understand why 1 layer cannot approximate the XOR function. I cannot imagine why it is so. – Dmytro Nalyvaiko Mar 12 '17 at 17:45
  • I've added a proof @DmitryNalyvaiko – HBeel Mar 12 '17 at 18:21
  • thanks, but when I wrote "1 layer", I meant the layer "CAx + Cb + d" about which you wrote at the beginning. You considered a network with 2 inputs and 1 output. What would the proof be for a network of size [2, 3, 1], where 2 is the input size, 3 the hidden layer size, and 1 the output layer size? – Dmytro Nalyvaiko Mar 12 '17 at 20:20
  • > Less formally, $Ax+b$ defines a line that separates the plane so that points are classified according to the side of the line they lie on. Try drawing a straight line such that $(0,0)$ and $(1,1)$ are on one side and $(0,1)$ and $(1,0)$ are on the other; you will not be able to. How would one do that with a curve? – Dmytro Nalyvaiko Mar 12 '17 at 20:39
  • @DmitryNalyvaiko so if you have [2, 3, 1] you would have $A \in \mathbb{R}^{3 \times 2}$ and $C \in \mathbb{R}^{1 \times 3}$, so $CA \in \mathbb{R}^{1 \times 2}$, and the proof is still valid by calling the elements $CA = (w_1, w_2)$. Similarly for the bias – HBeel Mar 13 '17 at 11:05
  • @DmitryNalyvaiko it's harder to visualise what neural networks are doing when you introduce hidden layers, which is why they are sometimes said to have a "black box" nature. The [TensorFlow playground](http://playground.tensorflow.org) has an XOR dataset where you can visualize what each neuron is doing and the output decision boundaries – HBeel Mar 13 '17 at 11:11
  • @HBeel Given the gradients $\nabla A$ and $\nabla C$, does this mean the total weight after each update will be $(A - \mu\nabla A)(C - \mu\nabla C)$? Assuming both $A$ and $C$ are $n \times n$ matrices, will this neural network perform worse than one with a single linear feed-forward layer in terms of gradient descent convergence? – Minh Khôi Jan 23 '21 at 09:17