Or, for example, is it good to use an activation function only for the last layer?
As I understand it, if there are no activation functions in a neural network, the feedforward pass reduces to simple matrix multiplication, but I don't understand why this is bad.
Consider a two-layer neural network. Let $x \in \mathbb{R}^n$ be your input vector. A single layer without an activation function, with weight matrix $A$ and bias $b$, would compute $$Ax + b.$$ A second layer (again without an activation, with weight matrix $C$ and bias $d$) would then compute $$C(Ax + b) + d,$$ which is equal to $$CAx + Cb + d.$$ This is equivalent to a single-layer neural network with weight matrix $CA$ and bias vector $Cb + d$. It is well known that single-layer neural networks cannot solve some "simple" problems; for example, they cannot solve the XOR problem.
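A quick numerical sketch of this collapse (the layer sizes and random values below are arbitrary choices for illustration, not anything from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 4, 3, 2  # input, hidden, and output sizes (hypothetical)

x = rng.normal(size=n)                                # input vector x
A, b = rng.normal(size=(h, n)), rng.normal(size=h)    # first layer:  Ax + b
C, d = rng.normal(size=(m, h)), rng.normal(size=m)    # second layer: C(.) + d

two_layers = C @ (A @ x + b) + d          # C(Ax + b) + d
one_layer  = (C @ A) @ x + (C @ b + d)    # (CA)x + (Cb + d)

print(np.allclose(two_layers, one_layer))  # True: the composition is still affine
```

No matter how many activation-free layers you stack, the whole network stays a single affine map.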
Suppose we have a single-layer neural network $Ax + b$ that can solve the XOR problem. The matrix is of the form $A = (w_{1}, w_{2})$, since it takes two inputs and outputs a single value. Then
\begin{align}
0w_1 + 0w_2 + b \leq 0 &\iff b \leq 0 \\
0w_1 + 1w_2 + b > 0 &\iff b > -w_2 \\
1w_1 + 0w_2 + b > 0 &\iff b > -w_1 \\
1w_1 + 1w_2 + b \leq 0 &\iff b \leq -w_1 - w_2
\end{align}
Suppose all of the left-hand sides hold, which is required to solve the XOR problem. The first line gives $b \leq 0$. Adding the second and third lines gives $2b > -w_1 - w_2$. On the other hand, the fourth line together with $b \leq 0$ gives $2b \leq b \leq -w_1 - w_2$, which is a contradiction. Hence a single-layer neural network cannot solve the XOR problem.
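As a quick numerical illustration of this (not a proof), here is a sketch that samples random weights $(w_1, w_2)$ and biases $b$ and checks whether any of them satisfy all four XOR constraints under the "output $> 0$ means class 1" convention used above; the sampling range and count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                # XOR targets

def solves_xor(w1, w2, b):
    scores = X @ np.array([w1, w2]) + b   # single-layer output Ax + b
    preds = (scores > 0).astype(int)      # classify by the sign of the output
    return np.array_equal(preds, y)

candidates = rng.uniform(-10, 10, size=(100_000, 3))
print(any(solves_xor(w1, w2, b) for w1, w2, b in candidates))  # False
```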
Less formally, $Ax+b$ defines a line that separates the plane, and points are classified according to which side of the line they lie on. Try drawing a straight line such that $(0,0)$ and $(1,1)$ are on one side and $(0,1)$ and $(1,0)$ are on the other; you will not be able to.
Introducing non-linear activation functions between the layers allows the network to solve a much larger variety of problems. To be more precise, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.
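To tie this back to XOR, here is a minimal sketch showing that a single ReLU hidden layer is already enough; the weights are hand-picked for illustration (not learned), using the identity $\mathrm{XOR}(x_1, x_2) = \mathrm{relu}(x_1 + x_2) - 2\,\mathrm{relu}(x_1 + x_2 - 1)$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])
C = np.array([1.0, -2.0])
d = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(A @ np.array(x) + b)   # non-linear hidden layer
    y = C @ h + d                   # linear output layer
    print(x, int(y))                # prints the XOR truth table
```

The same output layer that was useless on its own becomes sufficient once the hidden layer applies a non-linearity, which is exactly what the collapse argument above rules out for purely linear stacks.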