
Solution: for some reason, I had forgotten that the non-linear activation function is applied at every layer of the neural network, not just at the output layer. Hopefully others reading my original question below will understand why I asked it. Thank you for the answers, though.

Original: Suppose I have a multilayer perceptron of a couple of layers whose output nodes are subject to the classic sigmoid activation function. How does this change which output node has the highest value for a given input vector (and is therefore selected as the final classification)? Namely, denoting the sigmoid function as f, since f is monotonically increasing, if x' > x then f(x') > f(x), meaning the same output node will be selected as the final classification either way.
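The monotonicity point can be checked directly. A minimal sketch of the asker's observation, assuming NumPy (the logit values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([0.3, 2.1, -0.5, 1.7])  # arbitrary output-node values

# The sigmoid is monotonically increasing, so it preserves the ordering
# of the outputs: the argmax (the selected class) is unchanged.
print(np.argmax(logits))           # 1
print(np.argmax(sigmoid(logits)))  # 1
```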

I think I am missing something about its importance in gradient descent or in computing the loss, so please point out my error in thinking if you see it.

    Write out a small network with one input variable, two nodes in the hidden layer, one output node, and linear activation functions. Write out the equation that forms. Does it look familiar? – Dave Aug 27 '21 at 22:32
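Working through Dave's suggestion symbolically (a sketch assuming SymPy; the symbol names are illustrative):

```python
import sympy as sp

x, w1, w2, b1, b2, v1, v2, c = sp.symbols("x w1 w2 b1 b2 v1 v2 c")

# Hidden layer with linear (identity) activations:
h1 = w1 * x + b1
h2 = w2 * x + b2

# Linear output node:
y = v1 * h1 + v2 * h2 + c

# Expands to (v1*w1 + v2*w2)*x + (v1*b1 + v2*b2 + c): an ordinary
# linear model in x, no matter how many such layers are stacked.
print(sp.expand(y))
```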

2 Answers


Non-linear activation functions are needed because a linear combination of linear functions is still a linear function. Without non-linear activation functions, a multilayer perceptron cannot learn non-linear relationships from the data; it has essentially the same capabilities as a "network" with only one neuron. So if there is a non-linear relationship between input and output, or there are interactions between variables, the network will not be able to learn these things.
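The collapse can be seen numerically for whole layers. A minimal sketch assuming NumPy (the layer shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2: 4 -> 2

x = rng.normal(size=3)

# Two stacked linear layers...
deep = W2 @ (W1 @ x + b1) + b2
# ...are exactly one linear layer with collapsed parameters.
shallow = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(deep, shallow))  # True
```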

rep_ho
  • I don't feel like this really answers my question. How does the non-linear activation change which classification is chosen, compared to what would be chosen without it? Edit: given that the selection rule is to choose the classification with the highest output node value. – User Aug 27 '21 at 22:55
  • @User If you are asking why the output node has a non-linear activation function rather than a linear one: usually you code your targets as 1 or 0, and with a sigmoid function the network can only output numbers between 0 and 1. With a linear function the outputs are not bounded, but the errors are still computed against your 0-1 coding of the target variable (see the sketch below). – rep_ho Aug 27 '21 at 23:20
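A minimal sketch of the bounded-output point from the comment above, assuming NumPy (the pre-activation values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])  # arbitrary pre-activations

print(sigmoid(z))  # every value lies in (0, 1), comparable to 0/1 targets
print(z)           # a linear output would pass these through unbounded
```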

Your question leads back to the XOR problem, which could not be solved by neural networks without non-linear activation functions; neural networks lost much of their attention because of this, until non-linear activation functions were adopted. A concrete sketch follows the links below.

https://stackoverflow.com/questions/33582251/can-an-ann-2-2-1-layers-be-implemented-to-learn-xor-using-linear-activation-fu

https://www.quora.com/Why-cant-the-XOR-problem-be-solved-by-a-one-layer-perceptron
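To make the XOR point concrete, here is a sketch of a 2-2-1 network with sigmoid activations that computes XOR; the weights are hand-picked for illustration, not learned, and the code assumes NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked weights; the large magnitudes push the sigmoids toward
# 0/1 so the hidden units behave like OR and AND gates.
W1 = np.array([[20.0, 20.0], [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])  # h1 ~ OR(x1, x2), h2 ~ AND(x1, x2)
W2 = np.array([20.0, -20.0])
b2 = -10.0                     # y ~ OR and not AND, i.e. XOR

H = sigmoid(X @ W1.T + b1)
y = sigmoid(H @ W2 + b2)
print(np.round(y, 3))  # approximately [0, 1, 1, 0]
```

With linear activations in place of the sigmoid, no choice of weights can produce this output, because the whole network reduces to a single linear function of the inputs.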

malocho