ReLU (Rectified Linear Unit) and tanh are both non-linear activation functions applied to a neural network layer. Each has its place; the right choice depends on the problem at hand and the output we want. People often prefer ReLU over tanh because ReLU is cheaper to compute.
When I started studying Deep Learning, I asked myself: why not just use a linear activation function instead of a non-linear one? The answer is that the output would then be nothing more than a linear combination of the input, so the hidden layers would add no expressive power and would not be able to learn useful features.
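You can verify this collapse directly. Here is a quick NumPy check (the layer sizes and random weights are arbitrary, purely for illustration): two stacked layers with a linear (identity) activation reduce to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with a *linear* (identity) activation.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers, no non-linearity in between.
h = W1 @ x + b1
y = W2 @ h + b2

# The same mapping collapses into one linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
y_single = W @ x + b

print(np.allclose(y, y_single))  # True: the hidden layer added nothing
```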
For example, if we want the output to lie within (-1, 1), we use tanh. If we need an output between (0, 1), we use the sigmoid function. ReLU outputs max{0, x}. There are many other activation functions as well, such as leaky ReLU.
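For reference, here is a minimal NumPy sketch of these four functions (the leaky ReLU slope alpha=0.01 is just a common illustrative default, not something fixed by the original answer):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                       # output in (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))         # output in (0, 1)

def relu(x):
    return np.maximum(0, x)                 # max{0, x}

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # small slope for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), sigmoid(x), relu(x), leaky_relu(x), sep="\n")
```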

Choosing the right activation function for your purpose is largely a matter of experimentation and practice, which is known as tuning in the data science world.
In your case, you may also need to tune your hyperparameters (often just called parameter tuning), such as the number of neurons in the hidden layers, the number of layers, and so on; see the sketch below.
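Here is a minimal sketch of what such tuning can look like in Keras, assuming a binary classification setup. The function name build_model, the feature count n_features, and the candidate values for layers and neurons are my own illustrative choices, not part of the original question.

```python
import tensorflow as tf

def build_model(n_layers=2, n_neurons=32, n_features=10):
    """Small feed-forward net; n_layers and n_neurons are the knobs to tune."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(n_features,))])
    for _ in range(n_layers):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # binary output
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Try a few configurations and keep whichever validates best.
for layers in (1, 2, 3):
    for neurons in (16, 32, 64):
        model = build_model(n_layers=layers, n_neurons=neurons)
        print(layers, neurons, model.count_params())
        # model.fit(X_train, y_train, validation_data=(X_val, y_val), ...)
```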
Does a ReLU layer work well for a shallow network?
Yes, of course. A ReLU layer works well for a shallow network, as the small example below suggests.
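A quick illustrative experiment (the dataset, layer size, and settings are my assumptions, not from the original question): a single hidden layer of ReLU units solves a simple non-linear toy problem.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy non-linear problem for a shallow (one hidden layer) ReLU network.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,),  # one hidden layer of 16 units
                    activation="relu",
                    max_iter=1000,
                    random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```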