Is it possible to achieve state-of-the-art results by using back-propagation only (without pre-training)?
Or is it the case that all record-breaking approaches use some form of pre-training?
Is back-propagation alone good enough?
Pre-training is no longer necessary. Its purpose was to find a good initialization for the network weights in order to facilitate convergence when a large number of layers was employed. Nowadays we have ReLU, dropout and batch normalization, all of which help solve the problem of training deep neural networks. Quoting from the reddit post linked above (by the winner of the Galaxy Zoo Kaggle challenge):
I would say that the “pre-training era”, which started around 2006, ended in the early ’10s when people started using rectified linear units (ReLUs), and later dropout, and discovered that pre-training was no longer beneficial for this type of networks.
From the ReLU paper (linked above):
deep rectifier networks can reach their best performance without requiring any unsupervised pre-training
With that said, while pre-training is no longer necessary, it can still improve performance in some cases, for example when unlabeled samples greatly outnumber labeled ones, as shown in this paper.
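To make the point concrete, here is a minimal sketch (assuming PyTorch; the layer sizes, dropout rate and learning rate are arbitrary placeholders) of a deep feed-forward network using ReLU, batch normalization and dropout, trained end-to-end with plain back-propagation from random initialization and no pre-training step at all:

```python
# Minimal sketch: deep net with ReLU + batch norm + dropout,
# trained from scratch with back-propagation only (no pre-training).
import torch
import torch.nn as nn

def hidden_block(in_dim, out_dim, p_drop=0.2):
    # One hidden block: affine -> batch norm -> ReLU -> dropout.
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.ReLU(),
        nn.Dropout(p_drop),
    )

model = nn.Sequential(
    hidden_block(100, 256),
    hidden_block(256, 256),
    hidden_block(256, 256),
    hidden_block(256, 256),
    nn.Linear(256, 10),          # logits for 10 classes
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Toy random data just to exercise the training loop;
# substitute a real dataset in practice.
x = torch.randn(512, 100)
y = torch.randint(0, 10, (512,))

model.train()
for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()              # plain back-propagation
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Networks like this routinely converge from random initialization, which is exactly why layer-wise unsupervised pre-training fell out of use for standard supervised settings.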