Is initializing the weights of autoencoders still a difficult problem?

Question

I was wondering if initializing the weights of autoencoders is still difficult and what the most recent strategies are for it.

I have been reading different articles. One of Hinton's papers (2006) says:

With large initial weights, autoencoders typically find poor local minima; with small initial weights, the gradients in the early layers are tiny, making it infeasible to train autoencoders with many hidden layers. If the initial weights are close to a good solution, gradient descent works well, but finding such initial weights requires a very different type of algorithm that learns one layer of features at a time. We introduce this "pretraining" procedure for binary data, generalize it to real-valued data, and show that it works well for a variety of data sets.

no. any standard init (glorot / xavier / he) should work fine. — shimao, Nov 11 '19 at 08:31

Danica · Answer 1 · 2019-11-11T16:57:38.787

These layerwise pretraining procedures are mostly not needed anymore, for a few reasons:

Better initialization schemes, e.g. Xavier / Glorot initialization (the same thing, named after Xavier Glorot), as shimao noted. These help avoid exploding or vanishing gradients: essentially, multiplying many numbers significantly greater than one gives a huge result, while multiplying many numbers significantly less than one gives a tiny result. These initialization schemes keep the gradient norms closer to one.
The switch to non-saturating activation functions, like $\operatorname{LReLU}_{0.1}(x) = \begin{cases}x & x \ge 0 \\ 0.1 \, x & x < 0\end{cases}$, rather than saturating ones like $\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)} \in (0, 1)$ that were previously popular. Sigmoids only have a useful gradient signal for $x$ in a fairly narrow range of inputs; when $x$ is too large or too small, the function becomes quite flat. Leaky ReLU has useful signal everywhere, and regular ReLU has useful signal for any positive input.
Batch normalization, and related schemes like weight/layer/spectral/... normalization, also help keep activations in a "nice" regime if you use them.
Architectural innovations like ResNets allow for very deep networks that can still be effectively trained.
There may be more of a trend now (as opposed to 2006) toward using fairly wide hidden layers; this heavily overparameterized regime has been shown over the last few years to be amenable to gradient descent optimization.
Adaptive optimizers, like Adam, may do a better job optimizing than previous algorithms.
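
To make the initialization point concrete, here is a minimal sketch (not taken from any of the papers mentioned) that pushes a random vector through a stack of purely linear layers and reports how much its norm changes. The helper names `glorot_std` and `norm_ratio_after_depth`, the width of 64, and the depth of 10 are all illustrative assumptions. It tracks the forward signal rather than the gradient, but in a linear network the backward pass multiplies by the transposed weights, so the same scaling argument applies.

```python
import math
import random


def glorot_std(fan_in, fan_out):
    """Standard deviation used by Glorot/Xavier normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def norm_ratio_after_depth(weight_std, width=64, depth=10, seed=0):
    """Push a random vector through `depth` linear layers whose weights are
    drawn from N(0, weight_std^2) and return ||output|| / ||input||."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    input_norm = math.sqrt(sum(v * v for v in x))
    for _ in range(depth):
        # One layer: y_i = sum_j W[i][j] * x[j], with W sampled on the fly.
        x = [
            sum(rng.gauss(0.0, weight_std) * xj for xj in x)
            for _ in range(width)
        ]
    output_norm = math.sqrt(sum(v * v for v in x))
    return output_norm / input_norm


ratios = {
    "too small (std=0.01)": norm_ratio_after_depth(0.01),
    "glorot": norm_ratio_after_depth(glorot_std(64, 64)),
    "too large (std=1.0)": norm_ratio_after_depth(1.0),
}
for name, ratio in ratios.items():
    print(f"{name:>22}: {ratio:.3e}")
```

With the small standard deviation the norm collapses by many orders of magnitude, with the large one it blows up, and with the Glorot value it stays within a modest factor of one. In practice you would rely on a framework's built-in initializers rather than hand-rolling this.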

There are still occasional settings where you see people doing layerwise training for one reason or another, but the vast majority of the time, it's not necessary anymore.
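
The saturation argument from the list above can also be checked numerically. The following is a small sketch using only the standard library; the function names are illustrative, and the leaky slope of 0.1 matches the $\operatorname{LReLU}_{0.1}$ definition in the answer. It evaluates the derivative of each activation at a few inputs.

```python
import math


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def leaky_relu_grad(x, slope=0.1):
    """Derivative of leaky ReLU: 1 for x >= 0, `slope` otherwise."""
    return 1.0 if x >= 0 else slope


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(
        f"x={x:+5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
        f"lrelu'={leaky_relu_grad(x):.1f}  relu'={relu_grad(x):.1f}"
    )
```

The sigmoid derivative peaks at 0.25 for $x = 0$ and drops below $10^{-4}$ by $|x| = 10$, while the leaky ReLU derivative never falls below its slope of 0.1.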
