
I understand that pretraining is used to avoid some of the issues with conventional training. If I use backpropagation with, say, an autoencoder, I know I'm going to run into time issues because backpropagation is slow, and also that I can get stuck in local optima and fail to learn certain features.

What I don't understand is how we pretrain a network and what specifically we do to pretrain. For example, if we're given a stack of restricted Boltzmann Machines, how would we pretrain this network?

Ferdi
Michael Yousef
  • Unless you are in a setting with only a few labeled and many unlabeled samples, pretraining is considered obsolete. If that is not the case, using a rectifier transfer function $f(x) = \max(x, 0)$ and advanced optimisers (rmsprop, adadelta, adam) works equally well for deep neural networks. – bayerj Apr 22 '15 at 18:25
  • Yeah, I'm working under an assumption that there's a large amount of unlabeled samples and few to no labeled samples. – Michael Yousef Apr 22 '15 at 18:29

3 Answers


You start by training each RBM in the stack separately and then combine them into a new model, which can be further fine-tuned.

Suppose you have 3 RBMs. You train RBM1 with your data (e.g., a bunch of images), train RBM2 with RBM1's output, and train RBM3 with RBM2's output. The idea is that each RBM models features representative of the images, and the weights it learns in doing so are useful for other discriminative tasks such as classification.
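A minimal sketch of this stacking, assuming scikit-learn's BernoulliRBM; the data, layer sizes, and learning rates below are illustrative assumptions, not values from the answer:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Unlabeled data, e.g. flattened images scaled to [0, 1] (shape is an assumption).
X = np.random.rand(1000, 784)

rbm1 = BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=10, random_state=0)
rbm2 = BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=10, random_state=0)
rbm3 = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)

h1 = rbm1.fit_transform(X)   # RBM1 is trained on the raw data
h2 = rbm2.fit_transform(h1)  # RBM2 is trained on RBM1's hidden activations
h3 = rbm3.fit_transform(h2)  # RBM3 is trained on RBM2's hidden activations

# rbm1.components_, rbm2.components_, rbm3.components_ now hold the pretrained
# weights, which can initialize a deep network or feed a downstream classifier.
```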

mnagaraj

Pretraining a stacked RBM means greedily minimizing the defined energy layer by layer, i.e., maximizing the likelihood. G. Hinton proposed the contrastive divergence algorithm (CD-k), which approximates the likelihood gradient with only k steps of Gibbs sampling (often a single step, CD-1) instead of running the chain to equilibrium.
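A minimal NumPy sketch of one CD-1 update for a single RBM layer; the shapes, learning rate, and toy batch below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One contrastive-divergence (CD-1) step on a batch of visible vectors v0."""
    # Positive phase: hidden probabilities and samples given the data.
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Negative phase: one step of Gibbs sampling back to the visible layer.
    p_v1 = sigmoid(h0 @ W.T + b)
    p_h1 = sigmoid(p_v1 @ W + c)

    # Approximate likelihood gradient: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b += lr * (v0 - p_v1).mean(axis=0)
    c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c

# Toy usage: 6 visible units, 4 hidden units, a batch of 10 binary vectors.
W = 0.01 * rng.standard_normal((6, 4))
b = np.zeros(6)
c = np.zeros(4)
v0 = (rng.random((10, 6)) < 0.5).astype(float)
W, b, c = cd1_update(v0, W, b, c)
```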

Mou
  • So pretraining the stacked RBM allows us to minimize the defined energy and get better results, and Hinton's contrastive divergence algorithm is how we would actually pretrain. How exactly does pretraining factor into learning extra features? As for the speed issue, I assume the CD algorithm is much faster than backpropagation. – Michael Yousef Apr 22 '15 at 18:28

Pretraining is a multi-stage learning strategy in which a simpler model is trained before training of the desired, more complex model is performed.

In your case, pretraining with restricted Boltzmann machines is a form of greedy layer-wise unsupervised pretraining: you train one RBM at a time, keeping the previously pretrained layers fixed.
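A minimal two-stage sketch of this idea using scikit-learn, assuming the digits dataset and an arbitrary hidden size: the RBM is first fit unsupervised, then a classifier is trained on its frozen features (full backprop fine-tuning of all layers is not shown):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1] for the Bernoulli RBM

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stage 1: the RBM is fit unsupervised on X.
# Stage 2: the classifier is fit on the RBM's hidden representation.
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```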

Pretraining helps both with optimization (it gives the subsequent fine-tuning a good starting point) and with generalization (it acts as a regularizer, which matters especially when labeled data are scarce).

Reference:

Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Lerner Zhang