How does one initialize neural networks as suggested by Saxe et al. using orthogonal matrices and a gain factor?

Question

I was reading the deep learning book by Bengio, Goodfellow and Courville, and in chapter 8 (the optimization chapter) they mention that Saxe et al. have an initialization based on orthogonal matrices and a gain factor $g$ that depends on the non-linearity. The chapter doesn't actually say how to do this initialization. I tried reading the paper, but it seems to be a bit beyond my level of mathematical sophistication. Does anyone understand how the initialization they are referring to is done?

For example, the questions I would like answered are:

How does one choose orthogonal matrices? Just any K orthogonal matrices for any weight matrices?
How is $g$ chosen depending on the non-linearity?

I probably should have mentioned this, but I am planning to use it with Python/TensorFlow if possible.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Andrew M. Saxe, James L. McClelland, Surya Ganguli

Franck Dernoncourt · Answer 1 · 2016-08-08T02:04:05.877

Here is what Lasagne does; it should answer both of your questions:

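import numpy as np
from lasagne.init import Initializer
from lasagne.random import get_rng
from lasagne.utils import floatX

class Orthogonal(Initializer):
    """Initialize weights as an orthogonal matrix.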
    Orthogonal matrix initialization [1]_. For n-dimensional shapes where
    n > 2, the n-1 trailing axes are flattened. For convolutional layers, this
    corresponds to the fan-in, so this makes the initialization usable for
    both dense and convolutional layers.
    Parameters
    ----------
    gain : float or 'relu'
        Scaling factor for the weights. Set this to ``1.0`` for linear and
        sigmoid units, to 'relu' or ``sqrt(2)`` for rectified linear units, and
        to ``sqrt(2/(1+alpha**2))`` for leaky rectified linear units with
        leakiness ``alpha``. Other transfer functions may need different
        factors.
    References
    ----------
    .. [1] Saxe, Andrew M., James L. McClelland, and Surya Ganguli.
           "Exact solutions to the nonlinear dynamics of learning in deep
           linear neural networks." arXiv preprint arXiv:1312.6120 (2013).
    """
    def __init__(self, gain=1.0):
        if gain == 'relu':
            gain = np.sqrt(2)

        self.gain = gain

    def sample(self, shape):
        if len(shape) < 2:
            raise RuntimeError("Only shapes of length 2 or more are "
                               "supported.")

        flat_shape = (shape[0], np.prod(shape[1:]))
        a = get_rng().normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        # pick the one with the correct shape
        q = u if u.shape == flat_shape else v
        q = q.reshape(shape)
        return floatX(self.gain * q)
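
For readers without Lasagne installed, the same procedure can be sketched in plain NumPy. This is a minimal sketch, not Lasagne's code: the names `gain_for` and `orthogonal` are hypothetical, the `seed` argument is an assumption standing in for Lasagne's `get_rng()`, and the gain values are simply the ones listed in the docstring above.

```python
import numpy as np

def gain_for(nonlinearity, alpha=0.0):
    """Gain values quoted in the Lasagne docstring above (hypothetical helper)."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "relu":
        return np.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return np.sqrt(2.0 / (1.0 + alpha ** 2))
    raise ValueError("no gain listed for %r" % nonlinearity)

def orthogonal(shape, gain=1.0, seed=None):
    """Return gain * Q, where Q has orthonormal rows or columns."""
    if len(shape) < 2:
        raise ValueError("Only shapes of length 2 or more are supported.")
    rng = np.random.default_rng(seed)
    # Flatten the trailing axes so the SVD sees a 2-D matrix
    flat_shape = (shape[0], int(np.prod(shape[1:])))
    a = rng.normal(0.0, 1.0, flat_shape)
    # Thin SVD: u is (m, k) and v is (k, n) with k = min(m, n)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    # Exactly one of u, v has the flattened shape (both do when m == n)
    q = u if u.shape == flat_shape else v
    return gain * q.reshape(shape)

W = orthogonal((64, 32), gain=gain_for("relu"), seed=0)
# Columns are orthonormal up to the gain: W^T W = gain^2 * I
print(np.allclose(W.T @ W, 2.0 * np.eye(32)))
```

So, for the two questions: each weight matrix gets its own independently sampled orthogonal factor of a Gaussian random matrix, and $g$ is a fixed multiplier chosen per non-linearity.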

This RNN tutorial does the same thing (minus the gain):

import numpy

# orthogonal initialization for weights
# see Saxe et al. ICLR'14
def ortho_weight(ndim):
    W = numpy.random.randn(ndim, ndim)
    u, s, v = numpy.linalg.svd(W)
    return u.astype('float32')

So I assume it's correct (I hope so since this is the code I use).
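
A quick numerical check is one way to gain confidence in that assumption. The sketch below re-declares the tutorial's `ortho_weight` so that it runs on its own, then tests two properties an orthogonal matrix should have; the dimension 128 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

def ortho_weight(ndim):
    # Same recipe as the tutorial snippet above
    W = np.random.randn(ndim, ndim)
    u, s, v = np.linalg.svd(W)
    return u.astype('float32')

U = ortho_weight(128)

# Orthogonality: U^T U should be the identity (up to float32 rounding)
err = np.abs(U.T @ U - np.eye(128)).max()

# Norm preservation: ||U x|| should equal ||x||
x = np.random.randn(128).astype('float32')
ratio = np.linalg.norm(U @ x) / np.linalg.norm(x)

print(err, ratio)
```

The second check is the practical point of the scheme: an orthogonal weight matrix neither shrinks nor stretches the vectors passing through it.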

> I probably should have mentioned this, but I am planning to use it with Python/TensorFlow if possible.

In TensorFlow:

import numpy as np
import tensorflow as tf

def orthogonal_initializer(scale=1.1):
    """From Lasagne and Keras. Reference: Saxe et al., http://arxiv.org/abs/1312.6120"""
    print('Warning -- You have opted to use the orthogonal_initializer function')
    def _initializer(shape, dtype=tf.float32):
        # flatten the trailing axes so the SVD sees a 2-D matrix
        flat_shape = (shape[0], np.prod(shape[1:]))
        a = np.random.normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        # pick the one with the correct shape
        q = u if u.shape == flat_shape else v
        q = q.reshape(shape)
        print('you have initialized one orthogonal matrix.')
        # tf.constant converts the float64 NumPy array to the requested dtype
        return tf.constant(scale * q, dtype=dtype)
    return _initializer

So, to explain this without use of code, the weights are initialized as the orthogonal projection of a random matrix? — Sycorax, Aug 08 '16 at 02:38
Reading through the tensorflow docs on [convolutional operations](https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#conv2d), it seems the `flat_shape` should be `(shape[:-1], shape[-1])` because tensorflow uses a different format to encode its data (number of filters as the last dimension). Thoughts? — Till Hoffmann, Oct 24 '16 at 07:34
And that should of course be `(np.prod(shape[:-1]), shape[-1])`, but I can no longer edit the comment. — Till Hoffmann, Oct 24 '16 at 08:05
@TillHoffmann Unsure, I haven't had much experience with TensorFlow so far. — Franck Dernoncourt, Oct 24 '16 at 13:35
@FranckDernoncourt Can we use the above to find a vector orthogonal to an existing vector of, let's say, 10000 dimensions? Any hints would be very helpful. — Pradi KL, May 21 '20 at 03:05
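
Till Hoffmann's suggestion about the filter layout can be sketched in NumPy as follows. This is only an illustration, assuming a kernel shape of `(height, width, in_channels, out_channels)` with the output axis last; the function name `orthogonal_last_axis` is hypothetical, and the code has not been checked against any particular TensorFlow version.

```python
import numpy as np

def orthogonal_last_axis(shape, scale=1.0, seed=None):
    """Orthogonal init that treats the last axis as the output axis."""
    rng = np.random.default_rng(seed)
    # Flatten all leading axes (the fan-in) and keep the last axis separate
    flat_shape = (int(np.prod(shape[:-1])), shape[-1])
    a = rng.normal(0.0, 1.0, flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    # pick the factor with the flattened shape
    q = u if u.shape == flat_shape else v
    return scale * q.reshape(shape)

kernel = orthogonal_last_axis((3, 3, 16, 32), seed=0)
flat = kernel.reshape(-1, 32)  # (144, 32): one column per output filter
# The 32 filters are mutually orthonormal
print(np.allclose(flat.T @ flat, np.eye(32)))
```

With this flattening, each output filter is one column of the 2-D matrix, so the orthogonality holds between filters rather than between slices along the first spatial axis.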
