I built an MLP to classify instances of the Fashion-MNIST dataset. You can run/modify the code in this Google Colab notebook.
When the features are downscaled by a factor of 255 (`feature_scale_factor=255.0`) and the weights of the first dense layer are initialized via Glorot initialization with default settings (`weight_scale_factor=1.0`), the network converges quickly.

When the features are not downscaled (`feature_scale_factor=1.0`) and the initial weights are instead downscaled by a factor of 255 (`weight_scale_factor=255.0`), the network does not converge (or rather converges extremely slowly).
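For reference, the two setups differ only in two constants (same names as in the code further down):

# Setup 1 (converges fast):     feature_scale_factor = 255.0, weight_scale_factor = 1.0
# Setup 2 (barely converges):   feature_scale_factor = 1.0,   weight_scale_factor = 255.0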
Quoting Sycorax says Reinstate Monica's answer to this question:
If we apply scaling so that inputs are $X_{ij}\in [0,1]$, then activations for the first layer during the first iteration are $$X\theta^{(1)} + \beta^{(1)}$$
and at convergence are $$X\theta^{(n)} + \beta^{(n)},$$ where the weights are $\theta$ and the bias is $\beta$.
Network initialization draws values from some specific distribution, usually concentrated in a narrow interval around 0. If you don't apply scaling, then activations for the first layer during the first iteration are
$$255\cdot X\theta^{(1)} + \beta^{(1)}$$
So the effect of multiplying by the weights is obviously 255 times as large.
Given that the initial weights in the second scenario are downscaled by that same factor of 255, should the convergence behaviour of the network not be the same in both scenarios?
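To illustrate the premise of the question, here is a minimal NumPy sketch (not part of the notebook; the batch, weight matrix, and bias below are made-up stand-ins) checking that, at initialization, dividing the inputs by 255 or dividing the Glorot-initialized weights by 255 yields the same first-layer pre-activations:

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(32, 784)).astype(np.float32)                          # one batch of raw pixels
W = rng.normal(0.0, np.sqrt(2.0 / (784 + 300)), size=(784, 300)).astype(np.float32)  # Glorot-style draw
b = np.zeros(300, dtype=np.float32)

z_scaled_features = (X / 255.0) @ W + b   # scenario 1: downscale the features
z_scaled_weights  = X @ (W / 255.0) + b   # scenario 2: downscale the initial weights

print(np.allclose(z_scaled_features, z_scaled_weights, atol=1e-5))                   # -> True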
Here is the code you'll find in the notebook:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.optimizers import SGD
# Get fmnist dataset
feature_scale_factor = 255.0 # Model converges with scale factor of 255.0
(X_opt, y_opt), (_, _) = fashion_mnist.load_data()
X_train, X_val = X_opt[:55000] / feature_scale_factor, X_opt[55000:] / feature_scale_factor
y_train, y_val = y_opt[:55000], y_opt[55000:]
fmnist_train = tf.data.Dataset.from_tensor_slices((X_train, y_train))
fmnist_train = fmnist_train.shuffle(5000).batch(32, drop_remainder=True)
fmnist_val = tf.data.Dataset.from_tensor_slices((X_val, y_val))
fmnist_val = fmnist_val.shuffle(5000).batch(32, drop_remainder=True)
print('\nDataset batch structure:')
print(fmnist_train.element_spec[0])
def my_glorot_initializer(shape, dtype=tf.float32):
    weight_scale_factor = 1.0  # set to 255.0 for the non-converging scenario
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype) / weight_scale_factor
# Build model
mlp = Sequential([
    Flatten(input_shape=[28, 28], name='Flatten'),
    Dense(300, activation='relu', kernel_initializer=my_glorot_initializer, name='Input_Layer'),
    Dense(100, activation='relu', name='H1'),
    Dense(10, activation='softmax', name='Output_Layer')
], name='MLP')
print()
mlp.summary()
mlp.save_weights('model.h5')
# Compile Model
mlp.compile(loss='sparse_categorical_crossentropy',
            optimizer=SGD(learning_rate=0.1),
            metrics=['accuracy'])
mlp.load_weights('model.h5') # reset model to initialization state
history = mlp.fit(fmnist_train,
                  epochs=2,
                  validation_data=fmnist_val)