I built an MLP to classify instances of the Fashion-MNIST dataset. You can run/modify the code in this Google Colab notebook.
When the features are downscaled by a factor of 255 (`feature_scale_factor=255.0`) and the weights of the first dense layer are initialized via Glorot initialization with default settings (`weight_scale_factor=1.0`), the network converges quickly.

When the features are not downscaled (`feature_scale_factor=1.0`) and the initial weights are instead downscaled by a factor of 255 (`weight_scale_factor=255.0`), the network does not converge (or rather converges extremely slowly).
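For reference, the two setups differ only in two constants (same names as in the code further down):

# Setup 1 (converges fast):     feature_scale_factor = 255.0, weight_scale_factor = 1.0
# Setup 2 (barely converges):   feature_scale_factor = 1.0,   weight_scale_factor = 255.0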
Quoting Sycorax says Reinstate Monica's answer to this question:
If we apply scaling so that inputs are $X_{ij}\in [0,1]$, then activations for the first layer during the first iteration are $$X\theta^{(1)} + \beta^{(1)}$$
and at convergence are $$X\theta^{(n)} + \beta^{(n)},$$ where the weights are $\theta$ and the bias is $\beta$.
Network initialization draws values from some specific distribution, usually concentrated in a narrow interval around 0. If you don't apply scaling, then activations for the first layer during the first iteration are
$$255\cdot X\theta^{(1)} + \beta^{(1)}$$
So the effect of multiplying by the weights is obviously 255 times as large.
Given that the initial weights in the second scenario are downscaled by that same factor of 255, should the convergence behaviour of the network not be the same in both scenarios?
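To illustrate the premise of the question, here is a minimal NumPy sketch (not part of the notebook; the batch, weight matrix, and bias below are made-up stand-ins) checking that, at initialization, dividing the inputs by 255 or dividing the Glorot-initialized weights by 255 yields the same first-layer pre-activations:

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(32, 784)).astype(np.float32)                          # one batch of raw pixels
W = rng.normal(0.0, np.sqrt(2.0 / (784 + 300)), size=(784, 300)).astype(np.float32)  # Glorot-style draw
b = np.zeros(300, dtype=np.float32)

z_scaled_features = (X / 255.0) @ W + b   # scenario 1: downscale the features
z_scaled_weights  = X @ (W / 255.0) + b   # scenario 2: downscale the initial weights

print(np.allclose(z_scaled_features, z_scaled_weights, atol=1e-5))                   # -> True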
Here is the code you'll find in the notebook:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.optimizers import SGD
# Get fmnist dataset
feature_scale_factor = 255.0 # Model converges with scale factor of 255.0
(X_opt, y_opt), (_, _) = fashion_mnist.load_data()
X_train, X_val = X_opt[:55000] / feature_scale_factor, X_opt[55000:] / feature_scale_factor
y_train, y_val = y_opt[:55000], y_opt[55000:]
fmnist_train = tf.data.Dataset.from_tensor_slices((X_train, y_train))
fmnist_train = fmnist_train.shuffle(5000).batch(32, drop_remainder=True)
fmnist_val = tf.data.Dataset.from_tensor_slices((X_val, y_val))
fmnist_val = fmnist_val.shuffle(5000).batch(32, drop_remainder=True)
print('\nDataset batch structure:')
print(fmnist_train.element_spec[0])
def my_glorot_initializer(shape, dtype=tf.float32):
    weight_scale_factor = 1.0  # set to 255.0 for the non-converging scenario
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype) / weight_scale_factor
# Build model
mlp = Sequential([
    Flatten(input_shape=[28, 28], name='Flatten'),
    Dense(300, activation='relu', kernel_initializer=my_glorot_initializer, name='Input_Layer'),
    Dense(100, activation='relu', name='H1'),
    Dense(10, activation='softmax', name='Output_Layer')
], name='MLP')
print()
mlp.summary()
mlp.save_weights('model.h5')
# Compile Model
mlp.compile(loss='sparse_categorical_crossentropy',
            optimizer=SGD(learning_rate=0.1),
            metrics=['accuracy'])
mlp.load_weights('model.h5') # reset model to initialization state
history = mlp.fit(fmnist_train,
                  epochs=2,
                  validation_data=fmnist_val)