Why is this toy example so difficult for a neural network to learn? My guess is that the output of the first hidden layer is not normalized, so the propagated gradient is not very stable. I've tried adding a BatchNormalization layer between the two linear layers, but it has no visible effect on the optimization.
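This is roughly what that attempt looked like (a minimal sketch; it reuses x_normed and y from the example code below, and the km_bn name and the input_shape=(1,) argument are just for illustration):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

# Same two-linear-layer model as in the example code, with
# BatchNormalization inserted between the two Dense layers.
km_bn = Sequential()
km_bn.add(Dense(1, activation=None, input_shape=(1,)))
km_bn.add(BatchNormalization())  # normalizes the first layer's output per batch
km_bn.add(Dense(1, activation=None))
km_bn.compile('sgd', loss='mse')
km_bn.fit(x_normed, y, epochs=300)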
UPD: It seems this particular behavior is caused by the unscaled target variable. This matches my experiments, which showed that learning occurs only with a very small learning rate (see the target-scaling sketch after the example code).
Example code:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy regression data: y = x^2 + 3 for x = 0..99.
x = np.arange(100)[:, np.newaxis]
f = lambda x: x ** 2 + 3
y = f(x)

# Standardize the inputs only; the target y stays on its original scale.
x_normed = (x - x.mean()) / x.std()

# Two stacked linear layers (no activations), trained with plain SGD on MSE.
km = Sequential()
km.add(Dense(1, activation=None, input_shape=(1,)))
km.add(Dense(1, activation=None))
km.compile('sgd', loss='mse')
km.fit(x_normed, y, epochs=300)
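As noted in the UPD, the likely fix is to standardize the target the same way as the input. Below is a minimal sketch of that variant; the y_normed variable, the km_scaled name, and the prediction de-scaling line are my additions, everything else matches the code above. With both x and y standardized, SGD should no longer require a tiny learning rate:

# Standardize the target as well, mirroring the input scaling.
y_normed = (y - y.mean()) / y.std()

km_scaled = Sequential()
km_scaled.add(Dense(1, activation=None, input_shape=(1,)))
km_scaled.add(Dense(1, activation=None))
km_scaled.compile('sgd', loss='mse')
km_scaled.fit(x_normed, y_normed, epochs=300)

# Map predictions back to the original target scale.
y_pred = km_scaled.predict(x_normed) * y.std() + y.mean()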