
I am trying to write a simple NN-based regressor, and I have noticed that if I take two identical networks, one trained with a mean squared error (MSE) loss and one whose final output is a Gaussian distribution trained with a negative log-likelihood (NLL) loss, the NLL loss performs significantly better. My understanding is that an NLL loss with a Gaussian output distribution is the same as MSE, so should the output errors not be similar?

######
# MSE loss over NN
######
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(800, input_shape=(371,), kernel_regularizer='l2'),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1000, kernel_regularizer='l2'),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1)  # single point prediction
])
model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              metrics=['mae'])


######
# NLL loss over a Normal parameterized by the predicted mean and sigma
######

import tensorflow_probability as tfp

model = tf.keras.Sequential([
    tf.keras.layers.Dense(800, input_shape=(371,), kernel_regularizer='l2'),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1000, kernel_regularizer='l2'),
    tf.keras.layers.LeakyReLU(alpha=0.1),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(2),            # parameters of the output Normal: mean and (transformed) scale
    tfp.layers.IndependentNormal(1)      # turns the 2 outputs into a univariate Normal distribution
])
model.compile(loss=lambda y_t, y_p: -y_p.log_prob(y_t),  # negative log-likelihood
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              metrics=['mae'])
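
A minimal sketch of how point predictions can be read off the probabilistic model after training (x_test is just a placeholder name for the holdout features, shape (n, 371)):

dist = model(x_test)            # the model's output is a tfp distribution, not a tensor
y_mean = dist.mean().numpy()    # point predictions, comparable to the MSE model's output
y_std = dist.stddev().numpy()   # per-sample predicted sigma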

Is there any reason why the NLL loss would perform better? Are the two not equivalent in my example above?

PS: The first model was trained with the MSE loss and the second with the NLL loss. For the comparison, MAE and RMSE of the predictions on a common holdout set were computed after training.

In-sample loss and MAE (out-of-sample numbers are MAE on the holdout set):

  1. MSE loss: loss: 0.0450 - mae: 0.0292, out of sample: 0.055
  2. NLL loss: loss: -2.8638e+00 - mae: 0.0122, out of sample: 0.050
  3. Kernel ridge regression, out of sample: 0.0575
  • Do you train one model on log loss and another on MSE and then test each model using MAE as the metric? Please add this to the question; not everyone knows Keras code. – Dave Feb 08 '21 at 16:11
  • Yes. I have added the required info as suggested – ipcamit Feb 08 '21 at 16:32
  • 1
    With no regularization of the mean and covariance vectors (à la VAEs) there is nothing precluding your network from assigning quasi-zero variance to that normal sampler. – Firebug Feb 08 '21 at 17:09
  • You're comparing on RMSE? The model trained on MSE better have superior (in-sample) performance on RMSE, since MSE and RMSE are equivalent loss functions. (That issue of in-sample vs out-of-sample performance also is worth discussing. On which data are you evaluating your model, training data or a holdout set? Again, please include that in the question; not everyone reads comments!) – Dave Feb 08 '21 at 17:35
  • Added in-sample and out-of-sample MAE errors for both, and a comparison with KRR. @Firebug I am sorry, I did not fully understand; can you please elaborate a bit, or perhaps point to relevant introductory literature? – ipcamit Feb 09 '21 at 04:26
  • This site has a number of questions on VAEs. See https://stats.stackexchange.com/q/321841/60613 – Firebug Feb 09 '21 at 13:25

1 Answer


I finally found the answer in this paper, https://ieeexplore.ieee.org/document/374138, which is also explained and referenced in a blog post.

The paper clearly states that NLL performs better than MSE because the loss function becomes:

$$ \mathrm{NLL} = \sum_i \left[ \frac{\log \sigma^2(x_i)}{2} + \frac{(\mu(x_i) - y_i)^2}{2\sigma^2(x_i)} \right] $$
Now if I assume $\sigma$ to be constant, the loss function becomes equivalent to MSE times a constant, which was the basis of my original statement:

... an NLL loss with a Gaussian output distribution is the same as MSE ...
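
To make that explicit, plugging a constant $\sigma$ into the formula above gives

$$ \mathrm{NLL}\big|_{\sigma(x_i)=\sigma} = \frac{N \log \sigma^2}{2} + \frac{1}{2\sigma^2}\sum_i (\mu(x_i) - y_i)^2 = \text{const} + \frac{N}{2\sigma^2}\,\mathrm{MSE}, $$

so minimizing it over the network output $\mu$ is the same as minimizing the MSE.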

However, in the current network $\sigma(x_i)$ is learned per data point, so the network effectively gives higher weight to data points with lower predicted variance, resulting in improved learning in this case.
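
As a toy illustration of that weighting (made-up numbers, not values from my network): the gradient of the per-point NLL with respect to the predicted mean scales the residual by $1/\sigma^2(x_i)$, so points with low predicted variance dominate the parameter updates.

def nll_grad_mu(y, mu, sigma):
    # d/dmu [ log(sigma**2)/2 + (mu - y)**2 / (2*sigma**2) ] = (mu - y) / sigma**2
    return (mu - y) / sigma**2

print(nll_grad_mu(0.0, 0.1, sigma=0.05))  # ≈ 40  -> small sigma: residual weighted heavily
print(nll_grad_mu(0.0, 0.1, sigma=0.5))   # ≈ 0.4 -> large sigma: residual nearly ignored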

But the paper also mentions that if the dataset is not large enough, NLL results in overfitting, which can be explained on the same basis. I am including the relevant part of the paper below:

[Image: results and discussion section from the referenced paper]
