
Say I have 15 data points (x values), for which I have corresponding y values (say, randomly generated). For learning purposes, I am trying to design a neural network that perfectly matches the input data, i.e. overfits as much as possible. I can't get this done beyond a certain point.

Running millions of epochs, trying networks with 1 to about 4 layers and from 1 to thousands of neurons per layer, using ReLU or sigmoid to force non-linearity: none of it helps, the network still won't overfit. Am I missing something? Is it even possible when I have only one input?

For example, this is the output of a 1-2048-2048-2048-1 network with sigmoids after 200k epochs (practically no change since 60k):

[graph: network output plotted against the 15 training points]

This does not look like overfitting at all. I'm confused and would appreciate any pointers. Maybe I need some "funny" activation function?
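For concreteness, here is a minimal PyTorch sketch of the 1-2048-2048-2048-1 model from the example above (the exact initialization and other details varied between my runs):

    import torch.nn as nn

    # 1 -> 2048 -> 2048 -> 2048 -> 1 with sigmoid activations, as in the example
    model = nn.Sequential(
        nn.Linear(1, 2048), nn.Sigmoid(),
        nn.Linear(2048, 2048), nn.Sigmoid(),
        nn.Linear(2048, 2048), nn.Sigmoid(),
        nn.Linear(2048, 1),  # linear output head for regression
    )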

Update
Thanks for all the comment suggestions. After digging in a bit more, here are my takeaways.

  • Use the full batch, not mini-batch SGD. I was in fact already doing this: if torch's SGD is fed the entire data set as one batch, it is full gradient descent.
  • My loss function was MSE, i.e. the optimizer was minimizing the mean of the squared errors; for this purpose, summing the squared errors works better (see the sketch after this list). The mean sort of anti-overfits here, though as Sycorax points out below, the sum is equivalent to the mean with a larger learning rate.
  • Some datasets might not be perfectly representable by a mathematical function: there may be two identical input points with different assigned outputs (e.g. 3→9 and 3→7, due to randomness). Obviously no NN can capture this, since a function maps each input to exactly one output.
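Putting the first two takeaways together, here is a minimal sketch of the kind of training loop I ended up with: the entire dataset in one batch, and a sum reduction instead of the default mean. The data, learning rate, and epoch count below are placeholders, not my actual values.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Placeholder data: 15 x values with random y values, as in the question
    x = torch.linspace(-1.0, 1.0, 15).unsqueeze(1)
    y = torch.rand(15, 1)

    model = nn.Sequential(
        nn.Linear(1, 2048), nn.Sigmoid(),
        nn.Linear(2048, 2048), nn.Sigmoid(),
        nn.Linear(2048, 2048), nn.Sigmoid(),
        nn.Linear(2048, 1),
    )

    # Sum of squared errors; with full-batch updates this is the same as the
    # mean with a 15x larger learning rate (see Sycorax's comment below)
    loss_fn = nn.MSELoss(reduction="sum")
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(200_000):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)  # whole dataset in one batch = full gradient descent
        loss.backward()
        optimizer.step()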
  • Can you [edit] your post to include your data, information about your architecture, and your learning procedure? – Sycorax Aug 26 '21 at 20:58
  • Just because the ANN fits the data extremely well (or even *perfectly*) does not mean the model is overfitting. That being said, there are many types of ANN; depending on the data and the expectations, you might choose one type over another. I would start there. Do you know what the best type of ANN to employ for your data is? – Kat Aug 26 '21 at 21:36
  • Use the full batch instead of mini-batches – Firebug Aug 26 '21 at 22:37
  • Also, what optimizer are you using? What is the learning rate? – Firebug Aug 26 '21 at 22:38
  • See here for an interactive demonstration: https://cs.stanford.edu/people/karpathy/convnetjs/demo/regression.html – Firebug Aug 27 '21 at 00:52
  • Using the sum instead of the mean for the loss is the same as adjusting the learning rate, so your discovery is that a different learning rate finds a different solution/finds the desired solution more quickly. https://stats.stackexchange.com/questions/358786/mean-or-sum-of-gradients-for-weight-updates-in-sgd/358971#358971 – Sycorax Aug 29 '21 at 19:30
