5

I'm working on a project predicting time series data with an LSTM. I ran the experiment three times, each time with randomly sampled data (about 920,000 lines per run).

I've stacked 3 layers of LSTM cells, used l1(0.01) regularization, used dropout, shuffled the dataset every epoch, and used the Adam optimizer.

But I get the error curves below, which seem to signify overfitting:

[Error curves for the 3 experiments: x-axis = epochs, y-axis = mean squared error; blue line = test set, orange line = training set]

Can somebody give suggestions on what I should try? Or maybe it's an issue with the dataset itself?

Satwik Bhattamishra
  • 1,446
  • 8
  • 24
  • What do the plots show? Please label your data and explain what you think it means. – Sycorax Jun 17 '18 at 02:09
  • Yup, added the notations :) – HyeongGyu Froilan Choi Jun 17 '18 at 02:19
  • Why are there two lines? – Sycorax Jun 17 '18 at 02:27
  • the blue indicates the test set and the orange, train set – HyeongGyu Froilan Choi Jun 17 '18 at 02:29
  • Would you articulate how you made the validation set? What percent of the total data is it? Is it duplicated in the training set? Are there variables on which it should be selectively sub-sampled (stratified)? The first plot does not look like anything is failing. I have had NNs take 20k iterations to start getting close to a reasonable fit, so 10 steps is really very few epochs. Also, the error rates of interior layers may take longer to converge than edge weights because the information at the edge nodes takes longer to effectively diffuse inward. – EngrStudent Jun 28 '18 at 11:57
  • We have a thread collecting general suggestions to improve a neural network when it doesn't generalize well. See: https://stats.stackexchange.com/questions/365778/what-should-i-do-when-my-neural-network-doesnt-generalize-well – Sycorax Jun 25 '20 at 16:50

2 Answers

9

Yeah, that’s overfitting because the test error is much larger than the training error.

A stack of three LSTM layers is hard to train. Try a simpler network and work up to a more complex one. Keep in mind that adding LSTM layers tends to grow the magnitude of the memory cells. Linked memory-forget cells enforce memory convexity (the cell state stays a convex combination of the previous state and the new candidate, so it cannot blow up) and make it easier to train deeper LSTM networks.
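
As a concrete starting point, here is a minimal Keras sketch of a single-layer baseline to grow from; the window length, feature count, and layer width are placeholder values, not taken from the question:

```python
import tensorflow as tf

# Placeholder shapes: 50 time steps, 8 features per step, one regression target.
TIMESTEPS, N_FEATURES = 50, 8

# One LSTM layer with the same regularization ideas (L1, dropout) as the original
# setup; add depth only if this baseline underfits.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(TIMESTEPS, N_FEATURES)),
    tf.keras.layers.LSTM(64, kernel_regularizer=tf.keras.regularizers.l1(0.01)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```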

Learning rate tweaking or even scheduling might also help.
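
For instance, in Keras you could lower the learning rate when the validation loss plateaus, or bake a fixed decay schedule into the optimizer. A sketch with arbitrary numbers:

```python
import tensorflow as tf

# Option 1: halve the learning rate whenever validation loss stops improving.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-5)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, callbacks=[reduce_lr])

# Option 2: a fixed exponential decay built into the optimizer.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```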

In general, fitting a neural network involves a lot of experimentation and refinement. Finding the best network involves tuning a lot of dials together.

Sycorax
  • 76,417
  • 20
  • 189
  • 313
6

Your NN is not necessarily overfitting. Usually, when a network overfits, validation loss goes up as it memorizes the training set; your graph is definitely not doing that. The mere gap between training and validation loss could just mean that the validation set is harder or has a different distribution (unseen data). Also, I don't know what the error value means in your problem, but maybe 0.15 is not a big difference and it is just a matter of scaling.

As a suggestion, you could try a few things that worked for me:

  1. Add a small dropout to your NN (start with 0.1, for example);
  2. You can add dropout to your RNN as well, but it is trickier: you have to use the same dropout mask for every time step instead of a random mask at each step (see the sketch after this list);
  3. You could experiment with the NN size; maybe the answer is not making it smaller but actually bigger, so your NN can learn more complex functions. To know whether it is underfitting or overfitting, try plotting predicted vs. actual values;
  4. You could do feature selection/engineering -- try to add more features, or remove the ones that you think are just adding noise;
  5. If your NN is simply input -> RNN layers -> output, try adding a few fully connected layers before/after the RNN, and use Mish as an activation function instead of ReLU;
  6. For the optimizer, instead of Adam, try using Ranger.
  7. The problem could be the loss function. Maybe your labels are very sparse (a lot of zeros), so the model learns to predict all zeros (the sudden drop at the beginning) and can't progress further after that. To handle situations like that you can try a different loss, like BCE with pos_weight, dice loss, focal loss, etc.
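
A rough Keras sketch of points 1, 2, and 5 together; the input shape and layer sizes are placeholders, and Mish is defined by hand rather than assuming a particular TF version:

```python
import tensorflow as tf

def mish(x):
    # Mish activation: x * tanh(softplus(x))
    return x * tf.math.tanh(tf.math.softplus(x))

TIMESTEPS, N_FEATURES = 50, 8  # placeholder input shape

model = tf.keras.Sequential([
    tf.keras.Input(shape=(TIMESTEPS, N_FEATURES)),
    # Fully connected layer before the RNN (point 5), applied at each time step.
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(32, activation=mish)),
    # dropout acts on the layer inputs (point 1); recurrent_dropout drops the
    # recurrent state and reuses the same mask at every time step (point 2).
    tf.keras.layers.LSTM(64, dropout=0.1, recurrent_dropout=0.1),
    # Fully connected layer after the RNN (point 5).
    tf.keras.layers.Dense(32, activation=mish),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```

Note that setting recurrent_dropout > 0 disables the cuDNN kernel, so training will be slower.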

Good luck!

Felipe Mello
  • 176
  • 1
  • 3