
I am using an LSTM network in Keras. During training, the loss fluctuates a lot, and I do not understand why that happens.

Here is the NN I was using initially: [image: initial model architecture]

And here are the loss & accuracy during training: [image: loss and accuracy curves during training]

(Note that the accuracy actually does reach 100% eventually, but it takes around 800 epochs.)

I thought that these fluctuations occur because of the Dropout layers and/or changes in the learning rate (I used rmsprop/adam), so I made a simpler model: [image: simplified model architecture]

I also used SGD without momentum and decay. I have tried different values for lr but still got the same result.

sgd = optimizers.SGD(lr=0.001, momentum=0.0, decay=0.0, nesterov=False)
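
For reference, the learning-rate sweep looked roughly like this (build_model() below is only a stand-in for the simpler architecture shown above; the exact layer sizes are an assumption):

from keras import optimizers
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_model():
    # Stand-in for the simpler model pictured above (layer sizes are an assumption)
    model = Sequential()
    model.add(LSTM(32, input_shape=(100, 1)))
    model.add(Dense(6, activation='softmax'))
    return model

for lr in [0.01, 0.001, 0.0001]:
    sgd = optimizers.SGD(lr=lr, momentum=0.0, decay=0.0, nesterov=False)
    model = build_model()
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    history = model.fit(train_x, train_y, epochs=500)  # the loss fluctuated for every lr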

But I still got the same problem: the loss was fluctuating instead of just decreasing. I had always thought that the loss is supposed to go down gradually, but here it does not seem to behave like that.

So:

  1. Is it normal for the loss to fluctuate like that during training? And why would it happen?

  2. If not, why would this happen for the simple LSTM model with the lr parameter set to a really small value?

Thanks. (Please note that I have checked similar questions here, but they did not help me resolve my issue.)

Upd.: loss for 1000+ epochs (no BatchNormalization layer, Keras' unmodified RMSprop): [image: loss over 1000+ epochs]

Upd. 2: For the final graph:

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit(train_x, train_y, epochs = 1500)

Data: sequences of values of the current (from the sensors of a robot).

Target variable: the surface on which the robot is operating (as a one-hot vector, 6 different categories).

Preprocessing:

  1. changed the sampling frequency so the sequences are not too long (the LSTM does not seem to learn otherwise);
  2. cut the sequences into smaller sequences, all of the same length (100 timesteps each);
  3. checked that each of the 6 classes has approximately the same number of examples in the training set.

No padding.
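
A rough sketch of these preprocessing steps (the resampling factor, the array names and the absence of overlap between windows are assumptions, not my exact code):

import numpy as np

# raw_currents: list of 1-D arrays, one per recording session (assumed names)
# raw_labels:   list of integer surface labels (0..5), one per recording session

DOWNSAMPLE = 10   # keep every 10th measurement; the actual factor is an assumption
WINDOW = 100      # fixed length of the smaller sequences (timesteps)

sequences, labels = [], []
for signal, label in zip(raw_currents, raw_labels):
    signal = signal[::DOWNSAMPLE]                       # 1. lower the sampling frequency
    for i in range(len(signal) // WINDOW):              # 2. cut into 100-timestep pieces
        sequences.append(signal[i * WINDOW:(i + 1) * WINDOW])
        labels.append(label)

train_x = np.array(sequences)[..., np.newaxis]          # shape: (#sequences, 100, 1)
train_y = np.eye(6)[np.array(labels)]                   # one-hot labels, shape: (#sequences, 6)

print(train_y.sum(axis=0))                              # 3. check the class balance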

Shape of the training set (#sequences, #timesteps in a sequence, #features):

(98, 100, 1) 

Shape of the corresponding labels (as a one-hot vector for 6 categories):

(98, 6)

Layers:

[image: summary of the model layers]

The rest of the parameters (learning rate, batch size) are the same as the defaults in Keras:

keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0)

batch_size: Integer or None. Number of samples per gradient update. If unspecified, it will default to 32.
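
So the fit call from Upd. 2 is (roughly) equivalent to writing the defaults out explicitly; only batch_size changes in the later updates:

from keras import optimizers

rmsprop = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0)  # Keras defaults
model.compile(loss='categorical_crossentropy', optimizer=rmsprop, metrics=['accuracy'])
history = model.fit(train_x, train_y, epochs=1500, batch_size=32)         # default batch size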

Upd. 3: The loss for batch_size=4:

[image: loss with batch_size=4]

For batch_size=2 the LSTM did not seem to learn properly (the loss fluctuated around the same value and did not decrease).

Upd. 4: To check that the problem is not just a bug in the code, I made an artificial example (2 classes that are not difficult to classify: cos vs arccos). Loss and accuracy during training for these examples: [image: loss and accuracy for the artificial example]
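
The artificial data were generated roughly like this (a sketch; the number of examples and the noise level are assumptions):

import numpy as np

n_per_class, timesteps = 50, 100
t = np.linspace(0, 1, timesteps)

# Class 0: noisy cos curves; class 1: noisy arccos curves
cos_seqs = np.cos(2 * np.pi * t) + 0.1 * np.random.randn(n_per_class, timesteps)
arccos_seqs = np.arccos(2 * t - 1) + 0.1 * np.random.randn(n_per_class, timesteps)

x = np.concatenate([cos_seqs, arccos_seqs])[..., np.newaxis]   # shape: (100, 100, 1)
y = np.eye(2)[[0] * n_per_class + [1] * n_per_class]           # one-hot, 2 classes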

– Valeria
  • What about introducing your problem properly (what is the research question you're trying to answer, describe your data, show your model, etc.)? You only show us your layers, but we know nothing about the data, the preprocessing, the loss function, the batch size, and many other details which may influence the result – DeltaIV May 14 '18 at 07:40
  • Other things that can affect stability are sorting, shuffling, padding and all the dirty tricks which are needed to get mini-batch trained RNNs to work with sequences of widely variable length. The huge spikes you get at about 1200 epochs remind me of a case where I had to deal exactly with that. Who knows, maybe [attention is all you need](https://arxiv.org/abs/1706.03762) – DeltaIV May 14 '18 at 07:58
  • Finally, if you use SGD you should probably **not** use a constant learning rate, but don't take my words on that because I'm only sure of that for CNNs applied to image classification. I don't know if the same holds for RNNs used for, what? time series prediction? NLP? You didn't tell us – DeltaIV May 14 '18 at 08:06
  • Updated. Hope this is enough information to get a hint about what is going on. – Valeria May 14 '18 at 15:25
  • Your last plot (1000 epochs loss) is a little crazy. Those spikes of loss. – Aksakal May 14 '18 at 15:29
  • I know it is crazy. That is exactly why I am here: to understand why it is like this / how possibly to fix it. – Valeria May 14 '18 at 16:27
  • This is better now, but we still miss details: how many sensors does your robot have? Do all of them measure continuous variables, or do some of them have discrete values (such as for example on/off-black/white)? Are there any missing values? You have a training set of size 98 (i.e., 98 sequences of length 100), and batches of size 32. They're not very small batches wrt the size of the training set...do you get better results if you use **real** mini-batches, i.e., something like `batch_size=4`? What about the size of the test set? Also, do I read correctly that you have only **one** sensor? – DeltaIV May 14 '18 at 16:31
  • It may be easier if you add code (properly commented), but if you believe the issue is a **coding** error, rather than a statistical modeling issue, then the appropriate site is Stack Overflow. Right off the top of my head, a few things come to mind: 1) would [stateful layers](https://keras.io/layers/recurrent/) make sense in your case? If subsequences in different batches come from the same original sequence, it might. But there should be a one-to-one mapping btw samples (sequences) in following batches...also, your layers have a number of hidden units which is equal to the 1/ – DeltaIV May 14 '18 at 16:45
  • 2/ size of the subsequences (100). I usually use layers with _less_ hidden units than the length of the average sample...I don't know if this could be relevant. Finally, this can be an issue: **"changed the sampling frequency so the sequences are not too long (LSTM does not seem to learn otherwise)"**. How did you perform this downsampling? Did you just retain one sample out of every 100, say? Did you take the median of fixed duration windows? You may be discarding useful signal this way. I understand the need for shorter length sequences, but there are other solutions to achieve this. – DeltaIV May 14 '18 at 16:48
  • One last comment: did you normalize/standardize the values of your sensor? Also, why not scrapping neural networks altogether? For a low-dimensional time series forecasting problem, classic statistical models may work better. In that case, you may want to ask a new question and include time series plots, lag plots, acf plots, etc. – DeltaIV May 14 '18 at 17:36
  • 1. The robot has many sensors but I only use the measurements of current. – Valeria May 14 '18 at 20:51
  • Why LSTM? Why not CNN? – Aksakal May 14 '18 at 21:21
  • 2. I would say that it is not a coding error. I tried to feed the artificial data to check if the code works, and it works just fine. So the problem is really about the specific data/model/preprocessing... (see the last update for the graphs). – Valeria May 14 '18 at 21:53
  • Why CNN? I think RNN/LSTM is a more common choice for the type of data I am using, even though CNNs are also used sometimes. Anyway, I am not trying to find the most precise model, I am just experimenting with LSTM. – Valeria May 14 '18 at 22:00
  • The `batch_size=4` experiment seems incomplete. A moving average of the loss is surely lower wrt the `batch_size=32` (try using a `loess` smoother to convince yourself of this), but since in one case you stopped after 500 epochs and in the other after 1400, we can't tell for sure. But **what is your goal** actually? Do you want to get the best generalization error? Then to keep training forever (1400 epochs is forever, for such a low dimensional problem) makes no sense. Why don't you stop when validation error is minimum, as it's common practice? – DeltaIV May 15 '18 at 09:06
  • ps when you reply to a comment, cite the commenter you're replying to with @DeltaIV (e.g.), so that we know that you replied. Concerning the coding error, your experiment is suggestive, but not enough: 1) deep neural networks have this bad habit of running to "some kind of convergence" even when there are actual bugs. [See examples](https://medium.com/@keeper6928/how-to-unit-test-machine-learning-code-57cf6fd81765). 2) You had to modify the code to perform the other tests. Did you perform all the same steps, including data preprocessing? I won't help more until I see the actual code. – DeltaIV May 15 '18 at 09:15
  • @DeltaIV https://github.com/iegorval/neural_nets/blob/master/Untitled0.ipynb – Valeria May 15 '18 at 10:29
  • @Valeria woah thanks! I have two questions about the code, can we talk in [chat](https://chat.stackexchange.com/rooms/info/77513/chat-deltaiv-and-valeria?tab=general)? We could also talk here, but 1) we're getting dangerously close to the maximum limit of comments and 2) maybe it's better if I ask you in chat, and then you decide if it's ok to share the answers here or not. – DeltaIV May 15 '18 at 11:24
  • You have just one input, it seems, one sensor data stream. You're mapping this to one of a few surfaces. I just don't see why LSTM would be a natural choice. I'd even start with a univariate time series methods as a benchmark. E.g. get the spectrum with FFT, and analyze it with CNN and try to classify. – Aksakal May 15 '18 at 16:55

3 Answers


There are several reasons that can cause fluctuations in training loss over epochs. The main one, though, is that almost all neural nets are trained with some form of stochastic gradient descent. This is why the batch_size parameter exists: it determines how many samples are used to make one update to the model parameters. If you use all the samples for each update, you should see the loss decrease and finally reach a limit. Note that there are other sources of stochastic behavior in the loss as well.
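
For example, if you set batch_size to the size of your whole training set, you get plain (deterministic) gradient descent with a single update per epoch, and most of the epoch-to-epoch noise should disappear. A minimal sketch, assuming the train_x/train_y arrays from the question:

history = model.fit(train_x, train_y,
                    epochs=500,
                    batch_size=len(train_x),  # = 98 in your case: one update per epoch
                    shuffle=False)            # shuffling is irrelevant with a single batch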

This explains why we see oscillations in general. But in your case, they are larger than normal, I would say. Looking at your code, I see two possible sources:

  1. Large network, small dataset: You are training a relatively large network with 200K+ parameters on a very small number of samples, ~100. To put this into perspective, you want to learn 200K parameters, i.e. find a good local minimum in a 200K-dimensional space, using only 100 samples (see the sketch after this list for a quick way to check this ratio). Thus, you might end up just wandering around rather than locking in on a good local minimum. (The wandering is also due to the second reason below.)

  2. Very small batch_size: With a very small batch_size, you are in effect trusting every small portion of the data points. Say one of your data points is mislabeled; combined with only 2-3 correctly labeled samples, it can produce an update that does not decrease the global loss but increases it, or throws the parameters away from a local minimum. With a larger batch_size, such effects are reduced. Along with other reasons, it is good to keep batch_size above some minimum; making it too large, on the other hand, slows training down. Therefore, batch_size should be treated as a hyperparameter.
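
A quick way to check the ratio mentioned in point 1 (a sketch; model is your compiled Keras model):

n_params = model.count_params()     # total number of parameters in the network
n_samples = train_x.shape[0]        # 98 sequences in your case
print("parameters per training sample: %.0f" % (n_params / float(n_samples)))
# A ratio in the thousands means the network can fit the data in many different ways,
# so the optimizer tends to wander between near-equivalent minima.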

– SaTa

Your loss curve doesn't look so bad to me. It will definitely "fluctuate" up and down a bit; as long as the general trend is downward, this makes sense.

Batch size will also play into how your network learns, so you might want to optimize it along with your learning rate. Also, I would plot the entire curve (until it reaches 100% accuracy / minimum loss). It sounds like you trained for 800 epochs but are only showing the first 50 - the whole curve will likely tell a very different story.
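
For example, something along these lines will show the full curves, assuming you keep the History object returned by model.fit:

import matplotlib.pyplot as plt

history = model.fit(train_x, train_y, epochs=800)

plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['acc'], label='accuracy')  # the key is 'acc' in older Keras versions
plt.xlabel('epoch')
plt.legend()
plt.show()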

  • You are right. Besides, after I re-ran the training, it was even less stable than before, so I am almost sure I am missing some error. I have updated the post with the training for 1000+ epochs. – Valeria May 14 '18 at 07:29

The fluctuations are normal within certain limits; they come from the fact that you are using a heuristic (stochastic) optimization method, but in your case they are excessive. Despite that, the performance still moves in a definite direction, so the system does work. From the graphs you posted, the problem depends on your data, so it is a difficult training problem. If you have already tried changing the learning rate, try changing the training algorithm.

You should also test your data: first estimate the Bayes error rate with a KNN classifier (use the regression trick if you need it); this tells you whether the input data contain the information you need in the first place. Then try the LSTM without validation or dropout, to verify that it has the capacity to reach the result you need; if the training algorithm is not suitable, you should see the same problems even without validation or dropout. Only at the end, adjust the training and validation sizes to get the best result on the test set. Statistical learning theory is not a topic that can be covered in one go; we must proceed step by step.
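
As a sketch of that KNN check (using scikit-learn, with the flattened sequences as features; the number of neighbours here is an arbitrary choice):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x_flat = train_x.reshape(len(train_x), -1)   # flatten each (100, 1) sequence to 100 features
y_int = train_y.argmax(axis=1)               # back from one-hot to integer labels

knn = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn, x_flat, y_int, cv=5)
print("KNN accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
# If even KNN does not do much better than chance (1/6), the inputs probably do not
# contain enough information to separate the surfaces, whatever the network does.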