
I have trained a CNN with EarlyStopping, and I wonder whether I should drop EarlyStopping rather than waste 20% of the training data on validation, because it looks like the validation loss doesn't increase after 50 training epochs (please see the image).

Sorry for this simple question, but I'm a beginner and I'm trying to understand when EarlyStopping is really necessary and when it is superfluous.

[Image: learning curves over training epochs — accuracy in % (upper plot) and categorical cross-entropy loss (lower plot), for training and validation data]

Code Now
  • How are you splitting your data at the moment? 80% training and 20% test? – Janosch Jan 14 '20 at 09:17
  • @Tim 20% Test, 20% Validation, 60% Training. – Code Now Jan 14 '20 at 09:56
  • @Tim The problem is that the training data set is very small, only 500 images. – Code Now Jan 14 '20 at 10:42
  • A more common split is 80%|10%|10%. So you can retain the same amount of training samples but validate on fewer. – Janosch Jan 14 '20 at 11:44
  • @Tim If I understand that correctly, you think, based on the learning curve shown above, that EarlyStopping should still be used, but with only 10% for validation? – Code Now Jan 14 '20 at 11:52
  • Well, this is a general problem and is related to EarlyStopping. You do not want to train your model for too long: at some point your training loss will keep decreasing, but your val-loss will remain the same or increase again. So you want to stop training at the point where your validation loss no longer improves; after that you start overfitting. You can avoid this by implementing an early stopping rule. – Janosch Jan 14 '20 at 11:57
  • What are the y-axis units, i.e. what loss functions are plotted? I suspect the upper is % correct. For that, 80 correct out of 100 tested images gives a 95% confidence interval of 71-87%, i.e. you cannot be sure to see any real change after epoch ≈ 5 because the validation sample size is too small. This would be even worse with a 10% split: the confidence interval for 40 correct of 50 tested is 67-89%. (A short sketch reproducing these intervals follows the thread.) – cbeleites unhappy with SX Jan 14 '20 at 16:45
  • @cbeleitessupportsMonica The upper plot is the accuracy in percent and the lower is categorical_cross_entropy. – Code Now Jan 14 '20 at 17:54
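For reference, a minimal sketch (assuming statsmodels is available) that reproduces the confidence intervals quoted in the comment above using Wilson score intervals; other interval methods give slightly different bounds:

```python
from statsmodels.stats.proportion import proportion_confint

# 95% Wilson score intervals for the two validation-set sizes mentioned
# above: 80/100 correct (20% split) and 40/50 correct (10% split).
for correct, tested in [(80, 100), (40, 50)]:
    low, high = proportion_confint(correct, tested, alpha=0.05, method="wilson")
    print(f"{correct}/{tested} correct: {low:.0%}-{high:.0%}")
```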

1 Answer


One common way of splitting the data is 80%/10%/10% (training/validation/test). EarlyStopping is used to prevent the model from overfitting. You could also do the "early stopping" by hand.

You could run the model, see at what point it starts to overfit, and then choose the model from the appropriate epoch (for which you need to save those models while training). Using EarlyStopping just automates this process, and you get additional parameters such as patience with which you can adapt the early-stopping rule.
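As a minimal sketch of how this looks in Keras (the model and data variables are placeholders, not from the question):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once the validation loss has not improved for `patience` epochs,
# and roll back to the weights from the best epoch seen so far.
early_stop = EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)

# `model`, `x_train`, `y_train`, `x_val`, `y_val` are assumed to exist.
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=200,  # upper bound; early stopping usually ends training sooner
    callbacks=[early_stop],
)
```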

In your example you train your model for too long. You should stop training at the latest around epoch 30, after which the validation loss starts to increase again. But you could already stop at epoch 10, as your loss improves only very slowly from then on.

EarlyStopping rules just help to automate this detection. But in general you should always stop training when the validation error starts to increase.

It can be helpful to split 80:10:10 rather than just 80:20, because deciding when to stop training based on the validation set can itself overfit to the validation set; the separate test set then still gives an unbiased estimate.
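A sketch of such a split, assuming scikit-learn's train_test_split and placeholder arrays x and y:

```python
from sklearn.model_selection import train_test_split

# First hold out 20% of the data, then split that half into validation
# and test, giving 80% train / 10% validation / 10% test overall.
x_train, x_tmp, y_train, y_tmp = train_test_split(x, y, test_size=0.2,
                                                  random_state=42)
x_val, x_test, y_val, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5,
                                                random_state=42)
```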

Janosch
  • OK, so should the setting of patience also depend on the learning rate? If I have a very low learning rate, e.g. the Adam optimizer with lr=0.0001, would I then have to assign a higher value to patience (e.g. 20 epochs or more)? Or better: if I have enough time and have set 'restore_best_weights=True' (Keras EarlyStopping), could I also choose a higher value for patience? – Code Now Jan 14 '20 at 12:13
  • While it makes sense to increase the patience when the learning rate is decreased, they are not related, and you should try to find a suitable lr independently of the patience you are using. To be honest, I do not use Keras that much, so I am not sure how "restore_best_weights" is used, but I do know there is another callback you can add (ModelCheckpoint) which stores the best model. – Janosch Jan 14 '20 at 12:24
  • Maybe you'll allow me one more question: if you wanted to optimize hyperparameters such as learning rate and dropout rate via grid search, would you then use a fixed number of epochs? – Code Now Jan 14 '20 at 15:15
  • I am not sure what the consensus is here, but I would not (unless you care about speed and want to train a model within a limited number of epochs). I would use EarlyStopping; that way you can guarantee that all models with different hyperparameters train to the "optimal" loss. In the case of hyperparameter tuning you need both a validation and a test set, for example 80:10:10. – Janosch Jan 16 '20 at 14:22
  • Y. Bengio recommends in his paper "Practical Recommendations for Gradient-Based Training of Deep Architectures" https://arxiv.org/abs/1206.5533 (see pages 9-10) that it is useful to turn early stopping off when analyzing the effect of individual hyperparameters, because early stopping can hide the overfitting effect of other hyperparameters. Therefore I wonder if this couldn't also happen in grid search when using early stopping for each model. – Code Now Jan 16 '20 at 15:06
  • He is not really clear about the context in which he wants to analyze the effect of an individual hyperparameter. I think if you are interested in the general effect of a hyperparameter on training, then it makes sense to use a (large) common number of epochs. But from my point of view, using a fixed number of epochs in the optimization process does not optimize for the best model, but for the best model given the limited epochs you have. If speed is important, then it might make sense to limit the training time. – Janosch Jan 17 '20 at 07:42
  • Here is a similar issue, but there part of the early stopping is itself a hyperparameter: https://stats.stackexchange.com/questions/422671/early-stopping-together-with-hyperparameter-tuning-in-neural-networks – Janosch Jan 17 '20 at 07:42
  • Yes, I think you're right that early stopping should be used in grid search. It seems to make more sense, especially when optimizing the learning rate. If I understand correctly, after each split into training and validation data, a part of the data would have to be used for early stopping. By that I mean the validation data used for early stopping should change with each iteration of the grid search, so that a different validation set is used for early stopping each time, right? (See the sketch below.) – Code Now Jan 17 '20 at 20:23
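A minimal sketch of the setup discussed above; build_model is a hypothetical factory and x_train, y_train are placeholder arrays, with the validation split redrawn on every grid-search iteration:

```python
import itertools

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

learning_rates = [1e-3, 1e-4]
dropout_rates = [0.2, 0.5]

results = []
for lr, dropout in itertools.product(learning_rates, dropout_rates):
    # Redraw the validation split each iteration so that early stopping
    # is not repeatedly tuned against the same held-out images.
    x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train,
                                                test_size=0.1)

    model = build_model(lr=lr, dropout=dropout)  # hypothetical factory
    early_stop = EarlyStopping(monitor="val_loss", patience=10,
                               restore_best_weights=True)
    history = model.fit(x_tr, y_tr, validation_data=(x_val, y_val),
                        epochs=200, callbacks=[early_stop], verbose=0)
    results.append((lr, dropout, min(history.history["val_loss"])))

# Pick the configuration with the lowest validation loss, then evaluate
# it once on the untouched test set.
best_lr, best_dropout, best_loss = min(results, key=lambda r: r[2])
```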