
Is it common practice to apply data augmentation to training set only, or to both training and test sets?

rodrigo-silveira
  • I thought test time augmentation is pretty common these days. I first read about it in the 2012 paper here: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf – Ryan Zhang Jan 09 '19 at 17:10

3 Answers


In terms of the concept of augmentation, i.e. making the data set bigger, we'd tend to augment only the training set. We'd evaluate the result of different augmentation approaches on a validation set.

However, as @Łukasz Grad points out, we might need to apply a similar procedure to the test set as was applied to the training set. This is typically so that the input data from the test set resembles that of the training set as closely as possible. For example, @Łukasz Grad gives the example of image cropping, where we'd need to crop the test images too, so they are the same size as the training images. However, in the case of the training images, we might use each training image multiple times, with crops at different locations/offsets. At test time we'd likely either take a single centred crop, or take random crops and average the predictions.
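As a minimal sketch of what that looks like in practice (assuming PyTorch/torchvision, which the question doesn't specify; any framework with similar transforms would do):

    from torchvision import transforms

    # Training: each epoch sees a different random crop/flip of every image;
    # this is where the actual augmentation happens.
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # Test/validation: a single deterministic centre crop, used only so the
    # inputs match the size and distribution of the training crops.
    test_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])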

Running the augmentation procedure against the test data is not to make the test data bigger/more accurate, but just to make the input data from the test set resemble the input data from the training set, so we can feed it into the same net (e.g. same dimensions). We'd never consider the test set 'better' in some way because an augmentation procedure was applied to it. At least, that's not something I've ever seen.

On the other hand, for the training set, the point of the augmentation is to reduce overfitting during training. And we evaluate the quality of the augmentation by then running the trained model against our more-or-less fixed test/validation set.

Hugh Perkins
  • would you agree that model selection and other hyperparameter selection should only be done on the val set (which most likely would benefit from augmentation)? The test set should remain untouched for selection purposes and be used very carefully, likely only to report test errors. Do you agree or what do you think? I do think augmenting the test set can't hurt, except perhaps that it takes longer to evaluate the model test error and makes it less deterministic. – Charlie Parker Dec 17 '21 at 18:53
  • I also want to add that not augmenting the test set can be a very good idea, because then you see whether the augmentation really helped generalize - since you are evaluating on the original test set. In a way the augmented examples aren't "actually" in the data set, so putting them in the test set might make testing a "harder challenge" but makes it harder to know if the augmentation is helping (since you also changed the test set). I'd argue that it's better to leave the test set untouched and augment the train & val sets. – Charlie Parker Dec 17 '21 at 18:55
  • Though leaving the val set untouched might make your generalization to the test set better if the test set is also untouched (since neither is augmented - so you're selecting a model under conditions more similar to what you will see in real testing). Augmenting the train set is likely always good (as long as it helps improve results in your domain). – Charlie Parker Dec 17 '21 at 18:56

Typically, data augmentation for training convolutional neural networks is done only to the training set. I'm not sure what benefit augmenting the test data would achieve, since the test data is used primarily for model selection and evaluation, and augmenting it adds noise to your measurement of those quantities.

MachineEpsilon
  • I disagree, e.g. most papers using the ImageNet dataset train and test their classifiers with random cropping, which is a form of augmentation – Łukasz Grad Dec 31 '17 at 02:20
  • I certainly could be wrong; do you mind providing a reference? I took a quick sample of papers - AlexNet https://www.nvidia.cn/content/tesla/pdf/machine-learning/imagenet-classification-with-deep-convolutional-nn.pdf, ResNet https://arxiv.org/pdf/1512.03385.pdf, and YOLO9000 https://arxiv.org/pdf/1612.08242.pdf - and it seems like none of these do augmentation on the test set (so far as I can tell). – MachineEpsilon Dec 31 '17 at 02:41
  • In a sense I think you're both right: if a net was trained with random crop, the test images will tend to be cropped too. But they might not be a random crop: they might be a centre crop. But not always. I'm not really sure this is 'augmentation' of the test set as such, so much as ensuring the distribution of the input data in the test set somewhat matches that of the training set. But that's semantics really: from a technical point of view, one might need to do *something* to the test set so that it resembles the training set, similar to how dropout works at test time. – Hugh Perkins Dec 31 '17 at 02:55
  • Yes, that makes sense. As best as I can see, cropping is a special case because it affects the model architecture by changing the size of the input layer, whereas other augmentation transformations (such as adding noise, reflections, blurring) do not. – MachineEpsilon Dec 31 '17 at 04:52
  • @Machineepsilon Here is the first example I could find, from the Inception paper, table 4: https://arxiv.org/pdf/1512.00567.pdf – Łukasz Grad Dec 31 '17 at 09:09
  • The test set should always remain untouched. Do hyperparameter tuning, e.g. model selection, on a validation set. The val set could have augmentation to help select these hyperparameters better, but it's unclear how much this would help. The train set for sure needs it. – Charlie Parker Dec 17 '21 at 18:47

Complementing the other answers, let me add my 2 cents regarding test-time data augmentation.

Data augmentation can be also performed during test-time with the goal of reducing variance. It can be performed by taking the average of the predictions of modified versions of the input image.

Dataset augmentation may be seen as a way of preprocessing the training set only. Dataset augmentation is an excellent way to reduce the generalization error of most computer vision models. A related idea applicable at test time is to show the model many different versions of the same input (for example, the same image cropped at slightly different locations) and have the different instantiations of the model vote to determine the output. This latter idea can be interpreted as an ensemble approach, and it helps to reduce generalization error. (Deep Learning Book, Chapter 12).

It's a very common practice to apply test-time augmentation. AlexNet and ResNet do that with the 10-crop technique (taking patches from the four corners and the center of the original image, and also mirroring them). Inception goes further and generates 144 patches instead of only 10. If you check Kaggle and other competitions, most winners also apply test-time augmentation.
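For concreteness, here is a rough sketch of 10-crop test-time augmentation (assuming PyTorch/torchvision; model and image below are hypothetical placeholders for a trained classifier and an input PIL image):

    import torch
    from torchvision import transforms

    # Produce ten 224x224 patches: four corners + centre, plus their mirrors.
    tta_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.TenCrop(224),
        transforms.Lambda(lambda crops: torch.stack(
            [transforms.ToTensor()(c) for c in crops])),
    ])

    crops = tta_transform(image)   # shape: (10, 3, 224, 224)
    with torch.no_grad():
        logits = model(crops)                              # one prediction per crop
        probs = torch.softmax(logits, dim=1).mean(dim=0)   # average over the 10 crops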

I'm the author of a paper on data augmentation (code) in which we experimented with training and testing augmentation for skin lesion classification (a low-data task). In some cases, using strong data augmentation on training alone is marginally better than not using data augmentation, while using train and test augmentation increases the performance of the model by a very significant margin.

Fábio Perez
  • Just want to add one thing here. If I apply rotations at 5 different angles and random cropping to the entire dataset and then divide it into training, testing, and validation sets, will that make the evaluation of the dataset totally incorrect? – Aadnan Farooq A Jun 14 '19 at 09:24
  • @AadnanFarooqA Yes, your validation set will be contaminated. – Fábio Perez Jun 14 '19 at 18:23
  • So what would be the best data arrangement: data augmentation on the training data, with validation and testing remaining as original data? If that's so, let's say the training data after augmentation will be 10000 images and validation and testing will be 500 images each? – Aadnan Farooq A Jun 15 '19 at 01:51
  • if you augment the test set, does that not confuse (make harder to compare) the test values you report in the final paper? (also, model selection and hyperparameter selection should only be done on the val set; the test set should always remain untouched as much as possible). – Charlie Parker Dec 17 '21 at 18:49
  • You shouldn't be using the test set to choose models, so whether you augment the test set should not make a difference. So your paper confuses me a little... – Charlie Parker Dec 17 '21 at 19:00