When training a model, it is increasingly common to augment the data. Posts indicate that only the training set should be augmented. On the other hand, it is common to split a dataset according to ratios such as 70% (train), 15% (validation), 15% (test).

My question is:

  • When using augmentation techniques, should this ratio still hold after augmentation (meaning that the number of items in the validation and test sets determines how much the training set can be augmented),
  • or should the dataset be split before augmentation (meaning that the ratios no longer hold once the training set has been augmented)?

Any publications regarding this topic?

tn3m3lc

1 Answer

Using your ratios, it would go something like this.

  1. Start with a million images.

  2. Set aside $700,000$ images for training, another $150,000$ for validation, and the final $150,000$ for test.

  3. Augment the $700,000$ training images to give yourself $700,001$ or $700,000,000$ images in the training set.

  4. Train on the $700,001$ or $700,000,000$ training images.

  5. Validate and test your model on the validation set of $150,000$ and the test set of $150,000$, respectively.
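
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `split_then_augment` and `jitter` are hypothetical names, the "images" are plain numbers, and the augmentation is a toy random perturbation rather than a real image transform. The point it shows is the order of operations: split first, then augment the training portion only.

```python
import random


def jitter(x, rng):
    """Toy augmentation (assumption): perturb a numeric sample slightly."""
    return x + rng.uniform(-0.1, 0.1)


def split_then_augment(items, augment, copies=2,
                       ratios=(0.70, 0.15, 0.15), seed=0):
    """Split into train/val/test first, then augment the training part only."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains

    # Keep the original training samples and append augmented copies.
    # Validation and test samples are never passed to `augment`.
    train_aug = list(train)
    for x in train:
        for _ in range(copies):
            train_aug.append(augment(x, rng))

    return train_aug, val, test


data = list(range(1000))
train_aug, val, test = split_then_augment(data, jitter, copies=2)
print(len(train_aug), len(val), len(test))  # 2100 150 150
```

The 70/15/15 ratio describes the original data only. After augmentation the training set is larger (here 2100 against 150 and 150), and that imbalance is expected rather than something to correct.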

Dave
  • Not sure that I understood the answer (sorry). Is that the way I should build my augmentation? Or are you pointing out a flaw in my question (which I don't see)? – tn3m3lc Oct 20 '21 at 19:51
  • You have your data. Create a training set. Now pretend nothing but the training set exists. Augment the training set; train your model on the augmented data. Validate and test on the holdout data sets. – Dave Oct 20 '21 at 19:53