
This may be a silly idea. I have a huge dataset with billions of data points, and it would take a long time to run an epoch over the whole dataset, so I was thinking of using the following strategy:

  1. Train a deep NN on a small sample of the data, constantly saving checkpoint models, and stop it when validation loss plateaus.
  2. Take the best model from step 1, append new data to the training set, and train again until validation loss plateaus.
  3. Repeat step 2, appending more data each time.

My intuition is as follows: perhaps I don't need all the data to train a good model. At every stage, the model learns what it can from the current small dataset (without overfitting), and once it starts to overfit we simply give it more data. This strategy would ensure that the model uses only the minimum amount of data needed for good performance, and would decrease training time.
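In code, the idea is roughly the sketch below (scikit-learn's `SGDClassifier` stands in for the deep NN; the toy data, chunk size, and patience are illustrative assumptions, and checkpoint saving is omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Toy stand-in for the huge dataset (illustrative only).
X = rng.normal(size=(100_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_val, y_val = X[:5_000], y[:5_000]      # held-out validation set
X_pool, y_pool = X[5_000:], y[5_000:]    # pool to draw training data from

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y)

chunk = 5_000    # how much data to append per round (assumption)
n_used = chunk   # start with a small sample
best_loss = np.inf
patience = 3     # epochs of no improvement before adding data

while n_used <= len(X_pool):
    X_cur, y_cur = X_pool[:n_used], y_pool[:n_used]
    stalls = 0
    # Steps 1-2: train on the current subsample until validation loss plateaus.
    for _ in range(50):  # cap epochs per round
        model.partial_fit(X_cur, y_cur, classes=classes)
        val = log_loss(y_val, model.predict_proba(X_val))
        if val < best_loss - 1e-4:
            best_loss, stalls = val, 0
        else:
            stalls += 1
        if stalls >= patience:
            break
    n_used += chunk  # step 3: plateaued, so append more data and repeat
```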

Is this a coherent strategy? Would this be a faster way of training, or would it introduce some variance?

2 Answers


You've more-or-less described a learning curve: a plot of the (average) performance of a model against varying amounts of training data. As you've suggested, at a certain point, additional data is not going to help you extract more signal from whatever phenomenon you're studying. Learning curves can be useful to make inferences about how much data is required for a specific problem before you reach a point of diminishing returns.

Learning curves do not include a step of "overfitting" the data, though. The way that a learning curve works is that for each size of training data, you repeat your procedure of model fitting (e.g. cross-validation to select regularization parameters) and report the best out-of-sample performance of the model selected by your procedure. You may repeat this step several times and take the average, using different training data each time. Then you repeat this process, for all training data sizes of interest.
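As a concrete illustration, scikit-learn implements this procedure in `sklearn.model_selection.learning_curve`; the toy data and estimator below are placeholders for your own model and a manageable subsample of your data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Toy data standing in for a manageable subsample of the real dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # fractions of the CV training folds
    cv=5,
    shuffle=True,
    random_state=0,
)

# Mean out-of-sample score at each training-set size; the curve flattening
# out marks the point of diminishing returns.
for n, s in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:>6} samples: mean CV accuracy {s:.3f}")
```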

Learning curves are described briefly in *The Elements of Statistical Learning* and in the question "How large a training set is needed?"

Sycorax
  • Thanks for your comment. You are correct. However, the purpose of my proposed strategy is to try to save myself some time when training. It seems like generating the learning curve (by training models with different training set sizes) would take up a lot of my time. – Lieu Zheng Hong Aug 13 '19 at 13:43
  • Suppose I do not want to generate a learning curve, but instead iteratively find the minimal number of examples needed to learn "efficiently". Would my method then be useful? – Lieu Zheng Hong Aug 13 '19 at 13:56
  • 1
    You’ll save time whenever you reach the point of diminishing returns on larger training data well before you’re training on the full data set. That is, if the point of diminishing returns occurs “early”, you can get away with training with a small amount of data. On the other hand, if you don’t study joe increasing amounts of data change the model on average, then you’re just flying blindly and don’t know if you’re using too much or too little data to have a certain level of average performance. – Sycorax Aug 13 '19 at 13:57
  • Your second comment, “iteratively find the minimal number of examples to learn efficiently,” is exactly a description of a learning curve, provided that “efficiently” means “achieve a given level of average performance.” – Sycorax Aug 13 '19 at 13:58
  • @LieuZhengHong Or we can look at this as an empirical question: have your experiments shown that your method is better (in some specific sense: faster, smaller data needs, etc) than ordinary learning curves? – Sycorax Aug 13 '19 at 14:26
  • I agree with what you said. To respond to your last comment: it seems like my method gets the key advantage of plotting a learning curve (using the least amount of data that achieves a certain level of performance), while not having to *actually* train models for the full number of epochs to plot a learning curve. Does this make sense? – Lieu Zheng Hong Aug 13 '19 at 15:22
  • Is your remark validated experimentally, or is it conjecture? It seems like the effect of your method is similar to training with a special kind of over-sampling, so your model will be very sensitive to whichever data it sees the most (as the data added last will be used the *least* to update the model). I would guess that this probably isn't desirable, because it increases variance of the learning procedure. – Sycorax Aug 13 '19 at 15:27
  • Complete conjecture, which is why I asked the question! "Would this be a faster way of training, or would it introduce some variance?" --- I wonder if you could expand upon the last bit of what you said in your comment (the last bit of data being used the least etc) into your answer, as this is exactly what I'm looking for --- possible problems of the strategy. – Lieu Zheng Hong Aug 13 '19 at 15:30
  • We often get questions of the form "I've invented a square wheel; is it a good idea?" And there are only two ways to answer that question. Either the square wheel already exists (in this case, a learning curve is very similar) or you'll have to do some work (carry out an experiment, write a proof) to show that the square wheel solves a problem. You'll have to do an experiment to figure out if your method is better than already-existing methods. I'd start with a literature review though; maybe someone already had this idea! – Sycorax Aug 13 '19 at 15:33

If your goal is to avoid training on the full massive dataset, I don't see any advantage of this method over simply training on subsamples of your larger dataset. That is, what is the advantage of training on a small, fixed subsample $X_F$ for 10 epochs rather than training on 10 different (random) subsamples of the same size, $X_1, \dots, X_{10}$? The obvious disadvantage is that you expose your model to less data.
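For example (a sketch; `train_one_epoch` and `model` are hypothetical placeholders, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n_sub = 1_000_000, 50_000

def train_one_epoch(model, idx):
    """Hypothetical stand-in for one epoch of training on rows `idx`."""
    ...

model = None  # placeholder for your network

# Proposed: one fixed subsample X_F, reused for 10 epochs.
fixed_idx = rng.choice(n_total, size=n_sub, replace=False)
for _ in range(10):
    train_one_epoch(model, fixed_idx)

# Alternative: a fresh subsample X_1, ..., X_10 each epoch; the same
# compute, but the model sees up to 10x more unique examples.
for _ in range(10):
    idx = rng.choice(n_total, size=n_sub, replace=False)
    train_one_epoch(model, idx)
```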

timchap