
Basically, there are two common ways to learn from huge datasets (when you're confronted with time/space restrictions):

  1. Cheating :) - use just a "manageable" subset for training. The loss of accuracy may be negligible because of the law of diminishing returns: the model's predictive performance often flattens out long before all the training data has been incorporated into it (see the first sketch after this list).
  2. Parallel computing - split the problem into smaller parts and solve each one on a separate machine/processor. You need a parallel version of the algorithm, but the good news is that many common algorithms are naturally parallel: nearest-neighbor, decision trees, etc. (see the second sketch after this list).
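
Here is a minimal sketch of approach 1, assuming scikit-learn and a synthetic stand-in dataset (the subset sizes and model are illustrative): train on increasingly large random subsets and watch test accuracy flatten out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a "huge" dataset.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for n in (1_000, 5_000, 20_000, len(X_train)):
    # "Cheat": train only on a random subset of size n.
    idx = rng.choice(len(X_train), size=n, replace=False)
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"trained on {n:>6} samples: test accuracy {acc:.3f}")
```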

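And a sketch of approach 2, assuming joblib for the parallelism (the shard count, estimator, and vote-based combination are illustrative choices, not the only way to parallelize): split the data into disjoint shards, fit one model per shard on a separate core, and combine predictions by majority vote.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Split the problem into smaller parts: one disjoint shard per worker.
n_shards = 5  # odd number, so a binary majority vote has no ties
shards = zip(np.array_split(X, n_shards), np.array_split(y, n_shards))

def fit_shard(X_part, y_part):
    return DecisionTreeClassifier(random_state=0).fit(X_part, y_part)

# Fit each shard's model on a separate core.
models = Parallel(n_jobs=n_shards)(
    delayed(fit_shard)(Xp, yp) for Xp, yp in shards
)

# Combine by majority vote over the per-shard predictions.
votes = np.stack([m.predict(X[:10]) for m in models])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)
```
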
Are there other methods? Is there any rule of thumb for when to use each? What are the drawbacks of each approach?

andreister

3 Answers


Stream mining is one answer. It is also called online learning or incremental learning: you process examples (or small chunks of them) one at a time and update the model as they arrive, so the full dataset never has to fit in memory.
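
A minimal sketch of the streaming idea, assuming scikit-learn's SGDClassifier and illustrative chunking (in practice the chunks would arrive from disk or a socket): the model sees each chunk once and is updated incrementally, so memory use stays constant.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
classes = np.unique(y)  # must be declared up front for incremental learning

model = SGDClassifier(random_state=0)
# Pretend the data arrives as a stream of small chunks.
for X_chunk, y_chunk in zip(np.array_split(X, 100), np.array_split(y, 100)):
    model.partial_fit(X_chunk, y_chunk, classes=classes)

print(model.score(X[:1000], y[:1000]))
```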

Atilla Ozgur

Instead of using just one subset, you could use multiple subsets as in mini-batch learning (e.g. stochastic gradient descent). This way you would still make use of all your data.
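
A minimal sketch of mini-batch SGD from scratch in NumPy (least-squares linear regression on synthetic data; the batch size and learning rate are illustrative): each pass shuffles the data and visits all of it, one cheap update per small batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
batch_size, lr = 64, 0.01
for epoch in range(5):
    order = rng.permutation(len(X))  # shuffle each pass over the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
        w -= lr * grad  # one cheap update per mini-batch

print(w)  # close to true_w after a few epochs
```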

Lucas
  • Aha that's a good point - I clarified the question. I'm interested in a scenario when you're confronted with time/space restrictions and "cannot afford" mini-batch learning. – andreister Feb 16 '12 at 07:54

Ensembles like bagging or blending -- no data is wasted, the problem automagically becomes trivially parallel and there might be significant accuracy/robustness gains.
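
A minimal sketch of the bagging idea, assuming scikit-learn (the estimator, tree count, and sample fraction are illustrative): each tree trains on a bootstrap sample of a fraction of the data, all trees can be fitted in parallel, and every row still has a chance of being used by some tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# 25 trees, each on a bootstrap sample of 10% of the data,
# fitted in parallel across all available cores.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=0.1,
    n_jobs=-1,
    random_state=0,
)
bag.fit(X, y)
print(bag.score(X[:1000], y[:1000]))
```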