
Basically, there are two common ways to learn from huge datasets (when you're confronted with time/space restrictions):

  1. Cheating :) - use just a "manageable" subset for training. The loss of accuracy may be negligible because of the law of diminishing returns: the model's predictive performance often flattens out long before all the training data has been incorporated into it (see the first sketch after this list).
  2. Parallel computing - split the problem into smaller parts and solve each one on a separate machine/processor. You need a parallel version of the algorithm, but the good news is that many common algorithms are naturally parallel: nearest-neighbor, decision trees, etc. (see the second sketch after this list).
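
Here is a minimal sketch of approach 1, assuming scikit-learn and a synthetic stand-in dataset (the subset sizes and model are illustrative): train on increasingly large random subsets and watch test accuracy flatten out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a "huge" dataset.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for n in (1_000, 5_000, 20_000, len(X_train)):
    # "Cheat": train only on a random subset of size n.
    idx = rng.choice(len(X_train), size=n, replace=False)
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"trained on {n:>6} samples: test accuracy {acc:.3f}")
```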

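And a sketch of approach 2, assuming joblib for the parallelism (the shard count, estimator, and vote-based combination are illustrative choices, not the only way to parallelize): split the data into disjoint shards, fit one model per shard on a separate core, and combine predictions by majority vote.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Split the problem into smaller parts: one disjoint shard per worker.
n_shards = 5  # odd number, so a binary majority vote has no ties
shards = zip(np.array_split(X, n_shards), np.array_split(y, n_shards))

def fit_shard(X_part, y_part):
    return DecisionTreeClassifier(random_state=0).fit(X_part, y_part)

# Fit each shard's model on a separate core.
models = Parallel(n_jobs=n_shards)(
    delayed(fit_shard)(Xp, yp) for Xp, yp in shards
)

# Combine by majority vote over the per-shard predictions.
votes = np.stack([m.predict(X[:10]) for m in models])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)
```
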
Are there other methods? Is there any rule of thumb for when to use each? What are the drawbacks of each approach?

andreister

3 Answers


Stream mining is one answer. It is also called online learning or incremental learning: you process examples (or small chunks of them) one at a time and update the model as they arrive, so the full dataset never has to fit in memory.
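
A minimal sketch of the streaming idea, assuming scikit-learn's SGDClassifier and illustrative chunking (in practice the chunks would arrive from disk or a socket): the model sees each chunk once and is updated incrementally, so memory use stays constant.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
classes = np.unique(y)  # must be declared up front for incremental learning

model = SGDClassifier(random_state=0)
# Pretend the data arrives as a stream of small chunks.
for X_chunk, y_chunk in zip(np.array_split(X, 100), np.array_split(y, 100)):
    model.partial_fit(X_chunk, y_chunk, classes=classes)

print(model.score(X[:1000], y[:1000]))
```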

Atilla Ozgur

Instead of using just one subset, you could use multiple subsets as in mini-batch learning (e.g. stochastic gradient descent). This way you would still make use of all your data.
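
A minimal sketch of mini-batch SGD from scratch in NumPy (least-squares linear regression on synthetic data; the batch size and learning rate are illustrative): each pass shuffles the data and visits all of it, one cheap update per small batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
batch_size, lr = 64, 0.01
for epoch in range(5):
    order = rng.permutation(len(X))  # shuffle each pass over the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
        w -= lr * grad  # one cheap update per mini-batch

print(w)  # close to true_w after a few epochs
```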

Lucas
  • Aha that's a good point - I clarified the question. I'm interested in a scenario when you're confronted with time/space restrictions and "cannot afford" mini-batch learning. – andreister Feb 16 '12 at 07:54

Ensembles like bagging or blending -- no data is wasted, the problem automagically becomes trivially parallel and there might be significant accuracy/robustness gains.
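
A minimal sketch of the bagging idea, assuming scikit-learn (the estimator, tree count, and sample fraction are illustrative): each tree trains on a bootstrap sample of a fraction of the data, all trees can be fitted in parallel, and every row still has a chance of being used by some tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# 25 trees, each on a bootstrap sample of 10% of the data,
# fitted in parallel across all available cores.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=0.1,
    n_jobs=-1,
    random_state=0,
)
bag.fit(X, y)
print(bag.score(X[:1000], y[:1000]))
```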