3

I have a large training set, and it is too big to apply some algorithms to within my computational limits.

What are the common methods for reducing training set size without losing a significant amount of information?

Edit:

Training examples have 3 features, and it is a 0/1 (binary) classification task.

metdos
    How large is the training set? What algorithms do you want to apply? What software are you using? Are the three features categorical or continuous; if categorical, how many values can they each take on? Finally, are your 0/1 cases relatively balanced or is one class relatively rare? All these details may help people give you more useful advice. – Anne Z. Feb 05 '12 at 14:13
  • @metdos, could you please post a few lines of your data, and/or the results of training with, say, the first 100k, second 100k ... samples? Also 3d is nice to visualize: google "visualize 3d point clouds". – denis Feb 07 '12 at 18:26
  • Features are continuous, and I have approximately 400K examples in the training set. I tried SVM with Matlab. – metdos Feb 07 '12 at 20:25
  • @metdos, linear kernel ? How good was the classification, how long did it run -- 400k x 3 doesn't seem large. – denis Feb 08 '12 at 15:11
  • @Denis, yes, it is a linear kernel. Matlab gives me an "Out of memory" error, and online it says it needs n^2 contiguous memory. Do you suggest any other tool or library (C++, Java, Python)? – metdos Feb 08 '12 at 18:47
  • @metdos, can't you train with 10k or 20k ? Then test all the rest (in linear time) to get a rough idea of how well X0 and X1 can be separated. In short, start small -- perhaps ask a separate question "how to start small with LDA and SVM". Apart from that, [scikit-learn SGD](http://scikit-learn.org/stable/modules/sgd.html) is fast. – denis Feb 09 '12 at 13:40
  • @Denis, it can barely compute with 1K examples, and it finds approximately 500 support vectors. I think that SVM is not a good choice, if I haven't made any mistakes. – metdos Feb 09 '12 at 18:12
  • You have two point clouds P0 and P1 in 3d, ~ 200k points each, is that right? If they overlap a lot, there's NO separating plane by any method (although there may well be separating non-planes). To look at the overlap visually, project P0 and P1 on the line from midpoint(P0) to midpoint(P1) and plot (see the sketch after these comments). Does that make sense, and what are mid P0 and mid P1 for your data? – denis Feb 10 '12 at 10:55
  • @Denis, they hugely overlap. What is the motivation behind projecting the points onto this line? Is there any theory related to that? – metdos Feb 11 '12 at 19:23
  • @metdos, that's LDA, [Linear_discriminant_analysis](http://en.wikipedia.org/wiki/Linear_discriminant_analysis) -- no pictures, but make your own: plot a few 2d slices of your data, e.g. xy xz yz x+y+z (in lieu of a plot in 3d). Does that show why LDA / SVM can't separate P0 / P1 ? – denis Feb 13 '12 at 10:23
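
A minimal sketch of the overlap check denis describes in the comments above: project both clouds onto the line joining the class midpoints and compare the 1-d histograms. The array names `P0` and `P1` and the random stand-in data are assumptions for illustration; substitute your own class arrays.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data: replace P0 and P1 with your (n, 3) class arrays.
rng = np.random.default_rng(0)
P0 = rng.normal(0.0, 1.0, size=(1000, 3))  # class 0 points
P1 = rng.normal(0.5, 1.0, size=(1000, 3))  # class 1 points

# Unit vector from midpoint(P0) to midpoint(P1)
direction = P1.mean(axis=0) - P0.mean(axis=0)
direction /= np.linalg.norm(direction)

# Project each cloud onto that line (a 1-d, LDA-style view of the data)
proj0 = P0 @ direction
proj1 = P1 @ direction

# Heavily overlapping histograms suggest no separating plane along this direction
plt.hist(proj0, bins=50, alpha=0.5, label="class 0")
plt.hist(proj1, bins=50, alpha=0.5, label="class 1")
plt.legend()
plt.show()
```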

2 Answers

4

The brief answer is random sampling, but the more difficult issue is determining the size of the random sample that you should use. One efficient solution to that problem is provided by progressive sampling, a method that Foster Provost, Tim Oates, and I developed in the late 1990s [1]. The approach begins with a small sample and increases the sample size according to a sampling schedule, checking at each iteration whether model accuracy has improved. We show that a geometric schedule (e.g., doubling the sample size on each iteration) is asymptotically no worse than knowing the correct sample size in advance.
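
A rough sketch of the idea in Python, not the exact algorithm from the paper: grow a random sample geometrically and stop once held-out accuracy stops improving. The estimator (scikit-learn's SGDClassifier) and the values of `n0`, `factor`, and `tol` are my assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def progressive_sample_fit(X, y, n0=1000, factor=2, tol=0.002, seed=0):
    """Fit on geometrically growing random samples until accuracy plateaus.

    n0 (initial sample size), factor (growth rate), and tol (minimum
    accuracy gain required to keep growing) are illustrative choices.
    """
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=seed)
    rng = np.random.default_rng(seed)
    prev_acc, n = -np.inf, n0
    while True:
        # Random sample of the current schedule size (capped at the full set)
        idx = rng.choice(len(X_tr), size=min(n, len(X_tr)), replace=False)
        model = SGDClassifier(loss="hinge", random_state=seed)  # linear-SVM-like
        model.fit(X_tr[idx], y_tr[idx])
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc - prev_acc < tol or n >= len(X_tr):
            return model, n, acc  # accuracy has plateaued or data is exhausted
        prev_acc, n = acc, n * factor
```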


[1] F. Provost, D. Jensen, and T. Oates (1999). Efficient progressive sampling. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://pages.stern.nyu.edu/~fprovost/Papers/progressive.ps

1

I believe you need to be more specific. What are you trying to do or classify? How many classes do you have? Usually the training set is too valuable to discard. Have you thought of reducing the dimensionality? That is usually a must-do when you have many attributes, and it will make computations much faster.
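
If you do try dimensionality reduction, here is a minimal sketch using scikit-learn's PCA; the 95% explained-variance threshold and the stand-in data are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data; replace X with your own (n_samples, n_features) array.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```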

One thing I know people have done is to select a fixed number of random data points from each class, but again, it depends on the problem. You want to make sure you don't introduce bias into your new training set.
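
A minimal sketch of that per-class random subsampling with NumPy; the function name and the n_per_class value are hypothetical.

```python
import numpy as np

def subsample_per_class(X, y, n_per_class, seed=0):
    """Draw up to n_per_class random examples from each class, without replacement."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)          # indices of this class
        keep.append(rng.choice(idx, size=min(n_per_class, len(idx)),
                               replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# e.g., X_small, y_small = subsample_per_class(X, y, n_per_class=10_000)
```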

Roronoa Zoro