3

I have a large training set, and it is too big to apply some algorithms to within my computational limits.

What are the common methods for reducing training set size without losing a significant amount of information?

Edit:

Training examples have 3 features, and it is a 0/1 (binary) classification task.

metdos
    How large is the training set? What algorithms do you want to apply? What software are you using? Are the three features categorical or continuous; if categorical, how many values can they each take on? Finally, are your 0/1 cases relatively balanced or is one class relatively rare? All these details may help people give you more useful advice. – Anne Z. Feb 05 '12 at 14:13
  • @metdos, could you please post a few lines of your data, and/or the results of training with, say, the first 100k, second 100k ... samples? Also 3d is nice to visualize: google "visualize 3d point clouds". – denis Feb 07 '12 at 18:26
  • Features are continuous, and I have approximately 400K examples in the training set. I tried SVM with Matlab. – metdos Feb 07 '12 at 20:25
  • @metdos, linear kernel ? How good was the classification, how long did it run -- 400k x 3 doesn't seem large. – denis Feb 08 '12 at 15:11
  • @Denis, yes, it is a linear kernel. Matlab gives me an "Out of memory" error, and online it says it needs n^2 contiguous memory. Do you suggest any other tool or library (C++, Java, Python)? – metdos Feb 08 '12 at 18:47
  • @metdos, can't you train with 10k or 20k ? Then test all the rest (in linear time) to get a rough idea of how well X0 and X1 can be separated. In short, start small -- perhaps ask a separate question "how to start small with LDA and SVM". Apart from that, [scikit-learn SGD](http://scikit-learn.org/stable/modules/sgd.html) is fast. – denis Feb 09 '12 at 13:40
  • @Denis, it can barely compute with 1K examples, and it finds approximately 500 support vectors. I think that SVM is not a good choice, if I haven't made any mistakes. – metdos Feb 09 '12 at 18:12
  • You have two point clouds P0 and P1 in 3d, ~ 200k points each, is that right? If they overlap a lot, there's NO separating plane by any method (although there may well be separating non-planes). To look at the overlap visually, project P0 and P1 on the line from midpoint(P0) to midpoint(P1) and plot (see the sketch after these comments). Does that make sense, and what are mid P0 and mid P1 for your data? – denis Feb 10 '12 at 10:55
  • @Denis, they hugely overlap. What is the motivation behind projecting the points onto this line? Is there any theory related to that? – metdos Feb 11 '12 at 19:23
  • @metdos, that's LDA, [Linear_discriminant_analysis](http://en.wikipedia.org/wiki/Linear_discriminant_analysis) -- no pictures, but make your own: plot a few 2d slices of your data, e.g. xy xz yz x+y+z (in lieu of a plot in 3d). Does that show why LDA / SVM can't separate P0 / P1 ? – denis Feb 13 '12 at 10:23
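
A minimal sketch of the overlap check denis describes in the comments above: project both clouds onto the line joining the class midpoints and compare the 1-d histograms. The array names `P0` and `P1` and the random stand-in data are assumptions for illustration; substitute your own class arrays.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data: replace P0 and P1 with your (n, 3) class arrays.
rng = np.random.default_rng(0)
P0 = rng.normal(0.0, 1.0, size=(1000, 3))  # class 0 points
P1 = rng.normal(0.5, 1.0, size=(1000, 3))  # class 1 points

# Unit vector from midpoint(P0) to midpoint(P1)
direction = P1.mean(axis=0) - P0.mean(axis=0)
direction /= np.linalg.norm(direction)

# Project each cloud onto that line (a 1-d, LDA-style view of the data)
proj0 = P0 @ direction
proj1 = P1 @ direction

# Heavily overlapping histograms suggest no separating plane along this direction
plt.hist(proj0, bins=50, alpha=0.5, label="class 0")
plt.hist(proj1, bins=50, alpha=0.5, label="class 1")
plt.legend()
plt.show()
```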

2 Answers

4

The brief answer is random sampling, but the more difficult issue is determining the size of the random sample that you should use. One efficient solution to that problem is provided by progressive sampling, a method that Foster Provost, Tim Oates, and I developed in the late 1990s [1]. The approach begins with a small sample and increases the sample size according to a sampling schedule, checking at each iteration whether model accuracy has improved. We show that a geometric schedule (e.g., doubling the sample size on each iteration) is asymptotically no worse than knowing the correct sample size in advance.
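
A rough sketch of the idea in Python, not the exact algorithm from the paper: grow a random sample geometrically and stop once held-out accuracy stops improving. The estimator (scikit-learn's SGDClassifier) and the values of `n0`, `factor`, and `tol` are my assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def progressive_sample_fit(X, y, n0=1000, factor=2, tol=0.002, seed=0):
    """Fit on geometrically growing random samples until accuracy plateaus.

    n0 (initial sample size), factor (growth rate), and tol (minimum
    accuracy gain required to keep growing) are illustrative choices.
    """
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=seed)
    rng = np.random.default_rng(seed)
    prev_acc, n = -np.inf, n0
    while True:
        # Random sample of the current schedule size (capped at the full set)
        idx = rng.choice(len(X_tr), size=min(n, len(X_tr)), replace=False)
        model = SGDClassifier(loss="hinge", random_state=seed)  # linear-SVM-like
        model.fit(X_tr[idx], y_tr[idx])
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc - prev_acc < tol or n >= len(X_tr):
            return model, n, acc  # accuracy has plateaued or data is exhausted
        prev_acc, n = acc, n * factor
```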


[1] F. Provost, D. Jensen, and T. Oates (1999). Efficient progressive sampling. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://pages.stern.nyu.edu/~fprovost/Papers/progressive.ps

1

I believe you need to be more specific. What are you trying to do or classify? How many classes do you have? Usually the training set is too valuable to discard. Have you thought of reducing the dimensionality? That is usually a must-do when you have many attributes, and it will make computations much faster.
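
If you do try dimensionality reduction, here is a minimal sketch using scikit-learn's PCA; the 95% explained-variance threshold and the stand-in data are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data; replace X with your own (n_samples, n_features) array.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```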

One thing I know people have done is to select a fixed number of random data points from each class, but again, it depends on the problem. You want to make sure you don't introduce bias into your new training set.
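
A minimal sketch of that per-class random subsampling with NumPy; the function name and the n_per_class value are hypothetical.

```python
import numpy as np

def subsample_per_class(X, y, n_per_class, seed=0):
    """Draw up to n_per_class random examples from each class, without replacement."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)          # indices of this class
        keep.append(rng.choice(idx, size=min(n_per_class, len(idx)),
                               replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# e.g., X_small, y_small = subsample_per_class(X, y, n_per_class=10_000)
```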

Roronoa Zoro