I am training an SVM on highly imbalanced data. I have already dealt with the imbalance and my ML pipeline works just fine. I have allocated 70% of my dataset for training, but this takes an infeasible amount of time to compute. I have read posts saying that performance doesn't necessarily grow with the amount of data and that sometimes the entire dataset isn't needed. So my question is: could just a sample of the training data be used for training? And if so, what is a reasonable proportion of that data?
2 Answers
A random sample of your data will necessarily contain less information than the complete dataset. Depending on how your covariates are distributed, it is possible that this loss could have a minimal impact on your fit, but it could also have a dramatic impact.
Consider a linear regression on a single, one-dimensional covariate, $y_i \sim N(x_i b, \sigma^2)$. If the data consist of four observations, $x_1 = x_2 = 0$ and $x_3 = x_4 = 10$, then $x_1$ and $x_2$ could perhaps be dropped without much harm to inference on $b$ or to the task of predicting $y_{new}$ for $x_{new} \in (0,10)$. However, if instead you dropped $x_3$ and $x_4$, all hope of meaningful inference for $b$ would be lost.
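To make the toy example concrete, here is a minimal sketch of the no-intercept least-squares estimate and what happens when each pair of points is dropped. The $y$ values are made up for illustration (assuming $b = 1$ plus a little noise) and are not part of the answer:

```python
import numpy as np

# Toy example from above: y_i ~ N(x_i * b, sigma^2), a regression through the origin.
x = np.array([0.0, 0.0, 10.0, 10.0])
y = np.array([0.3, -0.2, 9.8, 10.3])  # made-up values, roughly b = 1 with noise

def ols_slope(x, y):
    """OLS estimate of b in y = x*b (no intercept): sum(x*y) / sum(x^2)."""
    sxx = np.sum(x * x)
    if sxx == 0:
        return np.nan  # no nonzero x left -> b is not identifiable
    return np.sum(x * y) / sxx

print(ols_slope(x, y))          # full data: b is well estimated
print(ols_slope(x[2:], y[2:]))  # drop x1, x2 (the points at 0): the point estimate of b
                                # does not change, since x = 0 points contribute nothing to it
print(ols_slope(x[:2], y[:2]))  # drop x3, x4: only x = 0 remains, b cannot be estimated (nan)
```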
In your case, the answer is probably less obvious than in this toy regression example. The easiest way to test how much value you are getting from your observations is to test it empirically. Fit your model on a few random 5% subsets of the data, each time checking predictive accuracy on held-out observations. Then fit your model on a few random 20% subsets of your data, again checking predictive accuracy each time. If you are getting much higher predictive accuracy for the models trained on the larger cut of your data, you probably need the extra data. Rinse and repeat until your predictive accuracy is satisfactory or you run out of computational resources.
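As a concrete illustration of that procedure, here is a minimal sketch. The synthetic imbalanced dataset, the 5%/20% fractions, the balanced-accuracy metric and the SVC settings are all stand-ins for your own pipeline, not prescriptions from the answer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from sklearn.svm import SVC

# Synthetic imbalanced data in place of the real dataset, so the snippet runs end to end.
X, y = make_classification(n_samples=30_000, n_features=30, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
for frac in (0.05, 0.20):
    scores = []
    for _ in range(3):  # a few random subsets per fraction
        idx = rng.choice(len(X_train), size=int(frac * len(X_train)), replace=False)
        clf = SVC(kernel="rbf", class_weight="balanced").fit(X_train[idx], y_train[idx])
        scores.append(balanced_accuracy_score(y_test, clf.predict(X_test)))
    print(f"{frac:.0%} of training data: "
          f"balanced accuracy {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

If the 20% runs score clearly above the 5% runs, keep increasing the fraction until the curve flattens or the fits become too expensive.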
If you keep finding the extra data to be valuable when you run out of computational resources, consider fitting a simpler model (like linear regression or LARS) on the complete data. The extra data may or may not be worth the reduced model flexibility.
In case you're interested, I know of certain machine learning researchers (Tamara Broderick at MIT for instance) doing interesting work trying to find "best subsets" of data for problems like this. See this conference paper on coresets https://papers.nips.cc/paper/6486-coresets-for-scalable-bayesian-logistic-regression.pdf.

I think this related question will be helpful:
Can support vector machine be used in large data?
Specifically, training an SVM on a large amount of data (it does not even need to be that large; around 100K observations can already make it grind to a halt) is close to infeasible, because the kernel matrix has one row and column per observation and therefore grows quadratically with the dataset size.
The question then becomes whether you really need an SVM, or whether you can draw a good, representative sample from the large dataset and train the SVM on that.
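For a rough sense of scale, and as a sketch of the "representative sample" route, here is an illustrative snippet. The synthetic data, the 5,000-point subsample size and the SVC settings are assumptions for the sake of a runnable example, not part of the answer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Back-of-the-envelope for why the kernel matrix is the bottleneck:
# the full Gram matrix is n x n floats.
n = 100_000
print(f"Kernel matrix for n = {n:,}: about {n * n * 8 / 1e9:.0f} GB in float64")  # ~80 GB

# Representative-sample route: draw a stratified subsample so the class imbalance
# is preserved, then fit the SVM on that subset. Synthetic data stands in for the
# real dataset so the snippet runs end to end.
X, y = make_classification(n_samples=50_000, n_features=30, weights=[0.95, 0.05],
                           random_state=0)
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=5_000, stratify=y,
                                      random_state=0)

svm = SVC(kernel="rbf", class_weight="balanced")
svm.fit(X_sub, y_sub)
print("support vectors per class:", svm.n_support_)
```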
