
I have a very large data set that I want to perform classification tasks on. There are about 40 million instances, 16 features, and 2 classes.

I'm attempting to use scikit-learn's LinearSVC and LogisticRegression, but after several hours the processes still have not completed.

I have two questions:

  1. Is there a way to estimate the runtime of scikit-learn classification algorithms? How can I tell whether the process will complete in minutes, hours, or days?
  2. Is there an algorithm that scales exceptionally well to large data sets? Is there a library implementing it?
Gilles
MVTC
  • Your first question would be off-topic here (it is about programming in Python), while the second one is on-topic. As for guessing the runtime: run it on smaller data samples and interpolate from those to the bigger sample (notice that this does not have to be linear), see e.g. https://cs.stackexchange.com/questions/192/how-to-come-up-with-the-runtime-of-algorithms – Tim Mar 17 '16 at 08:23
  • I'd argue that the time-complexity of various algorithms could be on-topic here too. – Matt Krause Mar 18 '16 at 21:53
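
Following the suggestion in the comment above, one way to estimate the full runtime is to time the fit on a few increasing subsample sizes and extrapolate. A minimal sketch with synthetic placeholder data standing in for the real 40M x 16 set (swap in your own `X`, `y`):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (40M x 16); replace with your own X, y.
X, y = make_classification(n_samples=1_000_000, n_features=16, random_state=0)

sizes = [10_000, 50_000, 200_000, 1_000_000]
times = []
for n in sizes:
    idx = np.random.choice(len(X), size=n, replace=False)
    start = time.time()
    # dual=False is usually faster when n_samples >> n_features
    LinearSVC(dual=False).fit(X[idx], y[idx])
    times.append(time.time() - start)
    print(f"n={n:>9,d}: {times[-1]:6.1f} s")

# Fit a power law t ~ a * n^b in log-log space and extrapolate to 40M rows.
b, log_a = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"extrapolated time for 40,000,000 rows: {np.exp(log_a) * 4e7 ** b:,.0f} s")
```

The power-law extrapolation is only a rough guide, but it quickly tells you whether you are looking at minutes or days.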

2 Answers


Some suggestions

LASVM ==> http://leon.bottou.org/projects/lasvm - an online SVM. I have not used it myself.

PMSVM ==> https://sites.google.com/site/wujx2001/home/power-mean-svm - a linear SVM. A friend used it on 10M instances with 200K features.

Logistic regression with SGD ==> http://deeplearning.net/tutorial/logreg.html#logreg - the first tutorial on deeplearning.net, which shows how to implement it with Theano.
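
If you would rather stay inside scikit-learn than use Theano, SGDClassifier with a logistic loss implements the same idea (logistic regression fitted by stochastic gradient descent). A minimal sketch with placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the real 40M x 16 set.
X, y = make_classification(n_samples=500_000, n_features=16, random_state=0)

# SGD is sensitive to feature scale, so standardize first.
# On scikit-learn < 1.1 use loss="log" instead of loss="log_loss".
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="log_loss", alpha=1e-4, max_iter=5, tol=None),
)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```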

Jacques Wainer

SGDClassifier should fit your purpose. With your amount of data and the default n_iter=5, it could converge in an hour or so.

But try it on a subsample first to gauge the runtime; the same goes for any other classifier.
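
If the full matrix does not fit comfortably in memory, SGDClassifier also supports incremental training via partial_fit, so the 40M rows can be streamed in chunks. A rough sketch, assuming a hypothetical load_chunks() generator that yields (X_chunk, y_chunk) arrays read from disk:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# load_chunks() is a hypothetical generator yielding (X_chunk, y_chunk)
# NumPy arrays read from disk, e.g. one million rows at a time.

clf = SGDClassifier(loss="hinge", alpha=1e-4)  # hinge loss = linear-SVM-style objective
classes = np.array([0, 1])                     # all classes must be declared up front

for epoch in range(5):                         # roughly the n_iter=5 mentioned above
    for X_chunk, y_chunk in load_chunks():
        clf.partial_fit(X_chunk, y_chunk, classes=classes)
```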

Diego