
I have a very large data set that I want to perform classification tasks on. There are about 40 million instances, 16 features, and 2 classes.

I'm attempting to use scikit-learn's LinearSVC and LogisticRegression, but after several hours the processes still have not completed.

I have two questions:

  1. Is there a way to estimate the runtime of scikit-learn classification algorithms? How can I tell whether the process will complete in minutes, hours, or days?
  2. Is there an algorithm that scales exceptionally well to large data sets? Is there a library implementing it?
Gilles
MVTC
  • Your first question would be off-topic here (it is about programming in Python), while the second one is on-topic. As for guessing the runtime: run it on smaller data samples and interpolate from those to the bigger sample (notice that this does not have to be linear), see e.g. https://cs.stackexchange.com/questions/192/how-to-come-up-with-the-runtime-of-algorithms – Tim Mar 17 '16 at 08:23
  • I'd argue that the time-complexity of various algorithms could be on-topic here too. – Matt Krause Mar 18 '16 at 21:53
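
Following the suggestion in the comment above, one way to estimate the full runtime is to time the fit on a few increasing subsample sizes and extrapolate. A minimal sketch with synthetic placeholder data standing in for the real 40M x 16 set (swap in your own `X`, `y`):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (40M x 16); replace with your own X, y.
X, y = make_classification(n_samples=1_000_000, n_features=16, random_state=0)

sizes = [10_000, 50_000, 200_000, 1_000_000]
times = []
for n in sizes:
    idx = np.random.choice(len(X), size=n, replace=False)
    start = time.time()
    # dual=False is usually faster when n_samples >> n_features
    LinearSVC(dual=False).fit(X[idx], y[idx])
    times.append(time.time() - start)
    print(f"n={n:>9,d}: {times[-1]:6.1f} s")

# Fit a power law t ~ a * n^b in log-log space and extrapolate to 40M rows.
b, log_a = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"extrapolated time for 40,000,000 rows: {np.exp(log_a) * 4e7 ** b:,.0f} s")
```

The power-law extrapolation is only a rough guide, but it quickly tells you whether you are looking at minutes or days.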

2 Answers


Some suggestions

LASVM ==> http://leon.bottou.org/projects/lasvm - an online SVM. I have not used it myself.

PMSVM ==> https://sites.google.com/site/wujx2001/home/power-mean-svm - a linear SVM. A friend used it on 10M instances with 200K features.

Logistic regression with SGD ==> http://deeplearning.net/tutorial/logreg.html#logreg - the first tutorial on deeplearning.net, which shows how to implement it with Theano.
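
If you would rather stay inside scikit-learn than use Theano, SGDClassifier with a logistic loss implements the same idea (logistic regression fitted by stochastic gradient descent). A minimal sketch with placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the real 40M x 16 set.
X, y = make_classification(n_samples=500_000, n_features=16, random_state=0)

# SGD is sensitive to feature scale, so standardize first.
# On scikit-learn < 1.1 use loss="log" instead of loss="log_loss".
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="log_loss", alpha=1e-4, max_iter=5, tol=None),
)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```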

Jacques Wainer

SGDClassifier should fit your purpose. With your amount of data and the default n_iter=5, it could converge in an hour or so.

But try it on a subsample first to gauge the runtime; the same goes for any other classifier.
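
If the full matrix does not fit comfortably in memory, SGDClassifier also supports incremental training via partial_fit, so the 40M rows can be streamed in chunks. A rough sketch, assuming a hypothetical load_chunks() generator that yields (X_chunk, y_chunk) arrays read from disk:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# load_chunks() is a hypothetical generator yielding (X_chunk, y_chunk)
# NumPy arrays read from disk, e.g. one million rows at a time.

clf = SGDClassifier(loss="hinge", alpha=1e-4)  # hinge loss = linear-SVM-style objective
classes = np.array([0, 1])                     # all classes must be declared up front

for epoch in range(5):                         # roughly the n_iter=5 mentioned above
    for X_chunk, y_chunk in load_chunks():
        clf.partial_fit(X_chunk, y_chunk, classes=classes)
```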

Diego