I am working on a reasonably sized binary classification problem (100k observations). I extracted 60 numerical features, and the classes in the training set are well balanced. There are some significant linear patterns, but the remaining structure is clearly non-random, so I need classifiers that can capture nonlinear relationships.
I am really looking to squeeze out the best possible (estimated) accuracy, even at the cost of extra computational effort, so I am considering building an ensemble classifier.
So far, I have obtained pretty good results with:
- a random forest classifier (90% CV accuracy);
- a radial basis SVM classifier (87% CV accuracy; I am still tuning it on a finer grid).
I am now wondering whether there are other potentially interesting algorithms I could add to the mix (three would be convenient for majority voting, for example). I hope that diverse models will help reduce whatever bias remains and improve accuracy a little. Preferably I'd use algorithms available through R's caret package. I am looking at Gaussian processes right now.
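For the majority-voting part, here is a minimal base-R sketch of how three models' predicted class labels could be combined (the prediction vectors `p_rf`, `p_svm`, `p_gp` are made-up placeholders; with an odd number of voters and two classes, ties cannot occur):

```r
# Majority vote across three classifiers' predicted labels.
# Assumes the three prediction vectors are the same length and
# draw from the same two class labels.
majority_vote <- function(p1, p2, p3) {
  votes <- cbind(as.character(p1), as.character(p2), as.character(p3))
  apply(votes, 1, function(row) names(which.max(table(row))))
}

# Dummy predictions standing in for three fitted models:
p_rf  <- c("pos", "pos", "neg", "neg")
p_svm <- c("pos", "neg", "neg", "pos")
p_gp  <- c("neg", "pos", "neg", "pos")
majority_vote(p_rf, p_svm, p_gp)  # "pos" "pos" "neg" "pos"
```

In practice the inputs would come from `predict()` calls on models fitted with caret's `train()`; averaging predicted class probabilities instead of hard votes is a common alternative.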
My background in machine learning is not very theoretical; I really only have hands-on experience with SVMs, decision trees, and random forests, so the list of algorithms in caret is pretty daunting and I am having a hard time finding applied studies that compare them. I know it's hard to predict relative performance on a particular dataset, but I'm willing to burn through a few of them!