How to compare the performance of two classification methods? (logistic regression and classification trees)

Question

I'm struggling a little bit with comparing these two classification methods. Although I know it is often a bad idea to use stepwise logistic regression, I still want to perform it and analyse the difference. I had several approaches in mind. My data set contains about 2500 observations and 40 feature variables.

1. Split the data randomly into a training set and a test set, for example 80%/20%. Fit a classification tree and a stepwise logistic regression (using different information criteria) on the training set, then evaluate both on the test set.
2. Since the size of the trees and the number of feature variables selected by the stepwise regression vary, I thought it would be a good idea to run cross-validation. However, this is kind of tricky for me. Let's say I run a 5-fold CV on my 80% training data. I can evaluate my models within the cross-validation and get, for example, averaged accuracy and other performance measures for the different models (classification tree and logistic regression). But how can I use that, since I still want to evaluate the models on the test set?
3. Use all my data to run cross-validation and then take the averaged performance measures as the final results to interpret.

Are these legitimate approaches? Or at least some of them? What would you recommend? Thank you in advance for your help!

Before you go further, you need to address the cases/predictors ratio. With that many predictors in logistic regression there is a good chance that you will encounter [perfect separation](http://stats.stackexchange.com/q/45803/28500). In any event, to avoid overfitting with only 289 cases in the least-frequent class, you probably should consider no more than 289/15 or about 20 _candidate_ predictor variables. See Harrell's [rms course notes](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RmS/rms.pdf) or [book](http://www.springer.com/us/book/9783319194240) for ways to proceed. — EdM, Dec 09 '15 at 18:57

score 1 · Answer 1 · answered Dec 09 '15 at 19:03

For the amount of data you have, CV is a good approach. However, you want to run it on the entire dataset. Essentially it is going to sample the dataset, creating 5 90-10 splits (strictly speaking, 5 folds give five 80-20 splits; 90-10 splits correspond to 10-fold CV).
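To make the fold arithmetic concrete, here is a minimal sketch in plain Python of how k-fold cross-validation partitions row indices. The helper name `k_fold_indices` and the seed are assumptions made for illustration; the sample size of 2500 comes from the question.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 2500 observations, 5 folds: every observation is held out exactly once.
splits = list(k_fold_indices(n=2500, k=5))
sizes = [(len(train), len(test)) for train, test in splits]
print(sizes)
```

With these numbers each of the five splits trains on 2000 rows and evaluates on the remaining 500; setting `k=10` would give 2250/250 splits instead.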

There are also other things you must consider when comparing models, such as precision and recall. Just because a model reaches 97% accuracy does not mean it performs well if it is completely misclassifying a given class.
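The accuracy pitfall can be illustrated with a small sketch. The data are made up for the example (3 positives among 100 cases, and a classifier that always predicts the majority class), and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Accuracy, precision and recall; None where a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }

# Made-up data: 3 positives, 97 negatives, classifier always says "negative".
y_true = [1] * 3 + [0] * 97
y_pred = [0] * 100
report = summarize(y_true, y_pred)
print(report)
```

Here accuracy is 0.97 while recall for the positive class is 0, and precision is undefined because nothing was predicted positive.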

If it is a two-class classification problem, I always like to put the results into a confusion matrix as well as a ROC curve. The ROC curve is very nice because then I or the end users can decide what an appropriate rate of false positives is. I always like the example of a false positive for a CT scan being very costly, as opposed to an autodialer calling someone who may not need to be contacted.
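As a rough sketch of what a ROC curve contains, the following plain-Python code sweeps a decision threshold over predicted scores and records the false positive rate and true positive rate at each step. The labels and scores are invented for the example, and `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each distinct score is one operating point.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented labels (1 = positive) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print("AUC:", round(auc(curve), 3))
```

Each printed row is one possible operating point, so a user who can tolerate only a small false positive rate would pick a row near the left of the curve and accept the true positive rate that comes with it.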

Thank you for the input. I already created confusion matrices and ROC curves; they give a great impression of how the classifier is performing. What do you mean by 5 90-10 splits? — Patrick Balada, Dec 09 '15 at 19:56

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

If your goal is to obtain an unbiased estimate of the test accuracy of your methods, you should not perform any kind of model selection (e.g. hyperparameter optimization) while looking at results on the test set.
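One way to follow this rule is sketched below: split once, do all tuning by cross-validation inside the training portion, and touch the test set a single time at the end. Everything here is an assumption for illustration: the data are synthetic, and a toy 1-D nearest-neighbour rule with a tunable `n_neighbors` stands in for the real models (tree size or the information criterion in stepwise selection would play the same role).

```python
import random

def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:n_neighbors]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > n_neighbors)

def accuracy(train, test, n_neighbors):
    """Share of test rows whose label is predicted correctly."""
    return sum(knn_predict(train, x, n_neighbors) == y for x, y in test) / len(test)

def cv_accuracy(rows, n_neighbors, n_folds=5):
    """Mean held-out accuracy over n_folds folds of `rows`."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        fit_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit_rows, held_out, n_neighbors))
    return sum(scores) / n_folds

# Synthetic data (assumption): label 1 becomes more likely as x grows.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0, 1)
    data.append((x, int(rng.random() < x)))

# 80%/20% split; the test rows are set aside before any tuning.
train, test = data[:240], data[240:]

# Model selection uses the training portion only.
candidates = [1, 5, 15, 45]
cv_scores = {k: cv_accuracy(train, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# The test set is used exactly once, after the choice is fixed.
test_accuracy = accuracy(train, test, best_k)
print(best_k, round(test_accuracy, 3))
```

The cross-validated scores in `cv_scores` guide the choice of setting, while `test_accuracy` is the number to report, since the test rows played no part in that choice.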

Moreover, I believe that the third approach you describe (although common practice) is not legitimate. You can find more information about this here: Cross-validation misuse (reporting performance for the best hyperparameter value)

score 1 · Answer 3 · answered Mar 14 '18 at 19:17

Logistic regression is not a classification method, so in your original sense you cannot compare them. Logistic regression is a direct probability estimation model. And you are ignoring a huge literature testing stepwise regression. To compare logistic regression to a classifier, you'll have to turn the classifier into a "probability machine", then use a proper accuracy scoring rule such as the Brier score.
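For reference, the Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome, with lower values being better. The sketch below uses invented probabilities; treating the share of positives in a tree's leaf as its predicted probability is one common way to get probabilities out of a tree, and is an assumption here rather than something the answer specifies.

```python
def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# Invented outcomes and predicted probabilities for four cases.
y_true = [1, 0, 1, 0]
p_logistic = [0.8, 0.3, 0.6, 0.1]  # hypothetical logistic regression output
p_tree = [1.0, 0.5, 0.5, 0.0]      # hypothetical leaf proportions from a tree

score_logistic = brier_score(y_true, p_logistic)
score_tree = brier_score(y_true, p_tree)
print(round(score_logistic, 4), round(score_tree, 4))
```

Both models are scored on the same held-out cases, so the two numbers are directly comparable without choosing a classification threshold.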
