
I am trying to build a model that predicts which binary category a respondent belongs to (0 or 1). I have demographic variables (all categorical) and a few 10-point scale questions.

I have built a few predictive models (just for comparison) in both R and SPSS. In SPSS I have built a logistic regression model, while in R I have modeled using a decision forest. The overall accuracy in both models is around 66%. This seems good; however, the accuracy for correctly predicting those in group 1 (i.e., of the respondents actually in group 1, the proportion the model places in group 1) is only around 27%. The number of respondents in group 0 is larger than the number in group 1, which is why there is such a large difference between the two accuracies (66% and 27%).
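To make the two numbers concrete, here is roughly how I'm computing them from the confusion matrix (the counts below are illustrative, not my actual data):

```r
# Illustrative confusion matrix (made-up counts, not my actual data).
# Rows = actual class, columns = predicted class.
conf <- matrix(c(529,  66,   # actual 0
                 255,  95),  # actual 1
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("0", "1")))

sum(diag(conf)) / sum(conf)        # overall accuracy, ~0.66
conf["1", "1"] / sum(conf["1", ])  # accuracy within group 1, ~0.27
```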

In the data set, about 37% of the sample makes up group 1. I'm wondering if there is some way to increase the 27%, as the model is currently useless (i.e. I need to know more accurately those in group 1). Or is there another method of modeling that I should be using?

Any help at all would be greatly appreciated!

Thanks!

Jordan

2 Answers


There are perhaps two separate issues:

(1) Logistic models are not classification models - they are probability-based and inherently different. In that sense accuracy is a poor measure for such models, and skewed outcome data are handled differently. Specifically, the model outputs a probability rather than a class, and you can change accuracy by changing the probability cutoff you use to classify outcomes. Many stats programs assume a 50% probability cutoff, but that is arbitrary; you should choose your own cut point depending on your needs.
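For instance, in R this could look like the following (a minimal sketch; `df`, `outcome`, and the predictor names are placeholders for your own data):

```r
# Fit a logistic regression (df, outcome, and the predictors are placeholders).
fit <- glm(outcome ~ age_group + region + q1, data = df, family = binomial)

# The model outputs probabilities, not classes.
p <- predict(fit, type = "response")

# The usual 50% cutoff is arbitrary; a lower one catches more of class 1
# (here the class-1 base rate of 0.37, purely as an example).
pred_50 <- ifelse(p > 0.50, 1, 0)
pred_37 <- ifelse(p > 0.37, 1, 0)

table(actual = df$outcome, predicted = pred_37)
```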

(2) Is a decision forest a random forest? If so, there are many ways of dealing with skewed data - this is easily a book chapter or more. Max Kuhn does a reasonable job covering this in his book Applied Predictive Modeling. The code is freely downloadable from CRAN and, I think, the book's website. So it is difficult to provide a complete answer here. The short answer is that this is most often dealt with by undersampling the majority class when building the random forest. The Balanced Random Forest is a simple approach to undersampling, but there are many possible approaches. The linked paper also has a summary of alternative approaches.
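As a sketch of the undersampling idea using the randomForest package in R (`df` and `outcome` are placeholders; the outcome must be a factor):

```r
library(randomForest)

df$outcome <- factor(df$outcome)
n_min <- min(table(df$outcome))  # size of the minority class

set.seed(1)
rf <- randomForest(outcome ~ ., data = df,
                   strata = df$outcome,         # stratify sampling by class
                   sampsize = c(n_min, n_min),  # equal draws from each class
                   ntree = 500)
rf$confusion  # per-class error, including the minority class
```

Each tree then sees a balanced sample, which typically trades some overall accuracy for much better accuracy on the minority class.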

charles
  • Hi Charles! Thank you so much for your reply and insight. I definitely see what you're saying about the Logistic model. I was thinking about the cutoff point as well...just not sure what it should be. This is something that I need to look into. I'm actually not sure if a decision forest is a random forest. I will look into this. – Jordan Feb 02 '15 at 15:54

IMHO you have a skewed training data set. Look here: How to handle skewed binary target variables?

And google for "dealing with skewed classes".

404pio
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Sven Hohenstein Jan 31 '15 at 00:58
  • @SvenHohenstein You are right for new questions, but I think this one is a duplicate, which is why I posted a one-line answer. – 404pio Feb 03 '15 at 20:18