I have a data set with a binary response variable and about 30,000 observations of 8 features, some continuous and some categorical.

This is an imbalanced data set: the ratio of negatives to positives is about 5:1, so the null accuracy (always predicting the negative class) is about 5/6 ≈ 84%.
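As a quick sanity check, the null accuracy follows directly from the class ratio. A minimal sketch, using the 5:1 ratio from the question:

```python
# Null accuracy of the majority-class baseline for a 5:1
# negative-to-positive ratio (numbers taken from the question).
negatives, positives = 5, 1
null_accuracy = negatives / (negatives + positives)
print(f"{null_accuracy:.1%}")  # about 83.3%, i.e. roughly the ~84% quoted
```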

I know that for imbalanced data sets accuracy is usually not a good metric, but in my context high accuracy is desirable, and because the imbalance is not extreme I think improvement is possible. I would like to shoot for at least 90% accuracy.

I have tried various feature engineering techniques and machine learning models, but I am not even able to hit 86%. For example, decision trees, logistic regression and random forests all give about 85.6% to 85.8% accuracy. I used cross-validation to fine-tune hyperparameters and checked training accuracy to make sure there was no overfitting.
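The workflow described above can be sketched as follows. This is a hedged illustration only: the real data set is not available, so `make_classification` stands in for it with the same rough shape (30,000 rows, 8 features, ~5:1 imbalance), and the models and cross-validation mirror what the question reports.

```python
# Sketch of the reported experiment on synthetic stand-in data:
# compare cross-validated accuracy of several models against the
# majority-class baseline.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly the question's shape: 30,000 rows,
# 8 features, about 5:1 negative-to-positive imbalance.
X, y = make_classification(n_samples=30_000, n_features=8,
                           weights=[5 / 6], random_state=0)

models = {
    "baseline (always negative)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```

On the synthetic data the exact numbers will differ from the question's 85.6%-85.8%; the point is the comparison structure, not the values.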

What could be the reasons for getting such a marginal improvement in accuracy over a dumb model?

Ali
  • "because the imbalance is not extreme I think it is possible to improve" - whether your data are imbalanced or not has no bearing on whether you can improve on a simple model. You can't improve on 50% accuracy for predicting a coin toss, no matter how much (balanced) data you have. You can't improve on 16.7% accuracy for predicting a die roll, no matter how much (imbalanced) data you have. – Stephan Kolassa Jan 20 '19 at 17:19
  • Thanks for the link and comment Stephan. My point was that it's not like the null accuracy is 99.9%. Since you mentioned prediction of completely random events, are you implying that the response variable is almost random with the current set of features? – Ali Jan 20 '19 at 18:18
  • What leads you to believe that your features are strongly predictive of your outcome? – Sycorax Jan 20 '19 at 18:49
  • Anyhow, accuracy is not a proper scoring rule. See https://stats.stackexchange.com/questions/359909/is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting – kjetil b halvorsen Jan 20 '19 at 20:08
  • Ali, what @Stephan and Sycorax are asking is why you think you can do better. Are you sure the variables are suitably discriminatory? (No one else can decide that for you.) – seanv507 Jan 20 '19 at 20:27
  • Sorry for the late response. Using other metrics such as AUC I was able to come up with a good model (AUC > 0.8). Thank you very much! – Ali May 19 '19 at 20:19
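The final comment mentions switching from accuracy to AUC. The thread does not show how that was done, but a minimal sketch of cross-validated ROC AUC with scikit-learn (again on synthetic stand-in data, since the real set is unavailable) would be:

```python
# Sketch (assumed, not from the thread): evaluating a classifier by
# cross-validated ROC AUC instead of accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=30_000, n_features=8,
                           weights=[5 / 6], random_state=0)

auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc").mean()
print(f"mean cross-validated AUC: {auc:.3f}")
```

Unlike accuracy, AUC is insensitive to the 5:1 class imbalance, which is why it can reveal a usable ranking model even when accuracy barely beats the majority-class baseline.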

0 Answers