
Usually the baseline accuracy and the initial model accuracy differ, and the goal is then to clean the data, do some feature engineering, etc., to build a model that performs better than the baseline. Everyone, including my professor, says the goal is to build a model that performs better than the stupid model/baseline.

My baseline accuracy is 95.13%. My CART model is also at the exact same performance. In fact, any model I throw at the dataset gives the same accuracy. My target (binary stroke outcome) is highly imbalanced (95% [outcome 0.0] / 5% [outcome 1.0]).

When I compute the baseline before feature engineering, the accuracy of my CART model (and any other model) is 95.13%. After feature engineering, it is still 95.13%.

Is it a coincidence that the target imbalance is also 95%? Not a coincidence, right?

Building exploratory models such as KNN, logistic regression, C5, CART, and NN, they all underperform compared with the baseline of 95.13%. Before feature engineering these models perform in roughly the 70%-75% range, and after feature engineering 75%-82%.

Naturally I am performing this baseline analysis without any balancing since the point is to build a stupid model as a benchmark.

So is it OK if my models, even after feature engineering, do not perform as well as the benchmark? How would I explain this?
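
For concreteness, the "stupid model" I have in mind is just a majority-class predictor. I'm doing all of this in SPSS Modeler, but a rough Python/scikit-learn sketch (with made-up stand-in data, not my actual stream) of why its accuracy matches the class ratio would be:

```python
# Rough sketch only (my real work is in SPSS Modeler): a baseline that always
# predicts the majority class. With a ~95/5 split, its accuracy equals the
# majority-class proportion by construction, which is the pattern I keep seeing.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y = (rng.random(5110) < 0.05).astype(int)   # ~5% positive outcomes, like my target
X = rng.normal(size=(5110, 10))             # placeholder features

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # roughly the majority-class proportion
```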

Edison
  • If you have an imbalance, accuracy is probably not a good metric to use. You should look more closely at precision and recall. I would suggest MCMC resampling of the minority class. – Kat Aug 29 '21 at 02:55
  • I downsampled the majority outcome. You're suggesting upsampling of the minority outcome using MCMC. SPSS Modeler doesn't have MCMC, so I just used its vanilla upsampling of the minority outcome. Remarkably, I got 98.5%!! Even higher than the baseline of 95%. Thanks! But here we are talking about balancing the target. What about the issue that my baseline before/after feature engineering is 95%, and so is any model I throw at it? Everything is always 95%. Why is this? And why did upsampling work better than downsampling? P.S. You should write an answer so I can mark it as correct :) – Edison Aug 29 '21 at 03:18
  • p.s. my flow is clean->feature engineer->baseline->balance->model. Not sure if this matters. – Edison Aug 29 '21 at 03:42
  • One other question. Why do we balance the target but not the inputs? – Edison Aug 29 '21 at 03:47
  • [Class imbalance is not such a problem when you use proper statistical methods, and balancing the classes won’t solve a non-problem.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) // Yes, it’s a problem if your fancy modeling cannot beat the silly model. In fact, $R^2$ in linear regression is, in some sense, measuring the amount by which you beat a silly model. – Dave Aug 29 '21 at 04:19
  • @Dave Thanks :) Btw, what about the issue that my baseline before/after feature engineering is 95%, and so are all flavours of models I throw at it? Everything is always 95%. I thought that, regardless of whether it's before or after feature engineering, the baseline accuracy and the model accuracy should always be different. Is it because in my case the imbalance is so high that the models just automatically predict/classify the majority outcome? – Edison Aug 29 '21 at 05:27
  • The stupid model that predicts only 0 gets 95%, so this suggests there is little signal in the data. That is, there are no variables that are predictive of the outcome. Also, rare outcomes (especially rare deaths/diseases) are notoriously difficult to predict. It is much more useful to look at the conditional odds ratios or relative-risk relations between the outcome and the baseline variables and to identify high-risk regions of the covariate space. Also, to make sure you are not wasting your time, run a logistic regression and check whether any coefficients are significant in both effect size and p-value (a sketch of this check follows the comments). – LarsvanderLaan Aug 29 '21 at 05:41
  • Also plot a histogram of your predictions. Are virtually all predictions at 0.5? Or is there some variance? – LarsvanderLaan Aug 29 '21 at 05:42
  • @LarsvanderLaan Thank you. `"...more useful to look at the conditional odds ratios or relative risk relations between the outcome and baseline variables"` means issues like multicollinearity? And `"...identify high risk regions of the covariate space"`, means inputs that are not significant? --- plotting a histogram it was flat with everything at 5110 (sample size). image here >> https://i.imgur.com/7Yt7BCh.png – Edison Aug 29 '21 at 06:22
  • Logistic Regression screenshot >> https://i.imgur.com/hJfya7g.png – Edison Aug 29 '21 at 06:27
  • Can you add more bins to the histogram (e.g. one bin each for 0.95, 0.96, 0.97, etc.)? How many unique values do you have? No, I mean look at the conditional odds ratios or relative-risk predictions/coefficients, as they are more informative than the raw predictions. You can check whether a given variable has a large conditional odds ratio or relative risk with the outcome, which would be useful for assessing whether the variable is predictive of the risk of stroke (i.e. your outcome). What is your goal exactly? To predict stroke? Or to develop a risk score? – LarsvanderLaan Aug 29 '21 at 06:27
  • Goal is to predict probability of having a stroke. Target variable is flag/binary [0,1]. I'm currently using C5, CART, KNN. For my features and target upsampling, those 3 seem to have the highest accuracy. Amazingly, I'm forced to use SPSS Modeler at university instead of Python. – Edison Aug 29 '21 at 06:35
  • I mistakenly ran a histogram on the CART model before; that's why it was flat. Here is a histogram for the logistic regression model, before feature engineering and before balancing. https://i.imgur.com/cysL5Xq.png – Edison Aug 29 '21 at 06:39
  • Do not use accuracy. [This thread](https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models) has already been mentioned. – Stephan Kolassa Aug 29 '21 at 08:44
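
A minimal sketch of the coefficient/odds-ratio check suggested in the comments, in Python/statsmodels rather than SPSS Modeler, on synthetic stand-in data (the column names are only illustrative):

```python
# Sketch of the check suggested in the comments: fit a plain logistic regression
# and inspect coefficient significance and conditional odds ratios rather than
# raw accuracy. Synthetic stand-in data; in practice `df` would be the cleaned
# stroke dataset and the columns would be the real predictors.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(20, 90, 5110),
    "avg_glucose_level": rng.normal(100, 40, 5110),
    "stroke": (rng.random(5110) < 0.05).astype(int),
})

X = sm.add_constant(df[["age", "avg_glucose_level"]])
fit = sm.Logit(df["stroke"], X).fit()

print(fit.summary())        # effect sizes and p-values per predictor
print(np.exp(fit.params))   # conditional odds ratios
```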

1 Answer


With unbalanced data where 95% of the cases are zeros, you get 95% accuracy simply by always predicting zero, so it is almost certainly not a coincidence. If you can't beat that, your models are rather useless. If you have tried different models and are never able to beat the benchmark, maybe you just don't have relevant data that would allow you to make meaningful predictions. Did you explore the predictions? Are you sure there are no bugs in reading and preprocessing the data, and that the labels are correct?
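
For example, here is a rough Python/scikit-learn sketch of that check on synthetic data with no real signal (replace it with your own fitted model and held-out set):

```python
# Rough sketch: "explore the predictions" by counting how often each class is
# actually predicted. With no signal in the features, a tree tends to predict
# only the majority class. Synthetic stand-in data, not the real stroke dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = (rng.random(5110) < 0.05).astype(int)    # ~5% positives
X = rng.normal(size=(5110, 10))              # features carrying no information

cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(np.bincount(cart.predict(X), minlength=2))  # likely almost all zeros
```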

On the other hand, as others have already mentioned, accuracy is a rather poor metric. What exactly do you need this model for? Maybe you care more about something like recall, and there are differences between the models there? Of course, I'm not suggesting you cherry-pick metrics. Also, if you use a different metric, you should still compare it to some benchmark.
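
For instance, a rough sketch (Python/scikit-learn, synthetic stand-in data) of comparing a model against the always-majority benchmark on precision and recall rather than raw accuracy:

```python
# Sketch: compare precision/recall of a model against the always-majority
# baseline instead of comparing raw accuracies. Synthetic stand-in data;
# replace with your actual train/test split and fitted models.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
y = (rng.random(5110) < 0.05).astype(int)
X = rng.normal(size=(5110, 10))

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
model = LogisticRegression(max_iter=1000).fit(X, y)

print(classification_report(y, baseline.predict(X), zero_division=0))
print(classification_report(y, model.predict(X), zero_division=0))
```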

Tim
  • Thanks Tim. As explained above in one of my comments, thanks to Kat's suggestion I was able to beat the baseline. 95.13% >> 98.69% using a C5 model after switching from downsampling the majority outcome to upsampling the minority outcome. It's a reliable dataset taken from Kaggle and has been used in many Python guides. My professor made us use SPSS Modeler (cringe). I cleaned it thoroughly and engineered new features that improved performance. Kat also mentioned measuring precision and recall. – Edison Aug 29 '21 at 07:40