
I have multiple datasets of comparable shape, and I want to train a separate binary classifier for each of them. These datasets have two problems:

  1. Too many dimensions: e.g., one dataset has 160 datapoints but 90 dimensions.
  2. Imbalance: some of these datasets have significantly more datapoints in the first class than in the second (or vice versa).

I split the data into a 90% training and a 10% test set and trained a basic ridge regression classifier. On most datasets I get almost 100% training accuracy but only ~50% test accuracy. On some datasets I get a higher test accuracy, until I realise that the model has simply learned to always guess the larger of the two classes.
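For concreteness, here is a rough sketch of what I did, using synthetic stand-in data of the same shape and scikit-learn's `RidgeClassifier` as a stand-in for the exact estimator:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one dataset: 160 points, 90 dimensions, imbalanced labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 90))
y = rng.choice([0, 1], size=160, p=[0.75, 0.25])

# 90% / 10% split, then a basic ridge regression classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
clf = RidgeClassifier().fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```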

Does there exist a classifier that automatically deals with those problems, namely:

  1. Tries not to overfit (e.g. by means of intrinsic feature selection)
  2. Tries to have comparable performance for both classes, even if one class has fewer datapoints than the other.
Aleksejs Fomins

1 Answer


A ridge regression (I presume logistic?) or any other regularized method (e.g., a variant of the Elastic Net) would be a reasonable way to start.
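In scikit-learn terms, such a start could look roughly like the sketch below (elastic-net-penalized logistic regression; the data are placeholders and the parameter values are only illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for one of your training splits.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(144, 90))
y_train = rng.choice([0, 1], size=144, p=[0.75, 0.25])

# Elastic-net-penalized logistic regression: the L1 part gives intrinsic
# feature selection, the L2 part stabilizes the fit in high dimensions.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet",
        solver="saga",   # the solver that supports the elastic-net penalty
        l1_ratio=0.5,    # L1/L2 mix; tune it
        C=1.0,           # inverse regularization strength; tune it
        max_iter=10_000,
    ),
)
model.fit(X_train, y_train)
```

Importantly, `predict_proba` on such a model gives predicted class-membership probabilities rather than hard labels, which is exactly what you need for the scoring rules below.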

Your accuracy problem will go away if you use proper scoring rules to assess quality instead of accuracy.
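A sketch of what that looks like with scikit-learn's implementations of the logarithmic and Brier scores (the labels and probabilities here are placeholders; in practice take the probabilities from `predict_proba` on held-out data):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# y_test: true 0/1 labels; probs: predicted probabilities of class 1.
y_test = np.array([0] * 12 + [1] * 4)
probs = np.random.default_rng(0).uniform(size=16)

# Both of these proper scoring rules are implemented as losses: lower is better.
print("logarithmic score (log loss):", log_loss(y_test, probs))
print("Brier score:", brier_score_loss(y_test, probs))
```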

Unbalanced classes are not a problem if you don't use accuracy: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
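To see why, compare an "always guess the majority class" baseline under accuracy versus under a proper scoring rule (a sketch with placeholder data; `DummyClassifier(strategy="prior")` predicts the majority class and returns the class prior as its probability):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, log_loss

# Placeholder imbalanced data: 75% of points in class 0.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(144, 90)), np.array([0] * 108 + [1] * 36)
X_test, y_test = rng.normal(size=(16, 90)), np.array([0] * 12 + [1] * 4)

baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)

# Accuracy looks decent (75% here), but the log loss only reflects the base
# rate; any model with real signal should beat this reference value.
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("baseline log loss:", log_loss(y_test, baseline.predict_proba(X_test)))
```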

Stephan Kolassa
  • I think I used a basic linear ridge regression (not logistic), but that was just the first method I saw on the scikit-learn website. The true relationship may be complicated and nonlinear, so I will probably try a few different things, maybe a random forest and a deep NN. Thanks a lot for the links, I will read them and get back to you if there is still a problem. – Aleksejs Fomins Feb 27 '20 at 13:23
  • Non-logistic regression makes little sense for a classifier. Try a regularized logistic regression, like [glmnet](https://cran.r-project.org/package=glmnet), of which I assume there exists a scikit-learn variant. – Stephan Kolassa Feb 27 '20 at 13:25
  • Stephan, I have scanned through the links you have shared. I must admit, I would have to invest several days to make sure I understand scoring rules, so I would like to estimate whether the effort is worth it at this stage. 1) If I pick a proper loss function, I still need a method that minimizes it, right? Does that mean I can't use standard methods like ridge regression and need a special method? 2) Could you also comment on the overfitting part of the question, if possible? – Aleksejs Fomins Feb 27 '20 at 19:57
  • The idea would be to use a classifier that outputs predicted class membership *probabilities*, not hard single classes. Logistic regression does that. Then use the logarithmic score or the Brier score to assess these predictions compared to the actuals (you can find the two scores on the Wikipedia page). There are actually different definitions of the scores that only differ in a minus sign, so you will need to be careful as to whether you need to maximize or minimize it. As to overfitting, that's where I recommend a regularized method like logistic ridge regression or glmnet. Good luck! – Stephan Kolassa Feb 28 '20 at 06:49
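For completeness, here is one possible sketch of the workflow discussed in the comments, in scikit-learn, with `LogisticRegressionCV` playing roughly the role of `cv.glmnet` and the logarithmic score used both for tuning and for the final assessment (the data and all parameter values are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for one 160 x 90 dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 90))
y = np.array([0] * 120 + [1] * 40)

# L1-penalized logistic regression with the regularization strength chosen by
# inner cross-validation on the logarithmic score (roughly what cv.glmnet does).
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=10,
        penalty="l1",
        solver="saga",
        scoring="neg_log_loss",
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        max_iter=10_000,
    ),
)

# Outer cross-validation gives an estimate of the logarithmic score that is
# more informative than a single 90/10 split on 160 points.
scores = cross_val_score(
    model, X, y, scoring="neg_log_loss",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
print("cross-validated log loss:", -scores.mean())
```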