I have multiple datasets of comparable shape, and I want to train a separate binary classifier for each of them. These datasets have two problems:
- Too many dimensions: e.g. one dataset has 160 datapoints and 90 dimensions.
- Imbalance: some of these datasets have significantly more datapoints of one class than of the other.
So far I have split each dataset into 90% training and 10% test data and trained a basic ridge regression classifier. On most datasets I get almost 100% training accuracy and ~50% test accuracy. On some datasets the test accuracy is higher, but on closer inspection the classifier has simply learned to always guess the larger of the two classes.
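For reference, here is a minimal sketch of that setup using scikit-learn. The data is a synthetic stand-in, not one of my real datasets: random features, labels unrelated to the features, and an assumed imbalance of roughly 80/20. The stratified split, the `alpha` value and the majority-class baseline are likewise assumptions added for illustration. Comparing plain accuracy with balanced accuracy against that baseline is one way to notice when a classifier is only guessing the larger class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one dataset: 160 datapoints, 90 dimensions.
# Labels are independent of the features, with roughly 20% in class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 90))
y = (rng.random(160) < 0.2).astype(int)

# 90% / 10% split; stratify=y (an assumption) keeps both classes in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Basic ridge classifier, as described above (alpha chosen arbitrarily).
clf = RidgeClassifier(alpha=1.0).fit(X_train, y_train)
pred = clf.predict(X_test)

# Baseline that always predicts the larger class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
base_pred = baseline.predict(X_test)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, pred)
test_bal_acc = balanced_accuracy_score(y_test, pred)
base_acc = accuracy_score(y_test, base_pred)
base_bal_acc = balanced_accuracy_score(y_test, base_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred, labels=[0, 1])

print(f"ridge:    train acc {train_acc:.2f}, test acc {test_acc:.2f}, "
      f"test balanced acc {test_bal_acc:.2f}")
print(f"baseline: test acc {base_acc:.2f}, test balanced acc {base_bal_acc:.2f}")
print(cm)
```

The always-majority baseline has a balanced accuracy of exactly 0.5 by construction, so a test accuracy that looks good while balanced accuracy stays near 0.5 points to the majority-class guessing described above.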
Is there a classifier that deals with these problems automatically, namely one that:
- Tries not to overfit (e.g. by means of intrinsic feature selection)
- Tries to achieve comparable performance for both classes, even if one class has fewer datapoints than the other?
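To make the two requirements concrete, here is a rough sketch of the kind of thing I have in mind, again on synthetic stand-in data. It is only a guess at a candidate, not something I have validated: a linear SVM with an L1 penalty (which can drive coefficients to exactly zero, acting as a form of built-in feature selection) combined with `class_weight="balanced"`, scored by balanced accuracy under stratified cross-validation. The value of `C`, the number of folds and the scaling step are all arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 160 datapoints, 90 dimensions, ~20% in class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 90))
y = (rng.random(160) < 0.2).astype(int)

# L1 penalty encourages sparse coefficients; class_weight="balanced"
# reweights errors inversely to class frequency.
model = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l1", dual=False, class_weight="balanced",
              C=0.1, max_iter=10000),
)

# Stratified folds keep the class ratio similar in every split;
# balanced accuracy averages the recall of the two classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")

# Fit once on all data to see how many features keep a nonzero weight.
model.fit(X, y)
n_selected = int(np.count_nonzero(model[-1].coef_))

print("balanced accuracy per fold:", np.round(scores, 2))
print("features with nonzero weight:", n_selected, "of", X.shape[1])
```

Since the labels here are random, the scores should not be read as meaningful; the point is only to show the shape of the approach I am asking about.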