I have a machine learning problem that I am tackling with binary logistic regression. My needles (the positive class) occur at a rate of about 4%. Following [1], and with ~50 variables, I conclude that I need about 300 needles.
Unfortunately I don't have enough items of hay to build a training set with the correct ratio of hay to needles (i.e. 96 to 4).
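To make the shortfall concrete, here is the arithmetic as a small sketch. It assumes only the figures above (a 4% needle rate and a target of 300 needles); the variable names are mine.

```python
# Figures taken from the question: ~4% needles, ~300 needles wanted.
needles_needed = 300
needle_rate_percent = 4

# Total sample size at which 300 needles make up 4% of the data.
total_needed = needles_needed * 100 // needle_rate_percent

# Everything that is not a needle is hay.
hay_needed = total_needed - needles_needed

print(f"total items: {total_needed}, hay items: {hay_needed}")
```

So a sample at the natural ratio would need 7,200 items of hay alongside the 300 needles.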
Is there a sensible way around this?
At the moment I am considering building a sample with a deliberately skewed ratio (90/10) and then adjusting the 'weights' parameters of the algorithm [2]. (This doesn't work; see comment.)
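For reference, this is roughly the reweighting I had in mind, written as a minimal sketch rather than a tested recipe. It assumes each class is weighted by (population share / sample share) so that a 90/10 sample counts like a 96/4 population; the sample sizes of 900 and 100 are made up for illustration. My understanding is that a dict of this shape is what the `class_weight` parameter in [2] accepts, but treat that as an assumption.

```python
# Class shares in the population (from the question) and in the skewed sample.
population_share = {"hay": 0.96, "needle": 0.04}
sample_share = {"hay": 0.90, "needle": 0.10}

# Weight each class by how under- or over-represented it is in the sample.
class_weight = {
    label: population_share[label] / sample_share[label]
    for label in population_share
}

# Sanity check on a hypothetical 90/10 sample of 1,000 items:
# after weighting, needles should account for ~4% of the total weight.
sample_counts = {"hay": 900, "needle": 100}
weighted = {label: sample_counts[label] * class_weight[label] for label in sample_counts}
weighted_needle_share = weighted["needle"] / sum(weighted.values())

print(class_weight)
print(round(weighted_needle_share, 4))
```

With these weights the needles are down-weighted (about 0.4) and the hay slightly up-weighted (about 1.07).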
All help appreciated.
[1] How large a training set is needed?
[2] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html