Working around rare data - Logistic Regression

Question

I have a machine learning problem that I am tackling with binary logistic regression. My needles occur at a rate of about 4%. Following on from [1] and with ~50 variables I conclude that I need about 300 needles.

Unfortunately I don't have enough items of hay to build a training set with the correct ratio of needles to hay (i.e. 4 to 96): 300 needles at a 4% rate would need about 7,200 items of hay.

Is there a sensible way around this?

At the moment I am considering building a sample with the wrong balance (90/10) and then compensating through the 'weights' parameters of the algorithm [2]. (This doesn't work; see comment.)

All help appreciated.

[1] How large a training set is needed?

[2] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html

score 3 · Accepted Answer · answered Aug 03 '16 at 14:29

There are several possible workarounds. The simplest, although not necessarily the best, is to use Poisson regression instead of logistic regression. The reasoning behind this is that the Poisson model was "designed" for rare events.

Gary King discusses case control, bias corrections and adjusting causal models for rare event data in the context of political outcomes. He has published several papers as well as a software tool (ReLogit) specifically for this purpose. http://gking.harvard.edu/category/research-interests/methods/rare-events
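As a rough illustration of the kind of correction King describes for case-control samples, here is a minimal Python sketch of the "prior correction" of the intercept from King and Zeng's rare-events work: fit the logit on the over-sampled data, then subtract ln[((1 - tau) / tau) * (ybar / (1 - ybar))] from the fitted intercept, where tau is the event rate in the population and ybar is the event rate in the sample. The 4% and 10% rates are taken from the question; the function names, the fitted intercept and the coefficients are made up for the example, and this is not ReLogit itself.

```python
import math


def prior_corrected_intercept(intercept, sample_rate, population_rate):
    """Shift a logit intercept fitted on a sample that over-represents events.

    sample_rate     -- fraction of events in the training sample (ybar)
    population_rate -- fraction of events in the population (tau)
    """
    offset = math.log(
        ((1 - population_rate) / population_rate)
        * (sample_rate / (1 - sample_rate))
    )
    return intercept - offset


def predict_proba(intercept, coefs, x):
    """Ordinary logistic prediction for one observation x."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


# Rates from the question: needles are 4% of the population,
# but the training sample is built 90/10.
tau = 0.04
ybar = 0.10

# Hypothetical fit on the 90/10 sample (illustrative numbers only).
fitted_intercept = -2.0
coefs = [0.8, -0.5]

corrected = prior_corrected_intercept(fitted_intercept, ybar, tau)

x = [1.0, 2.0]  # hypothetical observation
print("uncorrected:", predict_proba(fitted_intercept, coefs, x))
print("corrected:  ", predict_proba(corrected, coefs, x))
```

Only the intercept changes; the slope coefficients are used as fitted. As a sanity check, an intercept-only model fitted on the 90/10 sample predicts 0.10 for every item, and after the correction it predicts 0.04, the population rate.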

Finally, Paul Allison, founder of the training institute Statistical Horizons and one of the best writers on statistical issues out there, cites King's article in formulating his views on the subject. His opinion is that,

The problem is not specifically the rarity of events, but rather the possibility of a small number of cases on the rarer of the two outcomes. If you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you’re golden.

He offers a course for dealing with these issues as well as a useful discussion here ... http://statisticalhorizons.com/logistic-regression-for-rare-events

The equation on page 8 of http://gking.harvard.edu/files/gking/files/baby0s.pdf - worked the charm for me. Thank you. — draco_alpine, Aug 04 '16 at 08:17
