Rare event bias techniques

Question

This is my first question, so I apologize in advance for any stupidity or "unawareness". I have a large sample of roughly 88,000 observations, but the events in this sample (the 1's) make up only about .00072% of it.

I'm fairly sure that my sample suffers from rare event bias, so I am using the logistf function to fit a logistic model, but I'm not sure that this is the best method. I've read the standard King and Zeng paper, yet I am still getting some unusual results: variables that I expected to be significant are not coming out that way. In addition, the df reported by lrtest and extractAIC are really small, between 5 and 7 for every model that I have run.

Sorry, I can't provide screenshots or results. This is work data, so I'm not sure that I can share it.

I think this is something more suitable to cross-validated. Have flagged. — Heroka, Oct 16 '15 at 16:53
You're probably going to have to provide more information. Can you produce a [reproducible example](http://tinyurl.com/reproducible-000), e.g. by simulating data that *looks* like your data set? By my computation, you have approx. 64 positive values in your response, so you need to be aware of the rules of thumb for model complexity (e.g. Frank Harrell's) -- you shouldn't be trying to fit a model with more than (64/(10 to 20)) = 3-6 parameters, unless you're using shrinkage methods ... — Ben Bolker, Oct 16 '15 at 17:14
Do you mean .072%? (.00072% leads to less than one positive observation) — Tchotchke, Oct 16 '15 at 17:34
This is my issue with a reproducible example: "the data is sometimes the limiting factor, as the structure may be too complex to simulate." I have been sticking to under 6 parameters. Usually 3 or 4. I spent two days going over data integrity, and have cleaned it up as much as possible. Still just doesn't seem to be working. — In over my head, Oct 16 '15 at 18:00
Consider "Importance Sampling" as an approach; might be worth skimming thru "Numerical Recipes in C" to see if there's an algorithm there that can help. — Carl Witthoft, Oct 16 '15 at 18:39
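
For reference, the arithmetic behind the event count and the events-per-parameter rule of thumb in the comments above can be checked with a short sketch. Python is used only for illustration and the variable names are hypothetical. It assumes the two readings of ".00072" discussed in the comments: as a proportion (0.072%) or literally as a percentage.

```python
import math

n_obs = 88_000  # approximate sample size from the question

# Reading ".00072" as a proportion (0.072%), which gives the ~64 events
# mentioned in the comments.
events_if_proportion = math.ceil(n_obs * 0.00072)

# Reading ".00072%" literally as a percentage gives less than one event.
events_if_percent = n_obs * 0.00072 / 100

# Rule of thumb quoted in the comments: 10 to 20 events per parameter.
max_params = events_if_proportion // 10
min_params = events_if_proportion // 20

print(events_if_proportion, events_if_percent, min_params, max_params)
```

Under the first reading the parameter budget comes out at roughly 3 to 6, which is the range quoted in the comments.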

score 1 · Answer 1 · edited Apr 13 '17 at 12:44

There are some really good answers in this Cross Validated post: "Suggestions for cost-sensitive learning in a highly imbalanced setting".

In practice, I've found that bagging a set of undersampled, boosted trees works well. For each model, you randomly sample your negatives down to about 10x the size of your positive class and train one model on that subset, then repeat. You'll end up with X models; bag the results from there.
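
A minimal sketch of that procedure, using only the Python standard library. The helper names and `train_fn` are hypothetical: `train_fn` stands in for whatever boosted-tree learner you use, and it is assumed to take a list of rows and a list of labels and return a callable that scores one row. Averaging the scores is one common way to bag results, assumed here.

```python
import random
from statistics import mean


def undersampled_bags(labels, n_models=25, ratio=10, seed=0):
    """Yield one list of row indices per model: every positive row plus
    `ratio` times as many randomly drawn negative rows."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(neg), ratio * len(pos))
    for _ in range(n_models):
        yield pos + rng.sample(neg, k)


def fit_bagged(X, y, train_fn, n_models=25, ratio=10, seed=0):
    """Train one model per undersampled subset.

    `train_fn(rows, labels)` is a placeholder for your learner and must
    return a callable that maps one row to a score."""
    models = []
    for idx in undersampled_bags(y, n_models, ratio, seed):
        models.append(train_fn([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagged_predict(models, row):
    """Bag the results by averaging the per-model scores for one row."""
    return mean(m(row) for m in models)
```

Note that each model sees a much higher event rate than the full data, so the averaged scores are useful for ranking but are not calibrated probabilities for the original class balance.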
