Is it OK to sample based on the DV?

Asked Jan 28 '17 at 03:03


I have a dataset with a binary outcome, y.

I'd like to fit a decision tree to the data, but y == T is very rare, so every leaf of the tree predicts y == F.

Is there any problem with sampling based on y, e.g., using only the N cases where y == T plus N randomly chosen cases where y == F?
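To make the proposed scheme concrete, here is a minimal sketch of it in plain Python. The function name `balanced_undersample` and the toy data are hypothetical and only illustrate the idea: keep all N cases where y == T and draw N of the y == F cases at random, without replacement.

```python
import random

def balanced_undersample(rows, labels, seed=0):
    """Keep every y == T case plus an equal-sized random draw of y == F cases.

    Hypothetical helper for illustration; not taken from any library.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    # Sample negatives without replacement, capped at how many exist.
    neg_kept = rng.sample(neg, min(len(pos), len(neg)))
    keep = sorted(pos + neg_kept)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data (made up): 5 cases with y == T among 1000.
labels = [i < 5 for i in range(1000)]
rows = [[float(i)] for i in range(1000)]
X_bal, y_bal = balanced_undersample(rows, labels)
```

The balanced `X_bal`, `y_bal` would then be passed to whatever tree implementation is in use in place of the full data.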


Jeremy


Short answer: yes, it is OK. You may under-sample y == F and over-sample y == T. – SmallChess Jan 28 '17 at 04:12
This happens in a case-control study, so you can look there at what is and is not possible with such data. – Maarten Buis Jan 28 '17 at 14:58
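One consequence of case-control style sampling is that probabilities estimated on the subsample no longer reflect the base rate in the full data. As a rough sketch, assuming every y == T case was kept and each y == F case was kept independently with probability `neg_keep_rate`, the odds in the subsample are inflated by a factor of 1 / `neg_keep_rate`, and a predicted probability can be mapped back as below. The function name and parameter are hypothetical.

```python
def correct_probability(p_sampled, neg_keep_rate):
    """Map a probability estimated on the subsample back to the full data.

    Assumes all y == T cases were kept and each y == F case was kept with
    probability neg_keep_rate, so true odds = neg_keep_rate * sampled odds.
    """
    num = neg_keep_rate * p_sampled
    return num / (num + (1.0 - p_sampled))

# Made-up example: 5 positives and 995 negatives, of which 5 negatives are kept.
# A leaf that is 50/50 in the subsample corresponds to the overall base rate.
p_full = correct_probability(0.5, 5 / 995)
```

With these made-up numbers, `p_full` works out to 5/1000, the base rate of y == T in the full data. The ranking of cases is unchanged by the correction; only the probability scale shifts.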

0 Answers