
I am trying to build a decision tree model. My dataset has 2,000,000 data points in total, including duplicates: 700,000 are true and 1,300,000 are false. Does this count as an imbalanced dataset for a decision tree model?

If this is an imbalanced dataset, can I build the decision tree from 1,000,000 true values and 700,000 false values, allowing duplicates, to balance the classes, instead of using all of the data?

I don't know exactly what effect the duplicates would have on the model, though.

StoryMay
  • No, it is not imbalanced. Severe duplication might cause some issues, but at face value a 35-65 split is fine. – usεr11852 Nov 14 '21 at 13:51
  • By definition, your dataset is imbalanced, at least a little bit. However, are you sure that class imbalance is even a problem? https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Nov 14 '21 at 13:56
  • I think everyone would agree that a data set with a 1:1 ratio of 2 classes is "balanced." And also, a data set with $n$ of one class and $n + 1$ of the other class (where $n$ is "large") is also "close enough" to 1:1 to be balanced. But at some point, there exists some ratio of $n:m$ that is imbalanced. But where, in the range of values between $n+1$ and $m>n$ does the data set stop being "balanced"? This question is not too different from the [Sorites paradox](https://plato.stanford.edu/entries/sorites-paradox/). – Sycorax Nov 14 '21 at 19:31

1 Answer


700K versus 1.3M looks pretty balanced to me. When people talk about imbalanced data, they usually mean ratios of 1:1000 or worse. In fraud detection, for example, the ratio can be 1:100000.
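To put the two situations side by side, here is a small sketch that computes the minority-class share and the class ratio. The counts for the first case come from the question; the 1:1000 case is a made-up illustration, and the helper names `minority_share` and `ratio_to_one` are my own.

```python
# Class counts taken from the question.
n_true = 700_000
n_false = 1_300_000


def minority_share(n_a, n_b):
    """Fraction of all examples that belong to the smaller class."""
    return min(n_a, n_b) / (n_a + n_b)


def ratio_to_one(n_a, n_b):
    """Number of majority examples per single minority example."""
    return max(n_a, n_b) / min(n_a, n_b)


share = minority_share(n_true, n_false)
ratio = ratio_to_one(n_true, n_false)
print(f"question data: minority share {share:.2%}, ratio 1:{ratio:.2f}")

# Hypothetical fraud-style data with one positive per 1000 negatives.
fraud_share = minority_share(1, 1000)
fraud_ratio = ratio_to_one(1, 1000)
print(f"fraud-style data: minority share {fraud_share:.2%}, ratio 1:{fraud_ratio:.0f}")
```

The question's data has roughly one true value for every two false values, which is a long way from the one-in-a-thousand regime.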

In addition, decision trees (and most machine learning models, such as logistic regression) work fine with imbalanced data sets. Just pay attention to the evaluation metrics, and use precision and recall instead of accuracy.
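As a rough illustration of why accuracy can mislead, the sketch below scores a trivial "always predict negative" classifier on a made-up 1:1000 dataset. The data, the helper functions, and the convention of returning 0.0 when a denominator is zero are all assumptions for the example, not part of the question.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    # Assumed convention: 0.0 when the model predicts no positives at all.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 10 positives and 10,000 negatives (1:1000).
y_true = [1] * 10 + [0] * 10_000
# Trivial baseline that never predicts the positive class.
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f}")
```

The baseline gets about 99.9% accuracy while finding none of the positives, which is exactly what precision and recall expose.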

Haitao Du