
It seems like a very simple question: does the "true distribution" (or "natural distribution") of the training data matter in machine learning?

The motivation for this question comes from classification on imbalanced data. To handle class imbalance, we can use methods such as under-sampling or over-sampling to produce a relatively balanced training set, which can help the model perform better. But in doing so, haven't we changed the distribution of the data?
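To make the concern concrete, here is a minimal sketch (using NumPy only, with made-up labels rather than a real data set) of the kind of random under-sampling the question describes. It shows that the class prior the model is trained on is no longer the natural one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced labels: about 95% class 0, 5% class 1 (the "natural" distribution).
y = rng.choice([0, 1], size=10_000, p=[0.95, 0.05])

# Random under-sampling: shrink the majority class to the size of the minority class.
idx0 = np.flatnonzero(y == 0)
idx1 = np.flatnonzero(y == 1)
keep0 = rng.choice(idx0, size=idx1.size, replace=False)
balanced = np.concatenate([keep0, idx1])

print(y.mean())            # roughly 0.05: natural prior of class 1
print(y[balanced].mean())  # exactly 0.5: prior after under-sampling
```

After resampling, a probabilistic classifier fit on `balanced` will estimate P(class 1) against a 50/50 prior rather than the natural 5/95 one, so its predicted probabilities are shifted relative to the original population. That shift is exactly the "changed distribution" the question is asking about.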

Furthermore, could we arbitrarily change the distribution of the data, observe the model's performance under each change, and then select the best model? This seems unreasonable, but where exactly is the problem?

gung - Reinstate Monica
ConnellyM
  • I wrote an answer to a similar question a while back [here](https://stats.stackexchange.com/questions/478758/should-we-really-do-re-sampling-in-class-imbalance-data/479102#479102). Let me know if that answers your question. – Demetri Pananos Oct 09 '20 at 13:22
  • Unbalanced data is not necessarily a problem: https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning https://stats.stackexchange.com/questions/357377/random-over-sampling-to-handle-data-imbalance https://stats.stackexchange.com/questions/199230/downsampling-vs-upsampling-on-the-significance-of-the-predictors-in-logistic-reg – kjetil b halvorsen Oct 09 '20 at 14:25
  • @Demetri Pananos The answer is not what I am looking for, but thanks for your reply. – ConnellyM Oct 10 '20 at 07:18
  • @kjetil b halvorsen Thank you for the linked answers. It took me some time to read them, but I am still confused about my question. Can you explain it in detail? – ConnellyM Oct 12 '20 at 07:00

0 Answers