I need to perform an analysis of employee attrition using Machine Learning algorithms. I intend to do both Supervised Learning analysis (classification) and Unsupervised Learning analysis (pattern detection) on the data set. My data set is a list of all employees (actual and leavers) from the past 3 years. The original data set contains around 70% actual employees and 30% leavers.

I am confused about how to split my data set for the training set. Should it be equally balanced between actual employees and leavers for the Supervised Learning problem? And should I care about giving this same treatment to the training set when running the Unsupervised Learning algorithm?

I have gone through the following post but I am still confused:

When should I balance classes in a training data set?

user3115933

1 Answer

You should look into stratified sampling. The training data need not contain equal numbers of actual employees and leavers. What you do need is to use the churn flag as the stratification variable, so that the ratio of actual employees to leavers is similar in both the training and test samples.
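
To make the idea concrete, here is a minimal sketch of a stratified split in plain Python. The data, the `status` field (1 for actual, 2 for leaver), the 80/20 split, and the helper name `stratified_split` are all assumptions made for illustration, not part of the question. In practice, scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), sampling each label group separately
    so both parts keep roughly the same label ratio as the full data set."""
    rng = random.Random(seed)

    # Group the rows by their label (the stratification variable).
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so rows are not ordered by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical data set: 700 actual employees (status 1), 300 leavers (status 2).
employees = [{"id": i, "status": 1 if i < 700 else 2} for i in range(1000)]

train, test = stratified_split(
    employees, label_of=lambda row: row["status"], test_fraction=0.2, seed=42
)

train_counts = Counter(row["status"] for row in train)
test_counts = Counter(row["status"] for row in test)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

Both parts keep the original 70:30 ratio: the training set holds 560 actual employees and 240 leavers, the test set 140 and 60.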

Srikrishna
  • If I understand well, you mean that if the variable, say, Status = 1 or 2 (1 for Actual, 2 for Left) has a ratio of 70:30 in the original data set, I must ensure that the training data set and test data set are both still with a ratio 70:30 after the split. – user3115933 Jul 19 '18 at 11:59
  • Yes that’s what I mean. – Srikrishna Jul 19 '18 at 12:09