I need to perform an analysis on employee attrition using Machine Learning algorithms. I intend to do both Supervised Learning analysis (classification) and Unsupervised Learning analysis (pattern detection) on the data set.
My data set is a list of all employees (current employees and leavers) from the past 3 years.
The original data set contains around 70% current employees and 30% leavers.
I am confused about how to split my data set to build the training set. Should it be equally balanced between current employees and leavers for the Supervised Learning problem? And should I apply the same treatment to the training set when running the Unsupervised Learning algorithm?
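To make the two options I am weighing concrete, here is a minimal sketch in plain Python. The 700/300 counts are made-up numbers that only mirror the 70/30 ratio above, the label coding (0 = still employed, 1 = leaver) is my own convention, and the helper names `stratified_split` and `undersample_to_balance` are hypothetical, not from any library.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train/test, keeping each class's share unchanged."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def undersample_to_balance(indices, labels, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(rng.sample(idx, n_min))
    return balanced


# Made-up data: 0 = still employed (70%), 1 = leaver (30%)
labels = [0] * 700 + [1] * 300

# Option A: training set keeps the original 70/30 ratio
train_idx, test_idx = stratified_split(labels)

# Option B: training set forced to 50/50; the test set is left untouched
balanced_train_idx = undersample_to_balance(train_idx, labels)
```

Option A keeps the class ratio of the full data set in the training set, while Option B balances the training set only and leaves the test set at the original ratio. My question is which of these is appropriate for the classifier, and whether either matters for the unsupervised step.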
I have gone through the following post but I am still confused: