
My data set has near-duplicate rows because there are multiple rows per employee, depending on how long they have been in the organization: employee Ann has 3 rows, Bob has 2 rows, and so on. Most features in the data set do not change over time. I am dropping EmpID and time and running a classification on the remaining features.

Since some features don't change over time, their values are repeated: some three times, some twice, depending on how many of the three study years the employee was in the organization.

Will this adversely impact the Gini index (or entropy) calculation, since some values are repeated more often than others? Am I thereby giving more weight to an employee who has stayed longer, when I shouldn't be? For example, Ann's Feature4 value appears three times while Diane's appears only once. Should I consider rolling up so that I have one row per employee?

I am trying Random Forest for classification. I believe Gini impurity is used for node selection/splitting, hence my question.

EmpID   time  Feature1  Feature2    Feature3  Feature4  Feature5 Feature6 Target   
Ann     1     Commence  Female      20        Ref-Yes   3.6      Good        0  
Ann     2     Not       Female      21        Ref-Yes   4.0      Good        0
Ann     3     Not       Female      22        Ref-Yes   3.2      Good        0
Bob     2     Commence  Male        19        Ref-No    2.6      Avg         0
Bob     3     Not       Male        20        Ref-No    2.7      Avg         1
Cathy   2     Commence  Female      24        Ref-No    1.6      Good        1
Diane   3     Commence  Female      37        Ref-Yes   6.6      Very Good   1
1 Answer

I will use the notation used here: https://stats.stackexchange.com/a/44404/2719

Let's consider this toy dataset:

EmpID   Feature2    Feature4  Target   
Ann     Female      Ref-Yes   0  
Ann     Female      Ref-Yes   0
Bob     Male        Ref-No    0
Cathy   Female      Ref-No    1

You can compute the $\Delta$ for Gini impurity for each feature: $$ \Delta(Feature2,Target) = 1 - (3/4)^2 - (1/4)^2 - 3/4\Big( 1 - (2/3)^2 - (1/3)^2\Big) - 1/4 \cdot 0 \approx 0.041 $$ $$ \Delta(Feature4,Target) = 1 - (3/4)^2 - (1/4)^2 - 1/2 \cdot 0 - 1/2 \Big( 1 - (1/2)^2 - (1/2)^2\Big) \approx 0.125 $$ According to this, $Feature4$ seems to be better than $Feature2$. Thus a decision tree induction algorithm (including CART and random forests) would choose to split the node on $Feature4$.

If you remove the duplicated Ann this will be the dataset and the $\Delta$:

EmpID   Feature2    Feature4  Target     
Ann     Female      Ref-Yes   0
Bob     Male        Ref-No    0
Cathy   Female      Ref-No    1

$$ \Delta(Feature2,Target) = 1 - (2/3)^2 - (1/3)^2 - 2/3\Big( 1 - (1/2)^2 - (1/2)^2\Big) - 1/3 \cdot 0 \approx 0.11 $$ $$ \Delta(Feature4,Target) = 1 - (2/3)^2 - (1/3)^2 - 1/3 \cdot 0 - 2/3\Big( 1 - (1/2)^2 - (1/2)^2\Big) \approx 0.11 $$ The $\Delta$ values are the same, which implies that the predictive power of the two features is the same.
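To make the two hand calculations concrete, here is a self-contained Python sketch (the function and variable names are mine, not from any library) that reproduces both sets of $\Delta$ values:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def delta(feature, target):
    """Gini impurity decrease for splitting on one categorical feature."""
    n = len(target)
    d = gini(target)
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        d -= len(subset) / n * gini(subset)
    return d

# With the duplicated Ann row (the 4-row table):
f2 = ["Female", "Female", "Male", "Female"]
f4 = ["Ref-Yes", "Ref-Yes", "Ref-No", "Ref-No"]
y  = [0, 0, 0, 1]
print(delta(f2, y), delta(f4, y))                  # ~0.042 vs 0.125: Feature4 wins

# After dropping the duplicate Ann row (the 3-row table):
print(delta(f2[1:], y[1:]), delta(f4[1:], y[1:]))  # both ~0.111: a tie
```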

In general, leaving such duplicates in the data will distort the $\Delta$ calculations.

Simone
  • Thank you. Do you suggest having 1 row per person to avoid this mess? – learner Aug 16 '20 at 23:57
  • The toy data set considers duplicate rows, but I have at least one feature (say Feature5) whose value changes over time. The toy data set treats rows as perfect duplicates, while mine are near-duplicates, not exact ones. – learner Aug 17 '20 at 02:28
  • 1
    OK I didn't see that. In theory, classifiers require all rows to be independent and identically distributed (iid). If you have exact duplicates, rows are not iid. Feature 5 complicates things. If you mean to classify an instantaneous event (which does not really depend to rows at previous times) I guess you could still use random forests. E.g. I see that Bob gets 0 for Target but also 1 on another row. A naive example for Target could be "working from home". I could only look at individual values for each day and try to predict if an employ will work from home or not. – Simone Aug 17 '20 at 08:52
  • When time is involved things are more complicated. It really depends on what your Target variable is. I would look into survival analysis and random survival forests https://github.com/sebp/scikit-survival – Simone Aug 17 '20 at 08:54
  • Thank you. Bob started in Year 2 but left the organisation in Year 3. So the Year 2 row has target 0, indicating he was still with the organization, and the Year 3 row is marked 1 because that is when he left. It's the same person Bob who is followed in Year 2 and Year 3. Cathy started in Year 2 and left in Year 2, hence only one row, with target marked 1. I am unable to conclude whether this dataset is iid. Do you think marking both (all) rows for Bob as 1 would make a difference? Appreciate your advice. I will also look at survival analysis. Thank you once again. – learner Aug 17 '20 at 10:15
  • 1
    They are not iid because they are actually time series. I think survival analysis would fit well in this context. It is used also to model customer churn.Though, you could simplify the problem by aggregating values in order to keep only one row for each employee. Rows would be iid then. Do you want to predict if someone leaves the organization? I would have Target 1 or 0 for each employee (1 left the organization). Add Feature0: length of stay (out of time t). You could take the average value for Feature5. Other options for Feature5 are the max value or the last value. – Simone Aug 17 '20 at 11:50
  • 1
    I have been reading up on survival analysis and looks like exactly what I need. Also, I am considering aggregating the rows. Thanks a ton !!! – learner Aug 17 '20 at 22:55
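For completeness, a short pandas sketch of the roll-up discussed in the comments. The DataFrame reproduces a subset of the question's table; the aggregation choices (Feature0 as length of stay, the mean of Feature5, Target = 1 if the employee ever left) follow the suggestions above, and the names are assumptions, not from the poster's actual code:

```python
import pandas as pd

# Subset of the question's table (static features like Feature2 repeat per row).
df = pd.DataFrame({
    "EmpID":    ["Ann", "Ann", "Ann", "Bob", "Bob", "Cathy", "Diane"],
    "time":     [1, 2, 3, 2, 3, 2, 3],
    "Feature2": ["Female", "Female", "Female", "Male", "Male", "Female", "Female"],
    "Feature5": [3.6, 4.0, 3.2, 2.6, 2.7, 1.6, 6.6],
    "Target":   [0, 0, 0, 0, 1, 1, 1],
})

# One row per employee, so the rows are (closer to) iid.
rolled = df.groupby("EmpID").agg(
    Feature0=("time", "size"),       # length of stay: number of observed years
    Feature2=("Feature2", "first"),  # static feature: any row will do
    Feature5=("Feature5", "mean"),   # or "max" / "last", as discussed above
    Target=("Target", "max"),        # 1 if the employee ever left
).reset_index()

print(rolled)
```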