My data set contains near-duplicate rows: each employee has one row per year they have been in the organization, so Ann has 3 rows, Bob has 2 rows, and so on. Most features in the data set do not change over time. I am dropping EmpID and time and running a classification on the remaining features.
Because some features don't change over time, their values are repeated: three times for some employees and twice for others, depending on how many years of the 3-year study period the employee has been in the organization.
Will this adversely affect the Gini index (or entropy) calculation, since some values are repeated more often than others? Am I giving more weight to an employee who has stayed longer when I shouldn't be? For example, Ann's Feature4 value appears three times while Diane's appears only once. Should I roll the data up so that I have one row per employee?
I am using a Random Forest for classification. I believe the Gini index is used to choose the split at each node, hence my question.
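To make the weighting concern concrete, here is a minimal sketch that computes the Gini impurity of the Target column from the sample rows in this post in two ways: once with every row counted equally, and once with each employee's rows sharing a total weight of 1. The `gini` helper and the 1/(rows per employee) weighting are assumptions made for illustration, not something taken from a particular library.

```python
from collections import Counter, defaultdict

# (EmpID, Target) pairs copied from the sample rows in this post.
rows = [
    ("Ann", 0), ("Ann", 0), ("Ann", 0),
    ("Bob", 0), ("Bob", 1),
    ("Cathy", 1),
    ("Diane", 1),
]


def gini(weighted_labels):
    """Gini impurity of (label, weight) pairs: 1 - sum(p_k ** 2)."""
    totals = defaultdict(float)
    for label, weight in weighted_labels:
        totals[label] += weight
    grand_total = sum(totals.values())
    return 1.0 - sum((w / grand_total) ** 2 for w in totals.values())


# Case 1: every row counts equally, so Ann contributes three times.
gini_per_row = gini([(target, 1.0) for _, target in rows])

# Case 2 (assumption): each employee carries a total weight of 1,
# split evenly across that employee's rows.
rows_per_emp = Counter(emp for emp, _ in rows)
gini_per_employee = gini(
    [(target, 1.0 / rows_per_emp[emp]) for emp, target in rows]
)

print(f"row-weighted Gini:      {gini_per_row:.4f}")
print(f"employee-weighted Gini: {gini_per_employee:.4f}")
```

On these seven rows the two values differ (roughly 0.49 versus 0.47), which suggests that repeated rows do shift the class proportions that Gini is computed from. Whether that shift is harmful depends on whether the unit of analysis is meant to be the employee or the employee-year.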
EmpID  time  Feature1  Feature2  Feature3  Feature4  Feature5  Feature6   Target
Ann    1     Commence  Female    20        Ref-Yes   3.6       Good       0
Ann    2     Not       Female    21        Ref-Yes   4.0       Good       0
Ann    3     Not       Female    22        Ref-Yes   3.2       Good       0
Bob    2     Commence  Male      19        Ref-No    2.6       Avg        0
Bob    3     Not       Male      20        Ref-No    2.7       Avg        1
Cathy  2     Commence  Female    24        Ref-No    1.6       Good       1
Diane  3     Commence  Female    37        Ref-Yes   6.6       Very Good  1
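For reference, one possible way to roll the sample up to one row per employee is sketched here. It assumes that each employee's most recent row (largest time) is the one to keep, and it adds a hypothetical `years_observed` column so the number of yearly rows is not lost. Both choices are assumptions for illustration; keeping the first row or averaging the numeric features would be equally plausible.

```python
# Sample rows from this post.
columns = ["EmpID", "time", "Feature1", "Feature2", "Feature3",
           "Feature4", "Feature5", "Feature6", "Target"]
data = [
    ("Ann", 1, "Commence", "Female", 20, "Ref-Yes", 3.6, "Good", 0),
    ("Ann", 2, "Not", "Female", 21, "Ref-Yes", 4.0, "Good", 0),
    ("Ann", 3, "Not", "Female", 22, "Ref-Yes", 3.2, "Good", 0),
    ("Bob", 2, "Commence", "Male", 19, "Ref-No", 2.6, "Avg", 0),
    ("Bob", 3, "Not", "Male", 20, "Ref-No", 2.7, "Avg", 1),
    ("Cathy", 2, "Commence", "Female", 24, "Ref-No", 1.6, "Good", 1),
    ("Diane", 3, "Commence", "Female", 37, "Ref-Yes", 6.6, "Very Good", 1),
]
records = [dict(zip(columns, row)) for row in data]

# Assumption: keep each employee's most recent row (largest `time`).
latest = {}
row_counts = {}
for rec in records:
    emp = rec["EmpID"]
    row_counts[emp] = row_counts.get(emp, 0) + 1
    if emp not in latest or rec["time"] > latest[emp]["time"]:
        latest[emp] = dict(rec)  # copy so the original record is untouched

# Hypothetical extra column: how many yearly rows the employee had.
for emp, rec in latest.items():
    rec["years_observed"] = row_counts[emp]

for rec in latest.values():
    print(rec)
```

With this roll-up each employee contributes exactly one row, so no employee is counted more often than another when class proportions are computed.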