Cluster Employees based on multiple Check in-Check out time in a day

Question

Using machine learning techniques, is it possible to analyse employee check-in/check-out times over a period of time and cluster the employees?

1. An employee can check in and check out any number of times in a day.
2. Employees have a defined roster schedule.
3. Employees may go to different buildings for work.
4. Location and department could be additional features for clustering.

Edit:
How can multiple check-ins/check-outs in a day be represented so that they can be fed to a time series algorithm and then clustered?

Sample test data of 6 employees for a week period
https://www.dropbox.com/s/5f9hxuyr12x26rc/testData.csv?dl=0

*Initial rows*
Idnumber,Dept,EventTime,Location
1651589,D2000,2017-08-14 15:39:02,BLDG2-7F-D015 Entry
1651589,D2000,2017-08-14 15:38:54,BLDG2-7F-D018 Exit
83240,D1000,2017-08-14 15:22:37,BLDG1-4F-D004
1651589,D2000,2017-08-14 15:11:26,BLDG2-7F-D018 Entry
1651589,D2000,2017-08-14 15:11:20,BLDG2-7F-D015 Exit
62879,D1000,2017-08-14 14:49:15,BLDG1-4F-D004
62879,D1000,2017-08-14 14:47:10,BLDG1-3F-D004
83240,D1000,2017-08-14 14:45:40,BLDG1-4F-D006 Entry
83240,D1000,2017-08-14 14:37:53,BLDG1-4F-D006 Exit
84778,D1000,2017-08-14 14:24:41,BLDG2-GF-G018 Entry
1662394,D2000,2017-08-14 14:13:11,BLDG2-1F-G025 Entry
1662394,D2000,2017-08-14 14:12:19,BLDG2-1F-G025 Exit
84778,D1000,2017-08-14 14:11:17,BLDG1-GF-G003 Exit

...

The OP seems to ask about how to cluster employees, where each employee has multiple intervals, i.e., his check-in and check-out times, with potentially multiple check-ins and -outs per day. This does not seem to be overly hard. I have nominated the question for reopening; if it is reopened, I'll provide an answer. @AljoJose: could you in the meantime please add some representative data to the question? — Stephan Kolassa, Aug 29 '17 at 07:50
Ah. I think I'm still misunderstanding. You only have a single employee whose working times you want to cluster? What's the output you would expect from your example data? — Stephan Kolassa, Aug 29 '17 at 14:52
@Stephan, your initial understanding was correct. There are multiple employees, I just have given sample data of one employee here. Output should cluster employees based on checkin/checkout, location. — Aljo Jose, Aug 29 '17 at 16:08
OK. Then can you please put up data of multiple employees (e.g., only timestamps for one week), so we actually have something to cluster? — Stephan Kolassa, Aug 29 '17 at 16:44
@StephanKolassa, Thank you. Now I have updated test data for a week. https://www.dropbox.com/s/5f9hxuyr12x26rc/testData.csv?dl=0 — Aljo Jose, Aug 30 '17 at 06:24
Can you say more about what you want to discover by clustering? There are different ways to approach the problem, depending on what you're interested in. — user20160, Aug 30 '17 at 07:25
@user20160, It is to understand hidden patterns in employee's actual work schedules to help business in making decisions for improvement. — Aljo Jose, Aug 30 '17 at 08:35
@Tim, Sorry for lack of clarity. Now, I have elaborated the question and details. — Aljo Jose, Aug 30 '17 at 08:36
What I'm asking is what types of 'hidden patterns' you are interested in. For example, you might want to find groups of employees who tend to enter/leave at similar times. Or you might want to find groups of employees who overlap in space and time but don't necessarily have correlated entry/exit times. These goals would require different approaches to clustering. In general, you have to specify what kinds of patterns you're interested in before you can come up with an algorithm to find them. — user20160, Aug 31 '17 at 23:16
@user20160 objective is to group employees who tend to enter/exit at similar times. There can be multiple entry-exit possible in a day. — Aljo Jose, Sep 02 '17 at 12:38

score 1 · Answer 1 · edited Jun 11 '20 at 14:32

Sorry it took me so long - a lot of unforeseen things came up. Unfortunately, I won't have the time to go into your actual data, but some pointers may be useful.

Clustering typically involves three steps, the last of which is optional but helpful.

1. Choose a distance metric between data points. Your data points are the periods during which any one employee is checked in, so you need to define a distance between the sets of disjoint intervals belonging to two such individuals. Which metric to use depends on what you want to do with the final clustering.

   For instance, you could determine the length of time during which both employees are checked in, normalize it by the total length of time you are considering (for example, the time during which at least one of the two is checked in), and finally subtract this number from one. This gives distances between one (if the two employees are never checked in at the same time) and zero (if both always check in and out at the same time). You could also include whether both are checked into the same building.
2. Given a distance metric, decide on a clustering algorithm. $k$-means is the first algorithm one learns about, and for many people it is synonymous with "clustering", but it has drawbacks. I personally like DBSCAN, because it does not require a prespecified number of clusters, works with non-circular clusters and allows outliers or noise, i.e., data points that are not assigned to any cluster.
3. Finally, it is often helpful (though not required) to visualize your results. To do so, find a two-dimensional representation of your data points that preserves distances as far as possible; Multidimensional Scaling (MDS) is one way to get this. When you plot your data points, color each one according to the cluster it is assigned to, and use other markers (like dot characters, or letters) to indicate other data features of interest. You may be able to see patterns visually.
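As a rough illustration of the first step, here is a minimal sketch of the overlap-based distance in plain Python. It assumes each employee's events have already been paired into disjoint (check-in, check-out) intervals, and it normalizes by the time during which at least one of the two employees is checked in, which is only one possible choice of "total length of time". The employee labels, the helper `at` and the example schedules are made up for illustration and are not taken from the linked test data.

```python
from datetime import datetime
from itertools import combinations


def total_length(intervals):
    """Seconds covered by one employee's disjoint (start, end) intervals."""
    return sum((end - start).total_seconds() for start, end in intervals)


def overlap_length(a, b):
    """Seconds during which both employees are checked in at the same time."""
    total = 0.0
    for start_a, end_a in a:
        for start_b, end_b in b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                total += (end - start).total_seconds()
    return total


def overlap_distance(a, b):
    """One minus (time both are in) / (time at least one is in)."""
    both = overlap_length(a, b)
    either = total_length(a) + total_length(b) - both
    # Convention (an assumption): two empty schedules count as identical.
    return 1.0 - both / either if either > 0 else 0.0


def distance_matrix(schedules):
    """Return sorted employee ids and the symmetric pairwise distance matrix."""
    ids = sorted(schedules)
    dist = [[0.0] * len(ids) for _ in ids]
    for i, j in combinations(range(len(ids)), 2):
        d = overlap_distance(schedules[ids[i]], schedules[ids[j]])
        dist[i][j] = dist[j][i] = d
    return ids, dist


def at(hour, minute=0):
    """Hypothetical helper: a timestamp on one example day."""
    return datetime(2017, 8, 14, hour, minute)


# Made-up schedules: emp_a and emp_b coincide, emp_c works a late shift,
# emp_d partly overlaps with emp_a and emp_b.
schedules = {
    "emp_a": [(at(9), at(12)), (at(13), at(17))],
    "emp_b": [(at(9), at(12)), (at(13), at(17))],
    "emp_c": [(at(18), at(22))],
    "emp_d": [(at(11), at(15))],
}

ids, dist = distance_matrix(schedules)
for emp, row in zip(ids, dist):
    print(emp, [round(x, 3) for x in row])
```

A matrix like `dist` could then be handed to a clustering routine that accepts precomputed distances, for example scikit-learn's `DBSCAN` with `metric="precomputed"`; MDS implementations typically accept such a dissimilarity matrix as well. Adding a building term to `overlap_distance` would be a separate modelling decision.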

Thank you Stephan, I have built one model using check-in and check-out times in seconds as inputs to the clustering algorithm. Is it wise to add a "building" feature to the same model, or should I create a separate model for building-wise clusters? — Aljo Jose, Oct 11 '17 at 09:28
That really depends on what you want to use the resulting clusters for. — Stephan Kolassa, Oct 11 '17 at 09:29
