I'm doing an exploratory clustering analysis of student course schedules at a college. Interpretability by humans is paramount: we're trying to inform future research questions and possibly scheduling decisions. Success for this project would include observations like "many students take classes only on Monday/Wednesday" or "hey, look, there are a bunch of students who take morning classes at the Springfield campus and go over to the Franklin campus in the afternoon".
For each student, for each class the student took, I have the start time, end time, days of the week, and campus. I'm working with data from a single semester.
As a first pass, I'm doing agglomerative clustering in sklearn
and exploring the dendrogram by hand. The hard part so far is the question I'm posting here: what would be a reasonable distance metric between two students based on their course schedules?
My first attempt was to divide the week into 30-minute intervals and assign each student a 1 for every slot where the student was in class and a 0 elsewhere. For example, a student with a morning MW class, an afternoon TR class, and a Friday seminar would look like this:
Then I can compute the overlap between two course schedules using the Jaccard distance. But I'm not fully satisfied with this approach:
- Non-overlapping classes are all equally dissimilar. A class that starts at 9:00 is just as different from one that starts at 11:00 as it is from one that starts at 16:00.
- There's no "bonus" for classes at the same time on different days, or for classes at different times on the same day.
- It's not obvious how to extend this to include online classes (although I suppose I could make a new "day" just for online classes) or the campus where the class was located.