Let's say I have a check-all-that-apply survey question. What kind of analysis can I run to find out whether there are meaningful clusters (e.g., one cluster of people who choose A, B, and C, and another cluster who choose A, D, and E)?
Please be more specific with your question. If there are people who chose A and B then they will show up in clusters choosing (A,B,C) and (A,B,D). What sort of clusters are you particularly interested in? – Sid Sep 19 '14 at 07:27
Multiple-response data are a set of binary variables. Many proximity measures exist for such data (Jaccard being among the most popular). You base your clustering on the matrix of such distances. – ttnphns Sep 19 '14 at 07:59
1 Answer
Two key approaches:
Treat each option as a binary variable and use an appropriate proximity measure such as Jaccard similarity, which Jaccard developed for biological research ("check all species that live in this region"). With the resulting distance matrix, a wide variety of clustering algorithms is available.
Use frequent itemset mining, and check if you have interesting frequent patterns. The benefit is that these patterns may overlap, and can be transformed to rules such as "users that chose A and B also picked C in 90% of the cases".
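To make the first approach concrete, here is a minimal sketch in plain Python. It assumes each respondent is stored as the set of options they ticked, uses Jaccard distance (one minus Jaccard similarity), and merges clusters with a naive average-linkage rule. The function names, the sample `responses`, and the `max_distance=0.5` cutoff are all made up for illustration; a real analysis would more likely use a library implementation and a cutoff chosen from the data.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty selections count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def average_linkage_clusters(responses, max_distance):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters until the closest pair is farther apart than max_distance."""
    clusters = [[i] for i in range(len(responses))]

    def linkage(c1, c2):
        # Average Jaccard distance over all cross-cluster pairs.
        dists = [jaccard_distance(responses[i], responses[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        (x, y), d = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]

# Hypothetical responses: each set holds the options one respondent ticked.
responses = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "D", "E"},
    {"A", "D", "E"},
    {"D", "E"},
]
clusters = average_linkage_clusters(responses, max_distance=0.5)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

Note that option A is shared by both groups, yet the respondents still separate, because Jaccard distance looks at the whole selection rather than any single option.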

Has QUIT--Anony-Mousse
Can you recommend a source to read about frequent itemset mining, including its algorithm, so that one may want to try to code it oneself? – ttnphns Aug 26 '17 at 07:02
That should be covered in every data mining book. Apriori is the best-known algorithm for this (and it's one of the top 10 data mining algorithms: https://link.springer.com/article/10.1007/s10115-007-0114-2 ). That said, I'd rather use FPGrowth, which is cleverer (but also more complicated). – Has QUIT--Anony-Mousse Aug 26 '17 at 09:25
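For anyone who wants to try coding it by hand, below is a rough sketch of the Apriori idea in plain Python, not a tuned implementation. It counts support level by level, builds size-k candidates from the frequent (k-1)-itemsets, and prunes any candidate that has an infrequent subset. The helper names (`apriori`, `confidence`), the sample `responses`, and `min_support=2` are hypothetical choices for illustration only.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for every itemset that
    appears in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    return frequent[antecedent | consequent] / frequent[antecedent]

# Hypothetical survey responses: the options each respondent ticked.
responses = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "D", "E"},
    {"A", "D", "E"},
    {"D", "E"},
]
frequent = apriori(responses, min_support=2)
print(confidence(frequent, {"A", "B"}, {"C"}))  # 0.75
```

In this toy data, respondents who chose A and B also picked C in 75% of the cases, and the overlapping patterns {A, B, C} and {A, D, E} both come out as frequent even though they share option A.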