Clustering experiments based on event distance matrices

Question

I'm running a bunch of experiments with randomly picked "knobs", and I'm recording various event types and the times at which they occurred during each experiment. I'm particularly interested in getting a good variety of events happening simultaneously, so I process the event timings to create a matrix that shows the number of times two events happened near each other, like this:

   e1 e2 e3 e4 e5
e1    
e2  6
e3 11  2  
e4  0 11  4 
e5  1 14  1 15

My goal is to cluster the experiments based on the above data, finding big clusters that produce similar data (so that I can run fewer of them) and finding outliers/small clusters (so that I can run more of them and even things out).

What would some appropriate clustering algorithms be to deal with data like this?

Using what I'm familiar with, I could normalize all experiments and then compare the matrices and calculate the distance between any two experiments... then use MDS to convert to 2D locations, and use DBSCAN to cluster. That, however, seems like a lot of steps where data can turn from good to useless if I'm not carefully tuning each step.

Is there some simpler methodology to determine the similarity of a bunch of matrices, and highlight those that are most dissimilar from the others?

Update: Adding more clarity (hopefully :) ) To simplify things, let's ignore what the matrices represent and just say that I have N observations, where each has a 2D set of attributes. How do I cluster the observations, with the goal of finding those that are the most different from the others?

That looks almost like a graph of a network. Here is a question that I asked whose matrix looks like yours. Perhaps the answer and approach are relevant too. http://stats.stackexchange.com/questions/139490/approach-and-example-of-graph-clustering-in-r — EngrStudent, Sep 13 '15 at 18:57
This is hard to follow. Do you want to cluster on your response (Y) data? Is it 1D? — gung - Reinstate Monica, Sep 13 '15 at 19:00
@EngrStudent Thanks! However, that deals with clustering of the data in a single matrix -- while that's interesting in itself, I'm looking to cluster N matrices based on how similar they are. — Stan, Sep 13 '15 at 21:04
If you are measuring the same attributes, then you can convert the measurements to a scalar by measuring distance to the ensemble mean. Do you replicate experiments? — EngrStudent, Sep 14 '15 at 10:49

score 3 · Accepted Answer · answered Sep 14 '15 at 06:09

If I understand your question correctly, then you have a number of experiments, and each experiment produces such a matrix.

While you could simply serialize the matrix into a vector and then try a bunch of clustering algorithms, I would not recommend this.

Dumping the data as-is into a clustering algorithm barely ever works for real data.

You need to preprocess the data and guide the clustering algorithm, so that it does not produce:

- trivial results (everything is one big cluster)
- biased results (e.g. using only the attribute with the largest range)
- obvious results (e.g. two clusters of customers: male and female - a correct result, but useless if you already know the gender)
- random results (e.g. on uniform data, k-means will still find clusters, but they are only as good as a random division of the data space into two halves of approximately the same size)
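
The last point is easy to check with a small experiment. The sketch below assumes scikit-learn is available and uses synthetic uniform data (the sample size and seed are arbitrary choices for illustration): k-means still returns two "clusters", each holding roughly half of the points, even though the data has no structure by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with no cluster structure: points uniform in the unit square.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))

# k-means has to return k clusters, whether or not any exist.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Count the points assigned to each of the two clusters.
sizes = np.bincount(labels, minlength=2)
print("cluster sizes:", sizes)
```

A result like this says nothing about the data; it only shows that the algorithm always produces a partition.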

For your problem, I suggest trying hierarchical agglomerative clustering (HAC), DBSCAN, and MDS, using the following workflow:

1. Discuss similarity with the people who did the experiments.
2. Try to formalize this similarity/distance.
3. Use MDS to visualize the data.
4. If there are no clusters visible in the projection, return to 1 - if you can't see interesting clusters in the visualization, the algorithms usually won't work well either!
5. Try DBSCAN and HAC clustering using the similarity from 2.
6. HAC: Does the dendrogram exhibit a cluster structure? If not, return to 1. DBSCAN: did you get more than 1 cluster of non-trivial size (not too big, not too small)? If not, try varying the minPts/eps parameters in step 5, or improve similarity in steps 1+2.
7. Visualize the result using the projection of 3. Did it capture good clusters?
8. If no good clusters were captured, return to 5 and vary the parameters, or return to 1 to improve the similarity.
9. Analyze the clusters: what are they?
10. If you cannot explain the clusters, return to 1.
11. Present the results to the experimenters, ask for feedback, and try starting from 1 again even when the results were already interesting.
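
As a rough illustration of the mechanical steps in this workflow (projection, DBSCAN, HAC), here is a minimal sketch. It assumes NumPy, SciPy, and scikit-learn are available, uses synthetic stand-in data instead of real experiments, and swaps in classical (Torgerson) MDS written directly in NumPy for the projection. The function names, `eps`, `min_samples`, and the Euclidean distance are all placeholders for illustration; the real distance should come out of steps 1 and 2.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS on a square distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def cluster_experiments(profiles, eps, min_samples, n_hac_clusters):
    """Run the projection, DBSCAN, and HAC on one row per experiment."""
    condensed = pdist(profiles, metric="euclidean")    # placeholder distance
    D = squareform(condensed)
    coords = classical_mds(D)                          # 2D layout for plotting
    db_labels = DBSCAN(eps=eps, min_samples=min_samples,
                       metric="precomputed").fit_predict(D)
    Z = linkage(condensed, method="average")           # HAC, average linkage
    hac_labels = fcluster(Z, t=n_hac_clusters, criterion="maxclust")
    return coords, db_labels, hac_labels

# Synthetic stand-in data: two tight groups of six "experiments" plus one outlier,
# each described by 10 values (e.g. the lower triangle of a 5x5 matrix).
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.01, size=(6, 10))
group_b = rng.normal(1.0, 0.01, size=(6, 10))
outlier = np.full((1, 10), 5.0)
profiles = np.vstack([group_a, group_b, outlier])

coords, db_labels, hac_labels = cluster_experiments(
    profiles, eps=0.2, min_samples=3, n_hac_clusters=3)
print("DBSCAN labels:", db_labels)   # -1 marks points DBSCAN treats as noise
print("HAC labels:   ", hac_labels)
```

DBSCAN labels noise points as -1, which is a natural way to surface the outlier experiments the question asks about. On real data the interesting part is not this code but whether the resulting clusters can be explained.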

Figuring out a good measure of similarity is essential. It's probably 80% of your work when clustering such a data set; that is why I keep sending you back to step 1.
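
To make "formalizing the distance" concrete, here is one possible starting point, written in plain Python. It is only an assumption about what might matter: each matrix is reduced to its lower triangle, the counts are scaled to proportions so that an experiment with more events overall is not automatically "different", and two experiments are compared by Euclidean distance. The helper names are hypothetical, and the experimenters may well tell you that a different normalization or weighting is the meaningful one.

```python
import math

def lower_triangle(matrix):
    """Flatten the strictly lower triangle of a square co-occurrence matrix."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]

def profile(matrix):
    """Scale the pair counts so they sum to 1 (all zeros if there are no counts)."""
    counts = lower_triangle(matrix)
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]

def experiment_distance(m1, m2):
    """Euclidean distance between the normalized co-occurrence profiles."""
    p1, p2 = profile(m1), profile(m2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

# The example matrix from the question (e1..e5); only the lower triangle is used.
example = [
    [0, 0, 0, 0, 0],
    [6, 0, 0, 0, 0],
    [11, 2, 0, 0, 0],
    [0, 11, 4, 0, 0],
    [1, 14, 1, 15, 0],
]

# A hypothetical second experiment with the same pattern but twice as many events.
doubled = [[2 * value for value in row] for row in example]

print(experiment_distance(example, doubled))
```

With this particular choice, an experiment that simply ran twice as long has distance zero to the original. Whether that is desirable is exactly the kind of question to settle in step 1.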

Thanks, that is very helpful! I've been going through a very similar flow to tune the analysis of each individual experiment for a different purpose, but fell into the trap of "need a better clustering algorithm" for the clustering of the experiments themselves. The key is step 2, really -- figuring out how to calculate the similarity the best way instead of dumping the data into some algorithm to do it for me. — Stan, Sep 14 '15 at 21:48
