I've got 10 (yes, only 10) cases measured on 1000 variables (e.g. concentrations of 1000 different compounds at 10 different time points). I can group these cases into 3 clusters in the 1000-dimensional space (complete linkage, cluster sizes 3, 3, and 4). This partitioning agrees with my expectations, but the clusters are not very well-defined. I suspect that some variables carry little or no information, some are pure noise, and some others are responsible for this particular partitioning. I would like to identify the latter, i.e. reduce the number of variables (e.g. to 100-200) so that the cases still fall into the same 3 clusters, and these clusters become significantly better defined than the original ones (e.g. by silhouette coefficient). The result should be a subset of the original variables, not some new unobserved ones.
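For reference, this is roughly how I compute the baseline partition and its silhouette — a minimal sketch with a random placeholder matrix standing in for the real 10 x 1000 data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 1000))  # placeholder for the real 10 x 1000 data matrix

# Complete-linkage clustering in the full 1000-dimensional space
Z = linkage(pdist(X, metric="euclidean"), method="complete")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters

baseline = silhouette_score(X, labels)  # the score I would like to improve
print(baseline)
```

This baseline silhouette is what any variable-selection scheme below would be measured against.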
I have the following ideas:
- Go through the variables one by one and compare the cluster solution in each 1-dimensional space to the original solution. Keep only those variables which produce similar results. I'm not sure this would work.
- Go through all the variables in the original solution and remove the one whose deletion results in the maximum increase in some goodness measure like the silhouette coefficient, then repeat.
- Attempt to identify the variables responsible for most of the variation, e.g. by doing multidimensional scaling into a few dimensions and then fitting the result back to the original 1000 dimensions with a Procrustes rotation, keeping the variables that fit best. As I understand it, this would only work if just a few variables are responsible for the variation.
- Delete the variables with the lowest predictor importance?
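The second idea (greedy backward elimination by silhouette) is the one I can sketch most concretely. `cluster_silhouette` and `backward_eliminate` below are hypothetical helpers of mine, not library functions; with only 10 cases each clustering is tiny, so one pass over all candidate variables per removal looks affordable even at 1000 variables:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def cluster_silhouette(X, cols, n_clusters=3):
    """Silhouette of the complete-linkage partition on a column subset of X."""
    sub = X[:, cols]
    Z = linkage(pdist(sub, metric="euclidean"), method="complete")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    if len(set(labels)) < 2:       # silhouette is undefined for a single cluster
        return -1.0
    return silhouette_score(sub, labels)

def backward_eliminate(X, n_keep, n_clusters=3):
    """Greedily drop the variable whose removal most improves the silhouette."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        # Score every leave-one-variable-out subset of the current set
        scores = [cluster_silhouette(X, keep[:j] + keep[j + 1:], n_clusters)
                  for j in range(len(keep))]
        del keep[int(np.argmax(scores))]  # remove the most harmful variable
    return keep
```

One pass removes one variable, so going from 1000 down to, say, 150 costs about 850 passes of up to 1000 small clusterings each; and like any greedy search it can of course get stuck in a local optimum.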
Would any of these work? Is there anything else I should try?