I've got 10 (yes, only 10) cases measured on 1000 variables (e.g. concentrations of 1000 different compounds at 10 different time points). I can group these cases into 3 clusters in the 1000-dimensional space (complete linkage, cluster sizes 3, 3, and 4). This partitioning agrees with my expectations, but the clusters are not very well-defined. I suspect that some variables carry little or no information, some are pure noise, and some others are responsible for this particular partitioning. I would like to identify the latter, i.e. reduce the number of variables (e.g. to 100-200) so that the cases still fall into the same 3 clusters, and these clusters become significantly better defined than the original ones (e.g. by silhouette coefficient). The result should be a subset of the original variables, not some new unobserved ones.
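For reference, this is roughly how I compute the baseline partition and its silhouette — a minimal sketch with a random placeholder matrix standing in for the real 10 x 1000 data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 1000))  # placeholder for the real 10 x 1000 data matrix

# Complete-linkage clustering in the full 1000-dimensional space
Z = linkage(pdist(X, metric="euclidean"), method="complete")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters

baseline = silhouette_score(X, labels)  # the score I would like to improve
print(baseline)
```

This baseline silhouette is what any variable-selection scheme below would be measured against.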
I have the following ideas:
- Go through the variables one by one and compare the cluster solution in each 1-dimensional space to the original solution. Keep only those variables which produce similar results. I'm not sure this would work.
- Go through all the variables in the original solution and remove the one whose deletion results in the maximum increase in some goodness measure like the silhouette coefficient, then repeat.
- Attempt to identify the variables responsible for most of the variation, e.g. by doing multidimensional scaling into a few dimensions and then fitting the result back to the original 1000 dimensions with a Procrustes rotation, keeping the variables that fit best. As I understand it, this would only work if just a few variables are responsible for the variation.
- Delete the variables with the lowest predictor importance?
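The second idea (greedy backward elimination by silhouette) is the one I can sketch most concretely. `cluster_silhouette` and `backward_eliminate` below are hypothetical helpers of mine, not library functions; with only 10 cases each clustering is tiny, so one pass over all candidate variables per removal looks affordable even at 1000 variables:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def cluster_silhouette(X, cols, n_clusters=3):
    """Silhouette of the complete-linkage partition on a column subset of X."""
    sub = X[:, cols]
    Z = linkage(pdist(sub, metric="euclidean"), method="complete")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    if len(set(labels)) < 2:       # silhouette is undefined for a single cluster
        return -1.0
    return silhouette_score(sub, labels)

def backward_eliminate(X, n_keep, n_clusters=3):
    """Greedily drop the variable whose removal most improves the silhouette."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        # Score every leave-one-variable-out subset of the current set
        scores = [cluster_silhouette(X, keep[:j] + keep[j + 1:], n_clusters)
                  for j in range(len(keep))]
        del keep[int(np.argmax(scores))]  # remove the most harmful variable
    return keep
```

One pass removes one variable, so going from 1000 down to, say, 150 costs about 850 passes of up to 1000 small clusterings each; and like any greedy search it can of course get stuck in a local optimum.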
Would any of these work? Is there anything else I should try?