
I am working on a genomics project and have ended up with a huge table: around 800 measurements (cases/rows), around 200 channels (columns/continuous variables), and 5 categories (one categorical column).

I would like to do two things:

  • Try to find sub-groups within the levels of the categorical variable that I already have
  • Create a new classification of these 800 measurements based only on the information in the continuous variables

I have been doing my homework and have read about different strategies (k-means, PCA), and I have found that it is very useful to get rid of redundant variables first. How can I choose which variables to drop properly?
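The two steps mentioned above (dropping redundant channels, then clustering) can be sketched in base R. This is only an illustration on simulated data standing in for the real 800 × 200 table; the correlation cutoff (0.9), the 90% variance target, and the choice of 5 clusters are all assumptions, not recommendations:

```r
# Simulated stand-in for the real table: 800 cases x 200 channels
set.seed(1)
X <- matrix(rnorm(800 * 200), nrow = 800)

# 1. Drop near-duplicate channels: flag one member of each highly
#    correlated pair (cutoff 0.9 is arbitrary here)
cors <- cor(X)
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
drop <- unique(high[, 2])
X_reduced <- if (length(drop) > 0) X[, -drop] else X

# 2. Compress the remaining channels with PCA, keeping enough
#    components to explain ~90% of the variance
pca <- prcomp(X_reduced, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.9)[1]
scores <- pca$x[, 1:k]

# 3. k-means on the reduced representation (5 clusters, matching the
#    number of existing categories, purely for illustration)
km <- kmeans(scores, centers = 5, nstart = 25)
table(km$cluster)
```

Note that PCA components mix all the original channels, so this reduces dimensionality but does not by itself tell you *which* channels are redundant; the correlation filter in step 1 (or a sparse method, as discussed below) is what actually removes variables.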

Someone recommended that I use multinomial regression; can you recommend a good resource to get started with it?

I am using R. Many thanks

pedrosaurio
  • First off: is this a supervised or an unsupervised learning problem? I get the sense it's the former, as you mentioned multinomial regression and the 5 categorical levels. Are they 5 distinct levels, or ordinal factors? If your goal is to reduce redundant variables, have you considered using the lasso for a multinomial logit model? The book The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is online and free. – AdamO Dec 20 '11 at 18:53
  • Thanks for your answer. Based on it I have read a bit and can be more precise: I guess I want to do both, supervised and unsupervised, then put the biology in between and draw some conclusions. I will read about all the suggestions you made, thanks again. – pedrosaurio Dec 21 '11 at 11:41
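AdamO's lasso suggestion in the comment above could look like the following in R. This is a sketch assuming the glmnet package (not named in the thread); the simulated sizes and the grouped-penalty choice are illustrative:

```r
library(glmnet)

# Simulated stand-in data: 300 cases, 20 channels, a 5-level outcome
set.seed(1)
X <- matrix(rnorm(300 * 20), nrow = 300)
y <- factor(sample(letters[1:5], 300, replace = TRUE))

# Cross-validated multinomial lasso; "grouped" penalizes each channel's
# coefficients across all 5 classes together, so a channel is either
# kept for every class or dropped entirely
fit <- cv.glmnet(X, y, family = "multinomial", type.multinomial = "grouped")

# Channels with nonzero rows at the selected lambda are the ones retained
coef(fit, s = "lambda.min")
```

With noise-only data like this, the lasso will typically shrink most or all coefficients to zero; on real data, the surviving channels are a natural candidate set of non-redundant variables.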

1 Answer


There is a fairly new technique that Tibshirani and his student Daniela Witten developed, called "sparse clustering". I think it is meant exactly for this situation, where there are many features but we would like to find the small subset of them that really matters. It is available in R as the "sparcl" package, which implements sparse versions of k-means and hierarchical clustering.
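A minimal sketch of sparse k-means with sparcl, per the package's documentation (the data sizes, K = 5, and the wbounds grid are all illustrative assumptions, not values from the thread):

```r
library(sparcl)

# Simulated stand-in data; sparcl expects a numeric matrix
set.seed(1)
X <- scale(matrix(rnorm(100 * 50), nrow = 100))

# Choose the L1 bound on the feature weights via the permutation-based
# gap statistic (wbounds must be > 1; grid here is arbitrary)
perm <- KMeansSparseCluster.permute(X, K = 5, wbounds = seq(1.5, 6, by = 0.5))

# Fit sparse k-means at the selected bound; features whose weight is
# shrunk to zero play no role in the clustering
fit <- KMeansSparseCluster(X, K = 5, wbounds = perm$bestw)[[1]]
sum(fit$ws != 0)  # number of channels that actually drive the clustering
fit$Cs            # cluster assignment for each case
```

The nonzero entries of `fit$ws` answer the variable-selection part of the question directly: they are the channels the clustering considers informative, with the rest treated as redundant.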

JCWong
  • Thanks, I will have a look. For the moment I have plunged into a machine learning course so that I understand the basics and all the different alternatives out there. I will definitely try your suggestion afterwards. – pedrosaurio Jan 02 '12 at 15:12
  • 1
    Witten's page says "sparcl ...for clustering a set of n observations when p variables are available, where p>>n", but here I gather n=800, p=200 ? – denis Feb 08 '12 at 11:45
  • I somehow missed your comment, Denis, sorry. Yes, apparently this method performs better when p (features) >> n (observations); nevertheless, the paper states that even if it is not optimal, the method can also be used when p < n. – pedrosaurio Mar 02 '12 at 12:32
  • @JCWong: Hi! I am happy to see that you're using the R package `sparcl`. Do you use that often? I have a question on the results produced by a function in the package... – alittleboy Jan 08 '13 at 06:56