
I am working on a genomics project and have ended up with a huge table: around 800 measurements (cases/rows), around 200 channels (columns/continuous variables), and 5 categories (one categorical column).

I would like to do two things:

  • Try to find sub-groups within the levels of the categorical variable that I already have
  • Create a new classification of these 800 measurements based only on the information in the continuous variables

I have been doing my homework and have read about different strategies (k-means, PCA), and I have found that it is very useful to get rid of redundant variables first. How can I choose which variables to drop properly?
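The two steps mentioned above (dropping redundant channels, then clustering) can be sketched in base R. This is only an illustration on simulated data standing in for the real 800 × 200 table; the correlation cutoff (0.9), the 90% variance target, and the choice of 5 clusters are all assumptions, not recommendations:

```r
# Simulated stand-in for the real table: 800 cases x 200 channels
set.seed(1)
X <- matrix(rnorm(800 * 200), nrow = 800)

# 1. Drop near-duplicate channels: flag one member of each highly
#    correlated pair (cutoff 0.9 is arbitrary here)
cors <- cor(X)
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
drop <- unique(high[, 2])
X_reduced <- if (length(drop) > 0) X[, -drop] else X

# 2. Compress the remaining channels with PCA, keeping enough
#    components to explain ~90% of the variance
pca <- prcomp(X_reduced, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.9)[1]
scores <- pca$x[, 1:k]

# 3. k-means on the reduced representation (5 clusters, matching the
#    number of existing categories, purely for illustration)
km <- kmeans(scores, centers = 5, nstart = 25)
table(km$cluster)
```

Note that PCA components mix all the original channels, so this reduces dimensionality but does not by itself tell you *which* channels are redundant; the correlation filter in step 1 (or a sparse method, as discussed below) is what actually removes variables.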

Someone recommended that I use multinomial regression; can you recommend a good resource to get started with it?

I am using R. Many thanks

pedrosaurio
  • First off: is this a supervised or an unsupervised learning problem? I get the sense it's the former, as you mentioned multinomial regression and the 5 categorical levels. Are they 5 distinct levels, or ordinal factors? If your goal is to reduce redundant variables, have you considered using the lasso for a multinomial logit model? The book The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is online and free. – AdamO Dec 20 '11 at 18:53
  • Thanks for your answer. Based on it I have read a bit and can be more precise: I guess I want to do both, supervised and unsupervised, then put the biology in between and draw some conclusions. I will read about all the suggestions you made, thanks again. – pedrosaurio Dec 21 '11 at 11:41
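AdamO's lasso suggestion in the comment above could look like the following in R. This is a sketch assuming the glmnet package (not named in the thread); the simulated sizes and the grouped-penalty choice are illustrative:

```r
library(glmnet)

# Simulated stand-in data: 300 cases, 20 channels, a 5-level outcome
set.seed(1)
X <- matrix(rnorm(300 * 20), nrow = 300)
y <- factor(sample(letters[1:5], 300, replace = TRUE))

# Cross-validated multinomial lasso; "grouped" penalizes each channel's
# coefficients across all 5 classes together, so a channel is either
# kept for every class or dropped entirely
fit <- cv.glmnet(X, y, family = "multinomial", type.multinomial = "grouped")

# Channels with nonzero rows at the selected lambda are the ones retained
coef(fit, s = "lambda.min")
```

With noise-only data like this, the lasso will typically shrink most or all coefficients to zero; on real data, the surviving channels are a natural candidate set of non-redundant variables.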

1 Answer


There is a fairly new technique that Tibshirani and his student Daniela Witten developed, called "sparse clustering". I think it is meant exactly for this situation, where there are many features but we would like to find the small subset of them that really matters. It is available in R as the "sparcl" package, which implements sparse versions of k-means and hierarchical clustering.
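A minimal sketch of sparse k-means with sparcl, per the package's documentation (the data sizes, K = 5, and the wbounds grid are all illustrative assumptions, not values from the thread):

```r
library(sparcl)

# Simulated stand-in data; sparcl expects a numeric matrix
set.seed(1)
X <- scale(matrix(rnorm(100 * 50), nrow = 100))

# Choose the L1 bound on the feature weights via the permutation-based
# gap statistic (wbounds must be > 1; grid here is arbitrary)
perm <- KMeansSparseCluster.permute(X, K = 5, wbounds = seq(1.5, 6, by = 0.5))

# Fit sparse k-means at the selected bound; features whose weight is
# shrunk to zero play no role in the clustering
fit <- KMeansSparseCluster(X, K = 5, wbounds = perm$bestw)[[1]]
sum(fit$ws != 0)  # number of channels that actually drive the clustering
fit$Cs            # cluster assignment for each case
```

The nonzero entries of `fit$ws` answer the variable-selection part of the question directly: they are the channels the clustering considers informative, with the rest treated as redundant.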

JCWong
  • Thanks, I will have a look. For the moment I have plunged into a machine learning course so that I understand the basics and all the different alternatives out there. I will definitely try your suggestion afterwards. – pedrosaurio Jan 02 '12 at 15:12
  • 1
    Witten's page says "sparcl ...for clustering a set of n observations when p variables are available, where p>>n", but here I gather n=800, p=200 ? – denis Feb 08 '12 at 11:45
  • I somehow missed your comment, Denis, sorry. Yes, apparently this method performs better when p (features) >> n (observations); nevertheless, the paper states that even if it is not optimal, the method can also be used when p < n. – pedrosaurio Mar 02 '12 at 12:32
  • @JCWong: Hi! I am happy to see that you're using the R package `sparcl`. Do you use that often? I have a question on the results produced by a function in the package... – alittleboy Jan 08 '13 at 06:56