I've just started reading about clustering and classification. It's a jungle, a fascinating one. Currently, however, I have a rather urgent task: to perform a sort of cluster analysis, in the sense that I'd like to cluster my patients according to their phenotypes (biomarkers, both continuous and categorical variables) and examine whether survival differs according to cluster. I'm not interested in any specific predictor; the purpose is merely to examine whether there are specific clusters of patients and whether the phenotypes are associated with outcomes.

I'm looking for general advice on what type of method to use, as well as a recommended R package. I have 10 variables that are relevant for the phenotype. I could attach some data, but I doubt it would contribute to the question, which is of a more general character.

Thanks in advance.

Update: I'm looking for pros and cons of various techniques, as applied to this kind of data. And I humbly understand that clustering may not be that straightforward.

Adam Robinsson
  • Have you found anything in a PubMed search? I assume there is a lot of hierarchical clustering or k-medoids on the Gower distance metric. There is also the Shi et al. paper `Unsupervised Learning With Random Forest Predictors`; I'm not sure whether others have used it with success. – charles Jan 17 '16 at 15:24
  • Many thanks @charles. The paper was absolutely great, and it was accompanied by a separate programming tutorial as well. Very nice, thanks. – Adam Robinsson Jan 17 '16 at 22:14
  • Hope it's useful, but `caveat emptor`: I haven't tried it. All of clustering seems a tricky business without great answers. – charles Jan 18 '16 at 13:35
  • Have you read [Choosing clustering method](http://stats.stackexchange.com/q/3713/17230)? (And [Hierarchical clustering with mixed type data - what distance/similarity to use?](http://stats.stackexchange.com/q/15287/17230) for more about the Gower similarities @charles mentions.) Also, I was wondering: why not just predict outcomes directly from the biomarkers? Are the biomarkers thought to be only indirectly relevant to the outcomes (through predicting invisible classes to which patients belong)? – Scortchi - Reinstate Monica Jan 18 '16 at 18:12
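
The k-medoids-on-Gower-distance idea mentioned in the comments above can be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: the patient records are invented toy values, the function names (`column_ranges`, `gower_distance`, `k_medoids`) are hypothetical helpers written for this sketch, the Gower dissimilarity uses equal variable weights and assumes no missing values, and the clustering loop is a simple alternating k-medoids heuristic rather than the full PAM algorithm.

```python
import random

# Hypothetical toy records: (biomarker_a, biomarker_b, sex, smoker).
# All values are invented for illustration only.
patients = [
    (1.0, 10.0, "F", "no"),
    (1.2, 11.0, "F", "no"),
    (1.4, 12.0, "F", "no"),
    (5.0, 30.0, "M", "yes"),
    (5.2, 31.0, "M", "yes"),
    (5.4, 32.0, "M", "yes"),
]


def column_ranges(rows):
    """Return the range of each numeric column, or None for a categorical one."""
    ranges = []
    for col in zip(*rows):
        if all(isinstance(v, (int, float)) for v in col):
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)  # categorical columns are stored as strings here
    return ranges


def gower_distance(a, b, ranges):
    """Gower dissimilarity between two records (equal weights, no missing values)."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r          # numeric: range-scaled difference
    return total / len(ranges)


def k_medoids(dist, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix (not full PAM)."""
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)

    def assign(meds):
        # Label each point with the index of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid if cluster is empty
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


ranges = column_ranges(patients)
dist = [[gower_distance(a, b, ranges) for b in patients] for a in patients]
labels, medoids = k_medoids(dist, k=2)
print(labels, medoids)
```

In R, the same two steps are available in the `cluster` package as `daisy(x, metric = "gower")` followed by `pam()` on the resulting dissimilarities. The number of clusters `k` is an input here, so in practice it would still have to be chosen and checked, for example by comparing several values.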

1 Answer

There is no universal clustering solution.

You need to try lots of different methods and spend a lot of time on preprocessing and visualizing your data. Sorry.

See also:

Estivill-Castro, V. (2002). Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter, 4(1), 65-75.

Clustering is an art, and cannot be automated in a meaningful way with decent results.

Has QUIT--Anony-Mousse
  • Thanks @Anony-Mousse. I'm not looking for a definitive answer. I understand that these things are very complex. I'm looking for guidance on suitable methods, as there are many more methods than presumably necessary for my purpose. Thanks for the reference, I'll read it with great interest! – Adam Robinsson Jan 17 '16 at 22:13