
I am confused about what counts as a "sample" and what counts as a "variable" in a k-means model. Take a gene expression dataset with measurements of 1000 genes for 100 patients. When we cluster the patients into 10 groups based on their gene expression, the patients are usually presented as the "samples" or "data points" and the genes as the "variables" or "features". Conversely, if we cluster the genes into 10 groups based on their expression in those 100 patients, are the genes now the "samples", or are they still "variables"?
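To make the two tasks concrete, here is a minimal sketch. It assumes the data are stored as a patients-by-genes NumPy array; the name `expr`, the random values, and the helper `kmeans_labels` are placeholders of mine (a plain Lloyd's-style k-means standing in for whatever implementation is actually used), not real data or a specific library's API. The helper always clusters the rows of the matrix it is given, so the two tasks differ only in whether the matrix is transposed.

```python
import numpy as np


def kmeans_labels(data, n_clusters, n_iter=20, seed=0):
    """Hypothetical helper: plain Lloyd's k-means that clusters the ROWS of `data`."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct rows chosen at random (fancy indexing copies them).
    start = rng.choice(data.shape[0], size=n_clusters, replace=False)
    centers = data[start]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members; leave empty clusters unchanged.
        for j in range(n_clusters):
            members = data[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


# Placeholder expression matrix: 100 patients (rows) x 1000 genes (columns).
rng = np.random.default_rng(42)
expr = rng.normal(size=(100, 1000))

# Task 1: cluster the patients. Rows are patients, columns are genes.
patient_labels = kmeans_labels(expr, n_clusters=10)

# Task 2: cluster the genes. Transpose so that rows are genes, columns are patients.
gene_labels = kmeans_labels(expr.T, n_clusters=10)

print(patient_labels.shape)  # one cluster label per patient
print(gene_labels.shape)     # one cluster label per gene
```

In the first call the algorithm sees 100 rows of length 1000; in the second it sees 1000 rows of length 100.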

In case you wonder why I am concerned about those definitions: I am trying to compute the BIC (Bayesian Information Criterion) for different numbers of clusters in each of those two clustering tasks, and I am confused about what $n$ and $k$ should be in the BIC formula (https://en.wikipedia.org/wiki/Bayesian_information_criterion) for each task.
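For reference, the general definition on that Wikipedia page, as I read it, is the following, where $\hat{L}$ is the maximized value of the model's likelihood, $k$ is the number of parameters estimated by the model, and $n$ is the number of observations (the sample size):

$$\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})$$

The term I am unsure about is $k \ln(n)$: which quantities in each of the two clustering tasks play the roles of $k$ and $n$.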

user5054
  • In statistical notation, n is typically the number of cases, p the number of variables, and k the number of groups or clusters. To better understand the BIC criterion in cluster analysis, just search this site for `BIC clustering k-means`. – ttnphns Feb 10 '18 at 04:23
  • I already checked the related questions. I think you did not understand what I asked. I already know what $k$, $n$, etc. mean in the BIC formula. I am asking what values to use for them in the specific clustering tasks that I mentioned. My question is not about notation at all. – user5054 Feb 11 '18 at 03:22
  • Are you asking what the likelihood function will be in the case of k-means cluster analysis? If so, go back to the threads I referred to and re-read them. – ttnphns Feb 11 '18 at 07:46
  • No, I am not asking that; I know how to compute the likelihood. As I wrote earlier, what I am asking is what to use as $n$ and $k$ in the constant part of the BIC formula ($\ln(n)\,k$ in the Wikipedia article that I linked) in each of the two clustering tasks I mentioned. – user5054 Feb 11 '18 at 23:14
  • You should not blindly follow the Wikipedia formula, which is general. You say you've looked through the relevant threads. Here is [one](https://stats.stackexchange.com/q/55147/3277) where people discussed how exactly BIC for clustering could be computed. – ttnphns Feb 12 '18 at 07:31

0 Answers