-1

Say that we got a set $\mathcal{X} = \{x_1, x_2, \ldots\}$ of samples. You want to partition $\mathcal{X}$ into $k$ subsets, where $k$ is unknown besides the fact that $k \ge 1$. How clustering problem can be forumlated in probabilistic terms (as opposed to heuristics using distance measures)?

Here is an example: let $x$ be a sample, then for any cluster $i \in \{1, 2, \ldots, k\}$ find the probability $\Pr(C=i | X=x)$, where $C$ and $X$ are random variables that take values in sets $\{1, 2, \ldots, k\}$ and $\mathcal{X}$, respectively. How can you move forward with this approach? We don't even know $k$...

Tim
  • 108,699
  • 20
  • 212
  • 390
caveman
  • 2,431
  • 1
  • 16
  • 32
  • 3
    Probability of what? There are probabilistic Finite Mixture Models (https://stats.stackexchange.com/questions/130805/are-there-any-non-distance-based-clustering-algorithms/130810#130810) but I am not sure if this is what you are asking. You can check also https://stats.stackexchange.com/questions/23472/how-to-decide-on-the-correct-number-of-clusters – Tim Jan 30 '16 at 08:08
  • I think Tim's comment is the correct interpretation. However, as a 'keyword' I would consider Expectation-Maximisation (EM ) algorithm see https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm (which even has an animated gif) – seanv507 Jan 30 '16 at 14:39
  • @Tim you are free to formulate any probabilistic question as long as you satisfy one condition: answering it is a solution to the clustering problem. The clustering problem is: for any sample $x \in \mathcal{X}$, find the probability that it belongs to every cluster. – caveman Jan 30 '16 at 16:09
  • 1
    @caveman but this is *your* question, so you have to formulate it to be answerable... – Tim Jan 30 '16 at 16:22
  • @Tim, I updated my question with an example. Does it help? – caveman Jan 30 '16 at 16:31
  • 1
    If you define it that broad see the first link that I posted that defines finite mixture model. Distribution of $X$ is defined as mixture of $k$ distributions such that: $f(x, \vartheta) = \sum_k \pi_k f_k(x, \vartheta_k)$ with parameters $\pi_k, \vartheta_k$, the parameters and $k$ itself have to be found. – Tim Jan 30 '16 at 16:39
  • @Tim, thank you. Perfect. :) -- I am hooked now. So by that approach we model the PDF of each sub population (hard), from which we can later obtain the probabilities in my example above (easy). Now, would you like to post that as an answer so that I accept it? – caveman Jan 30 '16 at 17:04
  • 1
    @caveman I may provide such answer, but you have to edit your question to make it more precise and clear. As for now, as question is unclear, it is put on hold and nobody can answer it. We want this site to be knowledge repository, so we are looking for high-quality questions and answers that will be helpful not only for those who ask, but also for others. Editing to add information on probability of what you are interested in this problem would make this question more clear. – Tim Jan 31 '16 at 08:49
  • 1
    I edited it already hours after it was put on hold (thus expanded question section with example) and I don't see how more precision can be added there because the question itself is very broad. Maybe I am too stupid to see what I must do in order to comply. Anyway sucks to be me I guess. Thank you and highly appreciate your suppport – caveman Jan 31 '16 at 18:14
  • 1
    @caveman I edited your question, feel free to revert or change my edits if you find them unsuitable. – Tim Feb 01 '16 at 14:04

1 Answers1

2

There are

  • Gaussian Mixture Modeling
  • Akaike Information Criterion
  • Bayesian Information Criterion

that are somewhat based on probabilistic considerations.

But in the end you use clutering if you do not know how to model your data. I.e. when you have no idea of the probabilities.

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96
  • Am I right to think that *mixture models* (without adding *Gaussian*, cause it could be a mixture of non-Gaussians) is the most comprehensive answer? – caveman Jun 03 '16 at 00:24
  • Also may I know what have you suggested AIC and BIC? These seem to be heuristics to predict which model is better than another in the absence of ground truth information. Some kind of intrinsic evaluation criteria. Am I right? – caveman Jun 03 '16 at 00:40