Spark MLLib Gaussian Mixture Model feature or bug

Question

Is this expected behavior for a Gaussian Mixture Model? Given a perfectly homogeneous dataset, the cluster center is not exactly equal to the data point.

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

//Create a vector (180,3)
val v = Vectors.dense(180.0, 3.0)

//Create an RDD with one million elements, all set to 'v'
val tVrdd = sc.parallelize(Seq.fill(1000000)(v))

//Cluster the dataset into 10 clusters
val gmm = new GaussianMixture().setK(10).run(tVrdd)

//What's the clusterCenter?
scala> gmm.gaussians(0).mu
res11: org.apache.spark.mllib.linalg.Vector = [180.0000000000454,3.000000000001699]

As a note, I figured I could use KMeans to determine the number of clusters and then use that to set "k" for the Gaussian mixture.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val numClusters = 10
val numIterations = 20
val clusters = KMeans.train(tVrdd, numClusters, numIterations)

scala> clusters.k
res12: Int = 1

val k = clusters.k

val gmm = new GaussianMixture().setK(k).run(tVrdd)

//What's the clusterCenter now?
scala> gmm.gaussians(0).mu
res13: org.apache.spark.mllib.linalg.Vector = [180.0,3.0]

k-means is an old heuristic method for clustering data. There are better ways of determining cluster sizes. — Jon, Jul 11 '17 at 23:35
From reading your code, it looks like you create an array of 180 & 3 then you use GMM to find the cluster centers, is this correct? — Jon, Jul 11 '17 at 23:36
Yes, I am picking perfectly homogenous data to make the point that given perfectly homogenous data, GMM cluster centers are not the same as our provided data point. What's happening is - I have a large dataset that is mostly heterogenous except for subsets every now and then that are perfectly homogenous. So when GMM clusters these homogenous datasets, it places cluster center off where the center is supposed to be and that has other implications. — Joe Nate, Jul 12 '17 at 00:18
I am not sure what you mean, when you say "k-means is an old heuristic method" and "there are better ways". Why is it old/outdated and what are better methods? — Joe Nate, Jul 12 '17 at 00:21
So to summarize, if your question is why the centers are `[180.0000000000454,3.000000000001699]` vs `[180.0,3.0]`, that's a result of numerical methods occurring in GMM. It's *estimating* what the centers should be, and this will contain some slight error regardless of how homogeneous the data may be. — Jon, Jul 12 '17 at 16:58
Regarding k-means, I will fwd you to this thread: https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means/133694#133694 — Jon, Jul 12 '17 at 16:59
If you don't care to read that thread, here is a well put presentation on density based methods https://www.youtube.com/watch?v=5cOhL4B5waU&t=935s — Jon, Jul 12 '17 at 17:00
Thanks for both the links and the answers, Jon. If you post your comments as an answer, I can mark the question resolved. — Joe Nate, Jul 12 '17 at 17:47

score 1 · Accepted Answer · edited Jun 11 '20 at 14:32

To formalize an answer to this post, the Spark documentation reads:

Gaussian Mixture Model (GMM)

A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.

Given that your dataset consists of identical dense vectors, your question is why the values are not exactly what you expected, i.e. [180.0000000000454,3.000000000001699] vs [180.0,3.0].

Because the model uses an E-M algorithm, it iterates through estimates that come closer and closer to the true values without necessarily reaching them exactly. The remaining difference is just the result of numerical ("computational") error.

You can read more about the E-M algorithm in Gaussian Mixture Models here.

I should have added that the algorithm above maximizes the log-likelihood by re-estimating the parameters from conditional expected values (usually computed from the previous iteration's parameters). So the GMM will come close to estimating the true means (centers) of the clusters, but it is not guaranteed to reproduce them exactly.
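
As a rough illustration of where such rounding can come from, here is a minimal sketch of the weighted-mean update that an M-step performs, written in plain Scala without Spark. It is not Spark's implementation; the `weightedMean` helper and the random responsibilities are hypothetical stand-ins for what an E-step would produce.

```scala
import scala.util.Random

object WeightedMeanSketch {
  // Weighted mean of n copies of the same value x, accumulated in Double.
  // resp(i) is a hypothetical responsibility of one component for point i.
  def weightedMean(x: Double, n: Int, resp: Int => Double): Double = {
    var weightedSum = 0.0
    var weightSum = 0.0
    var i = 0
    while (i < n) {
      val r = resp(i)
      weightedSum += r * x // products and running sums may be rounded
      weightSum += r
      i += 1
    }
    weightedSum / weightSum
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Fractional responsibilities, as when k > 1 components share the points.
    val soft = weightedMean(180.0, 1000000, _ => rng.nextDouble())
    // Every responsibility exactly 1.0, as when one component owns all points.
    val hard = weightedMean(180.0, 1000000, _ => 1.0)
    println(s"fractional responsibilities: $soft (error ${math.abs(soft - 180.0)})")
    println(s"responsibilities of 1.0:     $hard")
  }
}
```

With responsibilities of exactly 1.0, every intermediate sum is an integer that a Double represents exactly, so the result is exactly 180.0. With fractional responsibilities the sums are rounded along the way, so a small residual can remain. This may be why the k = 1 run in the question returned [180.0,3.0] while the k = 10 run did not, although that is an assumption about Spark's internals rather than something confirmed here.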

answered Jul 12 '17 at 18:05 by Jon (edited Jun 11 '20 at 14:32 by Community)

Unfortunately, DBSCAN hasn't been implemented yet for Spark ML/MLlib and it's beyond my skills to create a lib :) https://issues.apache.org/jira/browse/SPARK-5226 A couple of implementations exist in the wild but their stability/efficiency isn't well known, apparently. – Joe Nate Jul 12 '17 at 22:12
Not sure if Spark gives you BIC (you can probably write a function to compute it), but you should use BIC to compare cluster results and come up with an optimal number of clusters for the GMM model. – Jon Jul 12 '17 at 22:34
Note that GMM is sensitive (like kmeans++) to the initial number of clusters, so that's why you should use the BIC to compare. It's what makes GMM much more advantageous than k-means algorithms. – Jon Jul 12 '17 at 22:35
Thanks again for the pointers. I looked at using kmeans to initialize GMM but shelved that due to some issue with the GMM itself. Once I have the whole application workflow done, I will definitely investigate using BIC to initialize GMM. I did see this note about GMM initialization: https://www.mathworks.com/help/stats/clustering-using-gaussian-mixture-models.html Need to read/research/test more. Thanks. – Joe Nate Jul 13 '17 at 00:48
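
The BIC comparison suggested in the comments above could be sketched roughly as follows, in plain Scala. The helper names `gmmNumParams` and `bic` are hypothetical, the parameter count assumes full covariance matrices, and the log-likelihood values are made-up inputs that would have to be computed from each fitted model.

```scala
object BicSketch {
  // Free parameters of a k-component, d-dimensional GMM with full covariances:
  // (k - 1) mixing weights + k*d means + k*d*(d+1)/2 covariance entries.
  def gmmNumParams(k: Int, d: Int): Int =
    (k - 1) + k * d + k * d * (d + 1) / 2

  // BIC = p * ln(n) - 2 * ln(L); lower values indicate a better trade-off
  // between goodness of fit and model complexity.
  def bic(logLikelihood: Double, numParams: Int, numPoints: Long): Double =
    numParams * math.log(numPoints.toDouble) - 2.0 * logLikelihood

  def main(args: Array[String]): Unit = {
    // Hypothetical log-likelihoods for two candidate models on 1000 2-D points.
    val candidates = Seq((1, -2500.0), (2, -2480.0))
    for ((k, ll) <- candidates) {
      val p = gmmNumParams(k, 2)
      println(s"k=$k params=$p BIC=${bic(ll, p, 1000L)}")
    }
  }
}
```

One would fit a model for each candidate k, compute the BIC for each, and keep the k with the lowest value.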

score 0 · Answer 2 · answered Jul 13 '17 at 06:17

Floating point numbers only provide about 7-8 (single precision) and 15-16 (double precision) decimal digits of precision.

So the result is entirely within what would be called "precise".

Yet I am a bit surprised that this happens so easily. Have you checked whether this also happens in other implementations, such as ELKI?
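
To put the precision argument in numbers, the following small sketch compares the value reported in the question against the spacing of adjacent doubles near 180.0. The `relativeError` helper is a hypothetical name introduced for illustration.

```scala
object PrecisionSketch {
  // Relative error of an estimate with respect to the expected value.
  def relativeError(estimate: Double, expected: Double): Double =
    math.abs(estimate - expected) / math.abs(expected)

  def main(args: Array[String]): Unit = {
    val reported = 180.0000000000454 // value from the question
    val expected = 180.0
    println(s"relative error:     ${relativeError(reported, expected)}")
    println(s"spacing near 180.0: ${math.ulp(expected)}")
    println(s"error in ulps:      ${math.abs(reported - expected) / math.ulp(expected)}")
  }
}
```

The relative error is on the order of 1e-13, roughly three orders of magnitude above the spacing of adjacent doubles near 180.0. That is consistent with rounding accumulated over many additions rather than a single rounding step.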

answered Jul 13 '17 at 06:17 by Has QUIT--Anony-Mousse

Haven't tried ELKI yet. Will try to compare and post results. Also, Spark-R. Need to better understand limitations of non-Spark-native libs, in terms of being able to deal with distributed computing. – Joe Nate Jul 13 '17 at 18:44
