Is this expected behavior for a Gaussian mixture model? Given a perfectly homogeneous dataset (every point identical), the cluster center is not exactly equal to the data point.
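import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors
//Create the 2-dimensional vector (180,3)
val v = Vectors.dense(180.0, 3.0)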
//Create an RDD of 1,000,000 elements, all set to 'v'
val tVrdd = sc.parallelize(Seq.fill(1000000)(v))
//Cluster the dataset into 10 clusters
val gmm = new GaussianMixture().setK(10).run(tVrdd)
//What's the clusterCenter?
scala> gmm.gaussians(0).mu
res11: org.apache.spark.mllib.linalg.Vector = [180.0000000000454,3.000000000001699]
As a side note, I figured I can use KMeans to determine the number of clusters and then use that to set "k" for the Gaussian mixture.
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
val numClusters = 10
val numIterations = 20
val clusters = KMeans.train(tVrdd, numClusters, numIterations)
scala> clusters.k
res12: Int = 1
val k = clusters.k
val gmm = new GaussianMixture().setK(k).run(tVrdd)
//What's the clusterCenter now?
scala> gmm.gaussians(0).mu
res13: org.apache.spark.mllib.linalg.Vector = [180.0,3.0]
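One possible explanation for the difference between the two runs, offered as an assumption rather than a confirmed diagnosis of Spark's internals: EM computes each mean as a responsibility-weighted average, mu = sum(r_i * x_i) / sum(r_i). With k = 1 every responsibility is exactly 1.0, so the sums stay exact in double precision. With k = 10 the responsibilities are fractions such as 0.1, which are not exactly representable, so rounding error can accumulate over a million points. The plain-Scala sketch below (no Spark; `RoundingDemo`, `weightedMean` and `approxEqual` are hypothetical names, and the constant responsibility 0.1 is a simplification) imitates that computation for a single coordinate.

```scala
// Hypothetical illustration only: this is NOT Spark's GaussianMixture code.
object RoundingDemo {
  // Responsibility-weighted mean of n identical values x, each with weight r,
  // accumulated sequentially in double precision.
  def weightedMean(x: Double, r: Double, n: Int): Double = {
    var weightedSum = 0.0
    var weightSum = 0.0
    var i = 0
    while (i < n) {
      weightedSum += r * x
      weightSum += r
      i += 1
    }
    weightedSum / weightSum
  }

  // Compare two doubles with an absolute tolerance instead of ==.
  // The default tolerance is an arbitrary choice for this sketch.
  def approxEqual(a: Double, b: Double, tol: Double = 1e-6): Boolean =
    math.abs(a - b) <= tol

  def main(args: Array[String]): Unit = {
    val n = 1000000
    // Weights of 1.0 (the k = 1 case): every partial sum is an exactly
    // representable integer, so no rounding occurs.
    println(weightedMean(180.0, 1.0, n))
    // Weights of 0.1 (a stand-in for the k = 10 case): 0.1 is not exactly
    // representable, so the result may differ slightly from 180.0.
    println(weightedMean(180.0, 0.1, n))
    // A tolerance-based check still treats the center as equal to the point.
    println(approxEqual(weightedMean(180.0, 0.1, n), 180.0))
  }
}
```

If this is indeed what is happening, comparing cluster centers against data points with a tolerance such as `approxEqual` instead of exact equality would be the safer check.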