I am working on Big Data Analytics, I would like to know how well k means algorithm can be used for clustering Big Data?
-
You should define the size of the data (big can be 10 000, 1 million, more ?) and give more information about it (dimensionality ?) – RUser4512 Nov 23 '15 at 23:16
-
The data is in terabytes – SSoans Nov 24 '15 at 14:10
-
Terabytes? Of **numerical** data? You cannot use k-means on tweets. – Has QUIT--Anony-Mousse Nov 24 '15 at 16:10
2 Answers
K-Means is good for large datasets if you're prioritizing speed
One of the main advantages of K-Means is that it is the fastest partitional clustering method: datasets that would take an impractically long time with comparable methods can be clustered in reasonable time.
If you compare the time complexities of K-Means with other methods: K-Means is $O(tkn)$, where $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations needed to converge. K-Medoids is $O(k(n-k)^2)$ per iteration. Agglomerative hierarchical clustering is $O(n^3)$ (more efficient agglomerative techniques are $O(n^2)$), while exhaustive divisive hierarchical clustering is $O(2^n)$, so you can see why these don't scale well to large datasets.
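To make the $O(tkn)$ cost concrete, here is a minimal NumPy sketch of Lloyd's k-means (illustrative only; real implementations use smarter initialization such as k-means++):

```python
import numpy as np

def kmeans(X, k, t=10):
    """Minimal Lloyd's k-means. Each of the t iterations computes n*k
    point-to-centroid distances, hence the O(tkn) overall cost."""
    centroids = X[:k].astype(float)  # naive init; k-means++ is better in practice
    for _ in range(t):
        # O(nk): distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # O(n): recompute each centroid as the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two tiny, well-separated groups (even-indexed vs. odd-indexed points)
X = np.array([[0., 0.], [10., 10.], [0.2, 0.], [10., 9.8], [0., 0.1], [9.9, 10.]])
centroids, labels = kmeans(X, k=2)
```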
K-Means does have drawbacks
As I said, K-Means is what you want to use if you're looking for speed. However, bear in mind that K-Means has drawbacks:
- You need to specify the number of clusters in advance (you could try a range of values, or use methods such as the elbow criterion or the silhouette score to estimate a good number).
- It struggles to find clusters that don't have a roughly spherical shape.
- It struggles to find clusters of vastly different sizes.
- It is less able to handle noise and outliers than other methods.
You can also use other methods if you take a sample of the data
The methods that do not work well with large datasets are often more accurate than K-Means (which is why they take longer). You can still use them if you take a sample of the large dataset: either a uniform random sample, or a biased sample that over-represents denser areas. CLARA (Clustering LARge Applications), for example, repeatedly samples a large dataset and runs K-Medoids on each sample.
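As an illustration of the sample-then-assign idea, here is a sketch using SciPy's `kmeans2` on simulated data (CLARA proper repeats this over several samples and uses K-Medoids/PAM rather than K-Means):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(42)
# Stand-in for a "large" dataset: 100k points around two centres
big = np.vstack([rng.normal(0.0, 1.0, size=(50_000, 2)),
                 rng.normal(8.0, 1.0, size=(50_000, 2))])

# Cluster a random sample only ...
sample = big[rng.choice(len(big), size=2_000, replace=False)]
centroids, _ = kmeans2(sample, k=2, minit='++', seed=1)

# ... then assign every point of the full dataset to its nearest centroid.
labels = np.linalg.norm(big[:, None, :] - centroids[None, :, :],
                        axis=2).argmin(axis=1)
```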
Source: I work with clustering. I am not an expert, but I have hands-on experience with K-Means and other methods.

-
+1. Welcome to the site @YangLi. If you have big-O notation or any other mathematical formulas you want to show you can use `$...$` to have them render appropriately. So `$O(n^3)$` will appear as $O(n^3)$. – usεr11852 Nov 23 '15 at 20:36
k-means is useless for "big data"
- there is no "big data" suitable for k-means.
The only big data you have is unstructured text and video. K-means cannot be used on such data.
k-means only works on low-dimensional, continuous, dense numeric data. Say you have 10 dimensions (a point at which k-means usually already struggles) stored in double precision: that is 80 bytes per object, so you can fit billions of objects into a single machine, and it will outperform your compute cluster.
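The arithmetic behind that claim, using the 10-dimension figure from above:

```python
dims = 10                  # dimensions per object
bytes_per_double = 8       # IEEE 754 double precision
bytes_per_object = dims * bytes_per_double   # 80 bytes per object

# One terabyte of RAM on a single machine then holds:
objects_per_tb = 10**12 // bytes_per_object  # 12.5 billion objects
```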
- k-means does not benefit from big data.
It computes averages. You do not find significantly better averages by adding more data. Running k-means on a just-large-enough sample yields virtually the same result.
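A quick numerical illustration of this point on simulated data: the mean of a modest random sample is already within sampling error of the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(loc=5.0, scale=2.0, size=1_000_000)   # the "big" dataset

full_mean = full.mean()
sample = full[rng.choice(len(full), size=10_000, replace=False)]
sample_mean = sample.mean()

# Standard error is roughly scale/sqrt(10_000) = 0.02, so the two means
# agree to about two decimal places; 100x more data buys almost nothing.
gap = abs(full_mean - sample_mean)
```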
- k-means is an unreliable, crude heuristic that needs to be used with care.
You can't just throw data at k-means, run it, and expect anything useful out of it. It is too sensitive to preprocessing, the k parameter, feature selection, feature engineering and so on. Whenever you run k-means, you need to carefully validate the outcome. You cannot automate this. So how would you validate your "big data" outcome? You will make some error in preprocessing or using k-means and you will never know.
Of course every new big-data machine-learning toy adds "k-means clustering" to its feature list, because it is about the easiest algorithm you can find (little more than computing means). But I have never seen an example where k-means was used for "big data" with success, and I have never seen a good cluster implementation either, only crappy versions copied from the average textbook.

-
Thank you for this answer; it certainly gave me some insight. Do you suggest that k-means could be extended to suit big data? – SSoans Nov 24 '15 at 14:05
-
I don't think any clustering is successfully used with big data. What people use on big data is 1) ETL/preprocessing, and 2) supervised learning. For clustering, **just use a sample**, you will be fine. – Has QUIT--Anony-Mousse Nov 24 '15 at 16:06
-
Thanks Anony-Mousse. I am planning a proposal on enhancing the K-Means algorithm to suit big data. As you say K-Means will fail, can you suggest any other algorithm I could work on? Is there any research scope? – SSoans Nov 30 '15 at 13:16
-
I don't think so. Start with a *problem* to solve, so you don't get stuck when you have to show it is useful. "big data k-means" has been solved 10 times, but useful 0 times. – Has QUIT--Anony-Mousse Nov 30 '15 at 18:43