I am working on Big Data Analytics, I would like to know how well k means algorithm can be used for clustering Big Data?
-
You should define the size of the data (big can be 10 000, 1 million, more ?) and give more information about it (dimensionality ?) – RUser4512 Nov 23 '15 at 23:16
-
The data is in terabytes – SSoans Nov 24 '15 at 14:10
-
Terabytes? Of **numerical** data? You cannot use k-means on tweets. – Has QUIT--Anony-Mousse Nov 24 '15 at 16:10
2 Answers
K-Means is good for large datasets if you're prioritizing speed
One of the main advantages of K-Means is that it is the fastest partitional clustering method: datasets that would take an impractically long time with comparable methods can be clustered in reasonable time.
If you compare the time complexities of K-Means with other methods: K-Means is $O(tkn)$, where $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations needed to converge. K-Medoids is $O(k(n-k)^2)$ per iteration. Agglomerative hierarchical clustering is $O(n^3)$ (more efficient agglomerative techniques are $O(n^2)$), while exhaustive divisive hierarchical clustering is $O(2^n)$, so you can see why these don't scale well to large datasets.
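To make the $O(tkn)$ cost concrete, here is a minimal NumPy sketch of Lloyd's k-means (illustrative only; real implementations use smarter initialization such as k-means++):

```python
import numpy as np

def kmeans(X, k, t=10):
    """Minimal Lloyd's k-means. Each of the t iterations computes n*k
    point-to-centroid distances, hence the O(tkn) overall cost."""
    centroids = X[:k].astype(float)  # naive init; k-means++ is better in practice
    for _ in range(t):
        # O(nk): distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # O(n): recompute each centroid as the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two tiny, well-separated groups (even-indexed vs. odd-indexed points)
X = np.array([[0., 0.], [10., 10.], [0.2, 0.], [10., 9.8], [0., 0.1], [9.9, 10.]])
centroids, labels = kmeans(X, k=2)
```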
K-Means does have drawbacks
As I said, K-Means is what you want to use if you're looking for speed. However, bear in mind that K-Means has drawbacks:
- You need to specify the number of clusters in advance (you could try a range of values, or use methods such as the elbow criterion or the silhouette score to estimate a good number).
- It struggles to find clusters that don't have a roughly spherical shape.
- It struggles to find clusters of vastly different sizes.
- It is less able to handle noise and outliers than other methods.
You can also use other methods if you take a sample of the data
The methods that do not work well with large datasets are often more accurate than K-Means (which is why they take longer). You can still use them if you take a sample of the large dataset: either a uniform random sample, or a biased sample that over-represents denser areas. CLARA (Clustering LARge Applications), for example, repeatedly samples a large dataset and runs K-Medoids on each sample.
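As an illustration of the sample-then-assign idea, here is a sketch using SciPy's `kmeans2` on simulated data (CLARA proper repeats this over several samples and uses K-Medoids/PAM rather than K-Means):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(42)
# Stand-in for a "large" dataset: 100k points around two centres
big = np.vstack([rng.normal(0.0, 1.0, size=(50_000, 2)),
                 rng.normal(8.0, 1.0, size=(50_000, 2))])

# Cluster a random sample only ...
sample = big[rng.choice(len(big), size=2_000, replace=False)]
centroids, _ = kmeans2(sample, k=2, minit='++', seed=1)

# ... then assign every point of the full dataset to its nearest centroid.
labels = np.linalg.norm(big[:, None, :] - centroids[None, :, :],
                        axis=2).argmin(axis=1)
```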
Source: I work with clustering. I am not an expert, but I have hands-on experience with K-Means and other methods.

-
+1. Welcome to the site @YangLi. If you have big-O notation or any other mathematical formulas you want to show you can use `$...$` to have them render appropriately. So `$O(n^3)$` will appear as $O(n^3)$. – usεr11852 Nov 23 '15 at 20:36
k-means is useless for "big data"
- there is no "big data" suitable for k-means.
The only big data you have is unstructured text and video. K-means cannot be used on such data.
k-means only works on low-dimensional, continuous, dense numeric data. Say you have 10 dimensions (a point at which k-means usually already struggles) stored in double precision: that is 80 bytes per object, so you can fit billions of objects into a single machine, and it will outperform your compute cluster.
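The arithmetic behind that claim, using the 10-dimension figure from above:

```python
dims = 10                  # dimensions per object
bytes_per_double = 8       # IEEE 754 double precision
bytes_per_object = dims * bytes_per_double   # 80 bytes per object

# One terabyte of RAM on a single machine then holds:
objects_per_tb = 10**12 // bytes_per_object  # 12.5 billion objects
```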
- k-means does not benefit from big data.
It computes averages. You do not find significantly better averages by adding more data. Running k-means on a just-large-enough sample yields virtually the same result.
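A quick numerical illustration of this point on simulated data: the mean of a modest random sample is already within sampling error of the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(loc=5.0, scale=2.0, size=1_000_000)   # the "big" dataset

full_mean = full.mean()
sample = full[rng.choice(len(full), size=10_000, replace=False)]
sample_mean = sample.mean()

# Standard error is roughly scale/sqrt(10_000) = 0.02, so the two means
# agree to about two decimal places; 100x more data buys almost nothing.
gap = abs(full_mean - sample_mean)
```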
- k-means is an unreliable, crude heuristic that needs to be used with care.
You can't just throw data at k-means, run it, and expect anything useful out of it. It is too sensitive to preprocessing, the k parameter, feature selection, feature engineering and so on. Whenever you run k-means, you need to carefully validate the outcome. You cannot automate this. So how would you validate your "big data" outcome? You will make some error in preprocessing or using k-means and you will never know.
Of course every new big-data machine-learning toy adds "k-means clustering" to its feature list, because it is about the easiest algorithm you can find (little more than computing means). But I have never seen an example where k-means was used for "big data" with success, and I have never seen a good cluster implementation either, only crappy versions copied from the average textbook.

-
Thank you for this answer; it certainly gave me some insight. Do you suggest that k-means could be extended to suit big data? – SSoans Nov 24 '15 at 14:05
-
I don't think any clustering is successfully used with big data. What people use on big data is 1) ETL/preprocessing, and 2) supervised learning. For clustering, **just use a sample**, you will be fine. – Has QUIT--Anony-Mousse Nov 24 '15 at 16:06
-
Thanks Anony-Mousse. I am planning a proposal on enhancing the K-Means algorithm to suit big data. As you say K-Means will fail, can you suggest any other algorithm I could work on? Is there any research scope? – SSoans Nov 30 '15 at 13:16
-
I don't think so. Start with a *problem* to solve, so you don't get stuck when you have to show it is useful. "big data k-means" has been solved 10 times, but useful 0 times. – Has QUIT--Anony-Mousse Nov 30 '15 at 18:43