
I clustered some data (rows: text documents, columns: word frequencies) using the KMeans implementation in Scikit Learn. This, like most other centroid-based clustering algorithms, returns a centroid for each cluster.

I am now trying to identify features predictive of each cluster by comparing the value of a given feature in one centroid to its values in the other centroids. I simply subtract centroid 1's value for feature X from centroid 2's value for feature X (cf. below) and then sort the differences to see which features have the widest spread and are thus most strongly associated with a cluster.

  • Does this approach make sense / work?
  • If not, why not and what should I do instead? (I found this but don't know how exhaustive it is)

I feel that the approach is kind of simplistic, but can't think of a much better way either.

More detailed description of idea, just for clarification:

In a 3-cluster k-means, suppose centroid 1 is [1, 2, 0, ...], centroid 2 is [10, 2, 400, ...], and centroid 3 is [100, 2, 0, ...]. For feature 1 I compute abs(1 - 10) and abs(1 - 100); for feature 2, abs(2 - 2) and abs(2 - 2); and so on. I would interpret this to mean that feature 1 is strongly predictive of cluster 3 membership, while feature 2 is not predictive at all. In real life, each feature of course corresponds to a word. All my features are normalized as word count / total words. Any help is much appreciated!
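The per-feature comparison described above can be sketched in a few lines with scikit-learn and NumPy; the data here is random placeholder data, and taking the max-minus-min centroid value per feature is one simple way to summarize the pairwise absolute differences:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy document-term matrix: rows are documents, columns are normalized
# word frequencies (hypothetical data, just to illustrate the idea).
rng = np.random.default_rng(0)
X = rng.random((30, 5))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_  # shape (3, 5): one row per cluster

# Per-feature spread across centroids: max minus min centroid value.
# A large spread means at least one cluster sits far from the others
# on that feature; a spread near zero means the feature is uninformative.
spread = centroids.max(axis=0) - centroids.min(axis=0)
ranked = np.argsort(spread)[::-1]  # feature indices, widest spread first
```

In the 3-cluster example above, feature 1 would get spread abs(1 - 100) = 99 and feature 2 would get spread 0, so feature 1 would sort to the top.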

patrick
1 Answer


If it helps, this is roughly how I built my talking teddy bear, which is an AI-driven Twitter engine of sorts (but it's a physical bear).

The next step, of course, is not to forget combinations of features acting as accelerators. There is a really interesting example in Microsoft's Azure Machine Learning gallery that does a clustering of similar companies and provides a great starting point. Basically, it feature-hashes each company's Wikipedia page into n-grams, rolls up the counts, and fits them into various clusters. It works pretty darned well; I am using that strategy for identifying similar companies in a financial application to diversify portfolios.
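A rough Python analogue of that pipeline (the gallery sample itself is in R) might look like the following; the corpus and all parameter choices here are stand-ins, not the actual sample's settings:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

# Tiny stand-in corpus; in the gallery sample each "document" would be
# a company's Wikipedia page.
docs = [
    "bank lending credit loans deposits",
    "credit cards consumer banking loans",
    "oil drilling pipelines energy exploration",
    "renewable energy solar wind power",
]

# Feature-hash word unigrams and bigrams into a fixed-width sparse
# matrix, normalize the counts, then cluster the hashed vectors.
vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**12,
                        alternate_sign=False, norm="l1")
X = vec.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Hashing keeps the feature space at a fixed width regardless of vocabulary size, which is what makes this workable on large text like full Wikipedia pages.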

Advice: start simple; if it's not accurate enough, do more. If it's accurate enough for the ROI, then stop, ship, and start the next thing.

David Crook
  • thanks! just so i understand correctly: you'd say the approach is valid, and to get even more accurate results one could consider interactions/correlations between features? – patrick Apr 28 '16 at 17:43
  • Exactly. I typically start with the simplest fastest approach and build from there if necessary. – David Crook Apr 28 '16 at 17:49
  • Just for good measure, check out the sample in Azure Machine Learning's gallery. Most of the sample is written in R, but it is mostly stock R and therefore fairly easily transferable to Python. – David Crook Apr 28 '16 at 17:50