
Application/Desire: I want to cluster word2vec vectors with a density-based clustering algorithm (say DBSCAN/HDBSCAN, because there is a lot of noise in the data) using Python or R. I cannot compute the full pairwise distance matrix between vectors because the vocabulary is too large (>2.5 million words). DBSCAN/HDBSCAN in both R and Python do not directly support cosine distance as a metric.

Question: If I reduce the vectors (say 250 dimensions) down to, say, 50 dimensions with a non-linear dimensionality reduction algorithm such as t-SNE, an autoencoder, or a SOM, can I then use the Euclidean metric for density-based clustering? Do dimensionality reduction algorithms also change the appropriate distance metric, so that I could use Euclidean instead of cosine here?

Other suggestions are also welcome.

Steffen Moritz

1 Answer


As far as I know, sklearn's DBSCAN does support cosine.
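A minimal sketch of this, using toy two-dimensional data in place of real word2vec vectors (all data and parameter values here are illustrative assumptions, not part of the original answer):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-in for word vectors: two directional clusters.
# (Illustrative data only; real word2vec vectors would go here.)
rng = np.random.default_rng(0)
a = rng.normal(loc=[1.0, 0.0], scale=0.05, size=(20, 2))
b = rng.normal(loc=[0.0, 1.0], scale=0.05, size=(20, 2))
X = np.vstack([a, b])

# metric="cosine" makes sklearn's DBSCAN use cosine distance directly.
# Note that this falls back to brute-force neighbor search, which is
# O(n^2) -- consistent with the memory issues mentioned below for large n.
labels = DBSCAN(eps=0.02, min_samples=5, metric="cosine").fit_predict(X)
print(sorted(set(labels)))
```

Here the two bundles of directions come out as two clusters because their cosine distance is small within each bundle and near 1 across bundles.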

There is also ELKI, which I have used with cosine distance and DBSCAN. You can add an index (e.g., a cover tree for arc cosine) to accelerate DBSCAN. It's very fast and scalable, and it often still works when the others (in particular sklearn) run out of memory, provided you set the Java memory limit with the -Xmx parameter.

But at 2.5 million points, runtime may still be several hours, and you will need many iterations to tune parameters. You should consider sampling. At the very least, use a sample to tune parameters and validate your approach first.
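A rough sketch of the tune-on-a-sample idea, with random placeholder data standing in for the real embedding matrix (the eps values, sample size, and `vectors` array are all assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder for the real 2.5M x 250 word2vec matrix.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 50))

# Draw a random sample that is small enough for repeated runs.
sample_size = 2_000
idx = rng.choice(len(vectors), size=sample_size, replace=False)
sample = vectors[idx]

# Sweep candidate eps values on the sample only, and inspect the
# cluster count and noise fraction to pick parameters worth trying
# on the full data set.
for eps in (0.2, 0.4, 0.6):
    labels = DBSCAN(eps=eps, min_samples=10, metric="cosine").fit_predict(sample)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_frac = np.mean(labels == -1)
    print(f"eps={eps}: {n_clusters} clusters, {noise_frac:.0%} noise")
```

On real word vectors you would look for eps values where the noise fraction and cluster count stabilize before committing to an expensive full run.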

As for t-SNE and clustering: k-means clustering on the output of t-SNE is a very visual explanation of why t-SNE should only be used for visualization and not for clustering. (Short story: it preserves some neighborhoods, but neither distances nor densities; so anything distance- or density-based, like clustering, must not be run on its output.)

Has QUIT--Anony-Mousse
  • Thanks for your reply, Anony. I am a newbie when it comes to Java programming. I understand this is beyond the question I asked just now, but do you have working code for HDBSCAN in Java? Or can you guide me to some working code for HDBSCAN with cosine distance? – Sundaresh Prasanna Oct 16 '17 at 09:55
  • ELKI, already mentioned above, also appears to have HDBSCAN* with support for cosine. – Has QUIT--Anony-Mousse Oct 16 '17 at 15:06
  • Can autoencoders be used on word2vec vectors prior to clustering? – StatguyUser Jan 31 '18 at 06:10
  • To solve what mathematical problem? Don't stack functions just because they sound cool. – Has QUIT--Anony-Mousse Jan 31 '18 at 07:13
  • word2vec gives a dense vector. I want to reduce the dimensionality of the vector using autoencoder. Then use the reduced vectors for DBSCAN clustering. – StatguyUser Jan 31 '18 at 09:16
  • Why do you think it's better to reduce the dimensionality with an auto encoder, rather than learning a lower dimensionality word2vec? – Has QUIT--Anony-Mousse Jan 31 '18 at 18:41
  • @Anony-Mousse Multiple papers I have read on this subject suggest that vectors used as features perform well beyond 300 dimensions, with increments of 100. If that is the case, we can build a 300-dimensional doc2vec and later do dimensionality reduction with an autoencoder before passing the result to clustering. – StatguyUser Feb 01 '18 at 14:08
  • I know that 300 is a popular choice. But did any of them show that first learning 300 dimensions and then reducing them performs better than *directly* optimizing the low-dimensional representation? Because dimensionality reduction isn't free. It will reduce quality. – Has QUIT--Anony-Mousse Feb 01 '18 at 20:41