I have a dataset with 5 features: 2 continuous and 3 categorical. If I one-hot encode each of the categorical features, I end up with some 600 features per observation. I then use t-SNE to reduce the dimensionality of the data from ~600 to 2 and plot the result as a scatter plot. The plot shows some clear patterns and clusters, and if you zoom in, the data in the clusters make sense, i.e. points in a cluster are related to each other in some way.
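
For concreteness, the pipeline described above can be sketched roughly as follows. This is only an illustration under assumptions: the data is random stand-in data (not my real dataset), the category counts are made up and much smaller than the ~600 columns mentioned, and the scikit-learn settings (`perplexity=30`, `init="random"`) are arbitrary choices rather than tuned values.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-in data: 2 continuous and 3 categorical features.
continuous = rng.normal(size=(n, 2))
categorical = np.column_stack([
    rng.integers(0, 20, size=n),
    rng.integers(0, 15, size=n),
    rng.integers(0, 10, size=n),
]).astype(str)

# One-hot encode the categorical columns: one 0/1 column per observed category.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(categorical).toarray()

# Stack continuous and one-hot columns into a single wide feature matrix.
features = np.hstack([continuous, one_hot])

# Reduce the wide matrix to 2 dimensions for plotting.
embedding = TSNE(
    n_components=2, perplexity=30, init="random", random_state=0
).fit_transform(features)

print(features.shape, embedding.shape)
```

The `embedding` array is what gets passed to the scatter plot; `features` is the wide one-hot representation.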

My question is this: would it make sense to train a machine learning model, say a clustering or density model, on the 2-D t-SNE output? Or is that a statistical “no-no”, so that I should really only use t-SNE for visualisation and train my model on the original data with 5 features?

PyRsquared
  • Someone correct me if I am wrong here, but as far as I know you cannot use t-SNE on a new dataset. If that's the case, then you will not be able to apply a t-SNE "trained" on a training dataset to new data. In other words, it seems that with t-SNE you cannot add new data to an already existing plot. – Karolis Koncevičius Sep 24 '17 at 09:20
  • And see this answer for why clustering on t-SNE output can be *very* misleading, because **t-SNE does not preserve distances**: https://stats.stackexchange.com/a/264647/7828 – Has QUIT--Anony-Mousse Sep 24 '17 at 19:49
  • Possible duplicate of [Why is t-SNE not used as a dimensionality reduction technique for clustering or classification?](https://stats.stackexchange.com/questions/340175/why-is-t-sne-not-used-as-a-dimensionality-reduction-technique-for-clustering-or) – usεr11852 Jul 08 '18 at 21:36
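
The point raised in the comments about new data can be checked against scikit-learn, whose `TSNE` class exposes `fit_transform` but no separate `transform` method. The sketch below is one possible alternative under assumptions, not something the commenters proposed: it fits a clustering model (k-means, chosen arbitrarily) on the original feature matrix, where new points can be assigned with `predict`, and uses t-SNE only to draw the picture. The data is synthetic and the names `features` and `new_points` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in for the encoded feature matrix: three synthetic groups.
centers = rng.normal(scale=5.0, size=(3, 20))
features = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])
new_points = centers[0] + rng.normal(size=(5, 20))

# scikit-learn's TSNE cannot embed unseen points: it has no transform method.
has_transform = hasattr(TSNE, "transform")
print("TSNE has transform:", has_transform)

# Fit the clustering model in the original feature space instead.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# New observations get a cluster label without re-running t-SNE.
new_labels = kmeans.predict(new_points)

# Use t-SNE only for visualisation, e.g. colouring points by kmeans.labels_.
embedding = TSNE(
    n_components=2, perplexity=30, init="random", random_state=0
).fit_transform(features)

print(new_labels.shape, embedding.shape)
```

Whether k-means is appropriate for mixed continuous and one-hot features is a separate question; it stands in here only for "some model trained on the original features".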
