
In a recent assignment, we were told to use PCA on the MNIST digits to reduce the dimensions from 64 (8 x 8 images) to 2. We then had to cluster the digits using a Gaussian Mixture Model. PCA with only 2 principal components does not yield distinct clusters, and as a result the model cannot produce useful groupings.

However, using t-SNE with 2 components, the clusters are much better separated. The Gaussian Mixture Model produces more distinct clusters when applied to the t-SNE components.
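Roughly, the pipeline looked like the following minimal sketch (assuming scikit-learn's 8 x 8 digits dataset via `load_digits`; the exact parameter values here are illustrative, not the assignment's):

```python
# Sketch: reduce the 64-dimensional digits to 2-D with PCA and with t-SNE,
# then fit a 10-component Gaussian Mixture Model to each embedding.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

X, y = load_digits(return_X_y=True)  # X.shape == (1797, 64)

# PCA down to 2 principal components
X_pca = PCA(n_components=2).fit_transform(X)
labels_pca = GaussianMixture(n_components=10, random_state=0).fit_predict(X_pca)

# t-SNE down to 2 components
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
labels_tsne = GaussianMixture(n_components=10, random_state=0).fit_predict(X_tsne)
```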

The difference between PCA with 2 components and t-SNE with 2 components can be seen in the following pair of images, where both transformations have been applied to the MNIST dataset.

PCA on MNIST

t-SNE on MNIST

I have read that t-SNE is only used for visualization of high-dimensional data, such as in this answer. Given the distinct clusters it produces, why is it not used as a dimensionality reduction technique for classification models, or as a standalone clustering method?

amoeba
willk
  • Do you mean classification or clustering? The title says clustering but the post says classification. – usεr11852 Apr 12 '18 at 18:29
  • Sorry about that. I want to know why it is not used as a clustering technique or as a dimensionality reduction technique for classification. I've edited to reflect this. – willk Apr 12 '18 at 19:18
  • Coincidentally enough, a paper [released recently](https://www.sciencedirect.com/science/article/pii/S001021801830018X) uses t-SNE and an unsupervised clustering algorithm to label combustion processes. – tpg2114 Apr 12 '18 at 20:01
  • The answer that you linked demonstrates how misleading tSNE can be. You see clusters in the plot that do not exist in the data. That is harmful if you don't have labels. And don't draw too many conclusions from MNIST data. That is an extremely well behaved data set... – Has QUIT--Anony-Mousse Apr 12 '18 at 21:00
  • @Anony-Mousse Thanks for pointing that out. And there is always a danger in using too neat datasets! – willk Apr 13 '18 at 01:00
  • I've found [this article](https://t.co/wurP5pynUE) to be helpful in explaining t-SNE and its drawbacks. It has plenty of interactive visualizations that help emphasize the main points. – willk Apr 13 '18 at 01:04
  • It is used in this way though. This technique is common in single-cell genomics analysis and has been used (in exactly the way described) in many recent high-profile papers. – Tom Kelly Sep 13 '19 at 05:10

3 Answers


The main reason that $t$-SNE is not used in classification models is that it does not learn a function from the original space to the new (lower-dimensional) one. As such, when we try to use our classifier on new/unseen data, we will not be able to map/pre-process these new data according to the previous $t$-SNE results.
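To make this concrete: in scikit-learn, `TSNE` exposes only `fit_transform`; unlike `PCA`, it has no `transform` method for embedding unseen points into an existing map. A minimal illustration (the data here is random and purely illustrative):

```python
# PCA learns an explicit linear map, so it can project unseen data;
# t-SNE only embeds the points it was fit on.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X_train = rng.randn(200, 64)   # stand-in for training data
X_new = rng.randn(10, 64)      # unseen data

pca = PCA(n_components=2).fit(X_train)
Z_new = pca.transform(X_new)            # works: applies the learned map

tsne = TSNE(n_components=2, random_state=0)
Z_train = tsne.fit_transform(X_train)   # embedding of the fitted points only
print(hasattr(tsne, "transform"))       # False: no out-of-sample mapping
```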

There is work on training a deep neural network to approximate $t$-SNE results (e.g., the "parametric" $t$-SNE paper), but this work has been superseded in part by the existence of (deep) autoencoders. Autoencoders are starting to be used as input/pre-processors to classifiers (especially DNNs) exactly because they achieve very good performance in training and also generalise naturally to new data.
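For instance, a minimal autoencoder sketch (assuming PyTorch; the architecture, data, and training budget are purely illustrative) showing exactly the property $t$-SNE lacks: an explicit encoder that can be applied to new data:

```python
# An autoencoder learns an explicit encoder function, so unseen points
# can be mapped to the 2-D code space after training.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
decoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 64))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(1000, 64)          # stand-in for the 8x8 digit vectors
for _ in range(100):               # full-batch reconstruction training
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()

X_new = torch.randn(10, 64)        # unseen data: just apply the encoder
with torch.no_grad():
    Z_new = encoder(X_new)         # 2-D codes, usable by a classifier
```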

$t$-SNE can potentially be used with a non-distance-based clustering technique like FMM (Finite Mixture Models) or DBSCAN (density-based models). As you correctly note, in such cases the $t$-SNE output can be quite helpful. The issue in these use cases is that some people might try to read into the cluster placement and not only the cluster membership. As the global distances are lost, drawing conclusions from cluster placement can lead to bogus insights. Notice that just saying "hey, we found that all the 1s cluster together" does not offer great value if we cannot say what they are far from. If we just wanted to find the 1s, we might as well have used classification to begin with (which brings us back to the use of autoencoders).
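A hedged sketch of that use case (again on scikit-learn's digits; `eps` and `min_samples` are illustrative and would need tuning for a real map): only the cluster membership from DBSCAN should be used, not the distances between clusters.

```python
# DBSCAN on the 2-D t-SNE map: use the resulting cluster labels,
# but do not read meaning into how far apart the clusters sit.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X, _ = load_digits(return_X_y=True)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_tsne)  # -1 marks noise
```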

gung - Reinstate Monica
usεr11852
  • The Q seems to ask more about clustering than about classification. At least clustering is in the title. – amoeba Apr 12 '18 at 18:25
  • @amoeba: I thought the same and wrote about potential use through non-distance based clustering (eg. FMM, DBSCAN), but then I read the question: "*why is it not used as a dimensionality reduction technique that is then used for classification models?*" – usεr11852 Apr 12 '18 at 18:26
  • Yes, but the title Q is different. I think OP might be confused about the difference so it might make sense to address *both* in your A! – amoeba Apr 12 '18 at 18:29
  • OK.. OK... Slave-driving eukaryote... :P – usεr11852 Apr 12 '18 at 18:31
  • @amoeba: Thank you for the suggestion. It made the answer more complete. – usεr11852 Apr 12 '18 at 18:40
  • Thanks. My original question was a little vague so I clarified to ask why t-SNE is not used for dimensionality reduction or clustering. Luckily this answer addresses both points! – willk Apr 12 '18 at 19:21
  • @caseWestern: No problem, I am glad the answer is helpful. – usεr11852 Apr 12 '18 at 20:32
  • This is a good answer, but don't FMMs & DBSCAN still use distance? Eg, DBSCAN has a reachability distance based on some underlying conception of distance. Why wouldn't that be invalidated by the points you raise here? – gung - Reinstate Monica Apr 18 '18 at 20:38
  • @gung: It is less of distance in the sense of measure and more like similarity in terms of distribution/model. I fully accept it is a debatable point; given we define two data-points being different from one another, we naturally accept some notion of dissimilarity. A similar assessment on "measure vs model"-based approach was made by Tim [here](https://stats.stackexchange.com/questions/130805/are-there-any-non-distance-based-clustering-algorithms/130810#130810). – usεr11852 Apr 19 '18 at 18:55
  • (+1) I'd be very interested in hearing your thoughts on this clustering/t-SNE answer https://stats.stackexchange.com/questions/263539 I just posted. CC also to @caseWestern - this might be of interest to you too. – amoeba Jun 19 '18 at 14:54

t-SNE does not preserve distances; it essentially estimates probability distributions. The t-SNE algorithm maps the input to a map space of 2 or 3 dimensions. Pairwise similarities in the input space are modeled with Gaussian distributions, and those in the map space with a Student t-distribution. The loss function is the KL divergence between the two distributions, which is minimized using gradient descent.
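For reference, a sketch of the objective in the original paper's notation, where the $p_{ij}$ are the symmetrised Gaussian-based input-space similarities and the $y_i$ are the map points:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}.$$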

According to Laurens van der Maaten, a co-author of t-SNE:

> t-SNE does not retain distances but probabilities, so measuring some error between the Euclidean distances in high-D and low-D is useless.

References:

https://lvdmaaten.github.io/tsne/

https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm

prashanth

As a general statement: given a sufficiently powerful (or suitable) classifier or clusterer, one would never apply any dimensionality reduction.

Dimensionality reduction loses information.

This is because such a clusterer or classifier (especially classifiers, less so clusterers) already internally incorporates some form of projection to a meaningful space, and dimensionality reduction is also a projection to a (hopefully) meaningful space.

But dimensionality reduction has to do so in an uninformed way -- it does not know what task you are reducing for. This is especially true for classification, where you have outright supervised information. But it also applies to clustering, where the space one would want to project to for clustering is better defined (for this algorithm) than just "having fewer dimensions". @usεr11852's answer talks about this. As I said, dimensionality reduction does not know what task you are reducing for -- you inform it only through your choice of which dimensionality reduction algorithm to use. A small illustration follows.
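As a hedged sketch of "uninformed" versus task-aware projection (assuming scikit-learn; LDA stands in here for a projection that knows about the classification task):

```python
# PCA picks its 2-D projection from the data alone; LDA uses the labels,
# so its projection is informed by the classification task.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)                             # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised
```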

So, rather than adding a dimensionality reduction step as preprocessing before clustering/classification, one is often better off using a different classifier/clusterer that incorporates a useful projection.

One thing dimensionality reduction does have going for it here, though, is its unsupervised nature in creating the projection to the (hopefully) meaningful space, which is useful if you have little labelled data. But there are often other methods closely linked to your classifier (e.g., for neural networks, autoencoder or deep belief network pretraining) that will work better, because they are designed with that final task in mind, not the more general task of dimensionality reduction.

Lyndon White