What's wrong with t-SNE vs PCA for dimensional reduction using R?

Question

I have a 336x256 matrix of floating point numbers (336 bacterial genomes (columns) x 256 normalized tetranucleotide frequencies (rows), i.e. every column adds up to 1).

I get nice results when I run my analysis using principal component analysis. First I calculate the kmeans clusters on the data, then run a PCA and color the data points based on the initial kmeans clustering in 2D and 3D:

library(tsne)
library(rgl)
library(FactoMineR)
library(vegan)
# read input data (transposed so that genomes become rows)
mydata <- t(read.csv("freq.out", header = TRUE, stringsAsFactors = FALSE, sep = "\t", row.names = 1))
# kmeans clustering with 5 centers and at most 10000 iterations
km <- kmeans(mydata, centers = 5, iter.max = 10000)
# run principal component analysis
pc <- prcomp(mydata)
# plot dots
plot(pc$x[,1], pc$x[,2], col = km$cluster, pch = 16)
# plot spiderweb and connect outliers with dotted line
# (kept in a separate variable so that pc$x is still available below)
pc2d <- cbind(pc$x[,1], pc$x[,2])
ordispider(pc2d, factor(km$cluster), label = TRUE)
ordihull(pc2d, factor(km$cluster), lty = "dotted")

[Image: 2D PCA plot, points colored by kmeans cluster]

# plot the third dimension
pc3d<-cbind(pc$x[,1], pc$x[,2], pc$x[,3])
plot3d(pc3d, col = km$cluster,type="s",size=1,scale=0.2)

[Image: 3D PCA plot, points colored by kmeans cluster]

But when I try to swap the PCA with the t-SNE method, the results look very unexpected:

tsne_data <- tsne(mydata, k=3, max_iter=500, epoch=500)
plot(tsne_data[,1], tsne_data[,2], col=km$cluster, pch=16)
ordispider(tsne_data, factor(km$cluster), label = TRUE)
ordihull(tsne_data, factor(km$cluster), lty = "dotted")

[Image: 2D t-SNE plot, points colored by kmeans cluster]

plot3d(tsne_data, main="T-SNE", col = km$cluster,type="s",size=1,scale=0.2)

[Image: 3D t-SNE plot, points colored by kmeans cluster]

My question here is why the kmeans clustering is so different from what t-SNE calculates. I would have expected an even better separation between the clusters than what the PCA does but it looks almost random to me. Do you know why this is? Am I missing a scaling step or some sort of normalization?

Please note that with PCA, too, you often won't get results as "good" as the ones you happened to get. Clustering on many features and then projecting the clusters onto the subspace of just the first few PCs may well show a picture like the one you obtained here for t-SNE, unless those PCs capture almost all the variability. Did you compare what portion of the variability is captured by your first 3 PCs and your first 3 t-SNE dimensions? — ttnphns, Nov 07 '14 at 08:44
I have played with the iterations with up to 2000 and also played with various perplexity settings, but never seen something even close to the performance the PCA shows. — Loddi, Nov 07 '14 at 19:39
You should try a larger perplexity, which will result in fewer clusters. Also, I would try to create a map for the 256 attributes by using the transposed data table. If the attribute map is a random cloud, the PCA map might be less trustworthy. Another way to validate PCA or tSNE is to create a map for a subset of your data, say a single cluster created with kmeans. That map should be similar to the corresponding fragment of the map created for the whole dataset. — James LI, May 18 '15 at 17:27
tSNE has a theoretical optimum perplexity that minimizes the KL divergence between your data in its original and projected dimensions. Have you tried first doing a grid search for perplexity? E.g. 10,20,30,40,etc — Alex R., Nov 03 '16 at 22:14
The R package "tsne" actually contains one small bug. And you can fix it with the function fix(tsne). [![enter image description here](http://i.stack.imgur.com/rS0kM.jpg)](http://i.stack.imgur.com/rS0kM.jpg) Fix it this way:[![enter image description here](http://i.stack.imgur.com/QSYiM.jpg)](http://i.stack.imgur.com/QSYiM.jpg) — Rum Wei, Jun 30 '16 at 00:40
Some people in my lab run tsne on the samples loadings on the first few principal components. Seems to work really well. So for you: `tsne_data — kmace, Jun 07 '17 at 16:30
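
ttnphns's question about how much variability the first three PCs capture can be checked directly from the `prcomp` output. The following is a minimal sketch on synthetic stand-in data (the matrix `X` is hypothetical and only mimics the shape of `mydata`). Note that t-SNE axes have no comparable "variance explained" figure, so this check applies to the PCA side only.

```r
# Synthetic stand-in for `mydata` (336 genomes x 256 frequencies); hypothetical data.
set.seed(1)
X <- matrix(runif(336 * 256), nrow = 336)
X <- X / rowSums(X)            # each genome's frequencies sum to 1

pc <- prcomp(X)                # centers by default, does not scale
var_explained <- pc$sdev^2 / sum(pc$sdev^2)

# Cumulative share of variance captured by the first three PCs
cum3 <- sum(var_explained[1:3])
print(round(cumsum(var_explained)[1:3], 3))
```

If `cum3` is close to 1, a 2D or 3D PCA plot shows nearly everything kmeans saw; if it is small, the clusters were formed mostly in dimensions the plot does not display.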

score 10 · Answer 1 · answered Nov 03 '16 at 21:18

You have to understand what TSNE does before you use it.

It starts by building a neighborhood graph between feature vectors based on distance.

The graph connects a node (feature vector) to its n nearest nodes (in terms of distance in feature space). This n is governed by the perplexity parameter, which can be read as a smooth measure of the effective number of neighbors.

The purpose of building this graph is rooted in the sort of sampling TSNE relies on to build its new representation of your feature vectors.

A sequence for TSNE model building is generated using a random walk on your TSNE feature graph.

In my experience... a few of my problems came from reasoning about how feature representation affects the building of this graph. I also play around with the perplexity parameter, as it has an effect on how focused my sampling is.
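
To make the role of perplexity more concrete: in standard t-SNE, each point gets a Gaussian kernel whose width is tuned so that the resulting neighbor distribution has the requested perplexity (2 to the power of its entropy in bits). The sketch below illustrates that calibration for a single point in base R. The function names (`cond_prob`, `perplexity_of`, `calibrate_beta`) and the toy data are hypothetical, and this is an illustration of the idea, not the code of any particular t-SNE package.

```r
# Conditional probabilities p_{j|i} for one point i, given squared distances
# d2 to all other points and a Gaussian precision beta = 1 / (2 * sigma^2).
cond_prob <- function(d2, beta) {
  p <- exp(-(d2 - min(d2)) * beta)   # subtracting min(d2) only improves stability
  p / sum(p)
}

# Perplexity = 2^H, with H the Shannon entropy (in bits) of p.
perplexity_of <- function(p) {
  p <- p[p > 0]
  2^(-sum(p * log2(p)))
}

# Bisection on beta until the perplexity matches the target.
calibrate_beta <- function(d2, target, tol = 1e-5, max_steps = 100) {
  lo <- 0; hi <- Inf; beta <- 1
  for (step in seq_len(max_steps)) {
    perp <- perplexity_of(cond_prob(d2, beta))
    if (abs(perp - target) < tol) break
    if (perp > target) {             # too many effective neighbors: narrow the kernel
      lo <- beta
      beta <- if (is.finite(hi)) (lo + hi) / 2 else beta * 2
    } else {                         # too few effective neighbors: widen the kernel
      hi <- beta
      beta <- (lo + hi) / 2
    }
  }
  beta
}

set.seed(42)
X <- matrix(rnorm(200 * 10), nrow = 200)   # hypothetical toy data
d2 <- as.matrix(dist(X))^2
i <- 1
beta_i <- calibrate_beta(d2[i, -i], target = 30)
p_i <- cond_prob(d2[i, -i], beta_i)
perp_i <- perplexity_of(p_i)
```

A small perplexity concentrates `p_i` on a handful of nearest neighbors, while a large one spreads it over much of the dataset, which is why the setting changes how local or global the resulting map looks.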

RUser4512 · Answer 2 · 2020-03-08T18:19:22.280

It is hard to compare these approaches.

PCA is parameter free. Given the data, you just have to look at the principal components.

On the other hand, t-SNE relies on several parameters: perplexity, early exaggeration, learning rate, number of iterations, though default values usually provide good results.

So you can't just compare them; you have to compare the PCA to the best result you can achieve with t-SNE (or the best result you achieved over several tries of t-SNE). Otherwise, it would be equivalent to asking "why does my linear model perform better than my (untuned) gradient boosting model?".
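
One way to get "the best result over several tries" is to repeat t-SNE with different random starts and keep the run with the lowest final KL divergence. The sketch below assumes the `Rtsne` package is installed, which is an assumption since the question uses the `tsne` package instead. The helper `best_tsne` and the toy data are hypothetical.

```r
# Assumption: the Rtsne package is available; it is not the package used in the question.
best_tsne <- function(X, perplexity = 30, seeds = 1:5, dims = 2, max_iter = 1000) {
  best <- NULL
  for (s in seeds) {
    set.seed(s)                                   # different random start per run
    fit <- Rtsne::Rtsne(X, dims = dims, perplexity = perplexity,
                        max_iter = max_iter, check_duplicates = FALSE)
    kl <- sum(fit$costs)                          # final KL divergence of this run
    if (is.null(best) || kl < best$kl) {
      best <- list(Y = fit$Y, kl = kl, seed = s, perplexity = perplexity)
    }
  }
  best
}

if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(rnorm(150 * 20), nrow = 150)        # hypothetical toy data
  # One "best" embedding per perplexity value
  runs <- lapply(c(5, 15, 30), function(p) {
    best_tsne(X, perplexity = p, seeds = 1:3, max_iter = 300)
  })
  print(sapply(runs, function(r) c(perplexity = r$perplexity, kl = r$kl, seed = r$seed)))
}
```

Within one perplexity setting, the KL value is a reasonable way to pick among random restarts. Across different perplexities the objective itself changes, so the values may not be directly comparable and the maps are better compared by eye.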

Edit: After seeing many questions related to the differences between these two approaches, I wrote this blog post summarizing the pros and cons of each method.

Nestor Demeure · Answer 3 · 2017-07-17T09:37:35.380

I ran t-SNE on a dataset to replace PCA and (despite the bug that Rum Wei noticed) got better results. In my application, rough PCA worked well while rough t-SNE gave me random-looking results. This was due to the scaling/centering step that is included in the PCA (by default in most packages) but not in the t-SNE.
My points were areas, and the distances between them made little sense without prior scaling. Scaling took me from "random looking" to "makes sense".
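
The effect of a missing scaling step on distances can be seen on a small example. This is a minimal sketch with hypothetical toy data (two features named `area` and `ratio` on very different scales); the helper `share` is also hypothetical. It only measures how much each feature contributes to the pairwise distances that t-SNE starts from.

```r
set.seed(7)
# Hypothetical toy data: two features on very different scales.
X <- cbind(area  = rnorm(100, mean = 5000, sd = 1500),
           ratio = rnorm(100, mean = 0.5,  sd = 0.1))

X_scaled <- scale(X)   # center each column and divide by its standard deviation

# Share of the total squared pairwise distance contributed by each feature
share <- function(M) {
  per_feature <- apply(M, 2, function(col) sum(dist(col)^2))
  per_feature / sum(per_feature)
}

share_raw    <- share(X)         # distances before scaling
share_scaled <- share(X_scaled)  # distances after scaling
print(round(rbind(raw = share_raw, scaled = share_scaled), 3))
```

With the question's data, the scaled matrix would be passed to `tsne()` in place of `mydata`. Note that base R's `prcomp` centers by default but only scales when called with `scale. = TRUE`, so it is worth checking which of the two steps a given PCA call actually performed.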

As RUser4512 said, you might also want to test your parameters. On his website, the author of t-SNE recommends a perplexity between 5 and 50 (yours seems quite small). He also warns that too large a perplexity will give you an almost homogeneous sphere of points (which is good to know).

Distill has a very nice article with some interactive visualization that really helps to understand the impact of the parameters.

score 0 · Answer 4 · answered May 24 '18 at 07:17

An important difference between methods like PCA and SVD on the one hand and tSNE on the other is that tSNE uses a non-linear scale. This often makes for plots that are more visually balanced, but be careful about interpreting them in the same manner as you would for PCA. This difference likely accounts for the difference between the plots shown above.

See the following article for more detail on interpreting the non-linear scale of tSNE: https://distill.pub/2016/misread-tsne/ (Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016. http://doi.org/10.23915/distill.00002)

It is not unexpected that the tSNE data will mix up the "clusters", as they are not that distinct in the PCA data. For example, some points within clusters 2 and 4 are farther from their cluster centroid than the two clusters are from each other. You would get very different clustering results with a different k parameter. Unless you have a specific biological rationale for using 5 clusters, I would recommend using a graph-based or unsupervised hierarchical clustering approach.
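
As a rough illustration of the hierarchical alternative, the sketch below builds one tree with base R's `hclust` and cuts it at several values of k, cross-tabulating the result against kmeans at the same k. The toy data and the choice of average linkage are assumptions made for the example, not a recommendation specific to tetranucleotide frequencies.

```r
set.seed(3)
# Hypothetical toy data: three loose groups in 10 dimensions.
X <- rbind(matrix(rnorm(50 * 10, mean = 0), ncol = 10),
           matrix(rnorm(50 * 10, mean = 3), ncol = 10),
           matrix(rnorm(50 * 10, mean = 6), ncol = 10))

hc <- hclust(dist(X), method = "average")   # agglomerative, average linkage

# Cut the same tree at several k and compare against kmeans labels
for (k in 2:5) {
  hc_labels <- cutree(hc, k = k)
  km_labels <- kmeans(X, centers = k, nstart = 10)$cluster
  cat("k =", k, "\n")
  print(table(hclust = hc_labels, kmeans = km_labels))
}

labels3 <- cutree(hc, k = 3)
```

Because the tree is built once and only the cut changes, it is easy to see which groups stay together across values of k and which ones exist only for a particular k.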
