Density-lines in PCA plot?

Question

Short version: Someone wants me to draw "density-lines" onto a PCA plot for every point on the plot, which does not make sense to me. PCA transforms the high-dimensional data to principal components. How am I supposed to add a "density" for every point on a plot showing the two most relevant principal components?

Longer version:

I have high-dimensional data: gene expression data, to be precise (about n = 22,000 dimensions). The data come from k biological samples, so every sample has a value for each of the roughly 22,000 dimensions.
Now I calculate the principal components and show the two principal components which explain most of the variation. The k samples then fall on the 2D plane spanned by these two principal components, each at the position given by its scores.
My collaborator now wants "density-curves" on the PCA plot for every one of the k points. Which does not work for a PCA plot, right? We reduce the n-dimensional space to fewer dimensions and then project the data onto these (in my case) two principal components. So in my understanding density lines for the reduced data points do not make sense, do you agree?

Example -- Let's assume a PCA plot as in the first figure here: https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html Would you say that some sort of density lines for every point reflecting the underlying data does make sense in this case?

This sounds like a request for a density estimate, such as a KDE, rather than for portraying some kind of confidence regions. — whuber, Oct 21 '21 at 16:33
Yes, agree. But still, this does not make sense in the context of a PCA plot, right? — Michael, Oct 22 '21 at 09:48
You have made it clear that this is a density estimate: but what would be the matter with that? Because you object that it "goes against the idea," could you indicate which "idea" you have in mind and how a density estimate would be inconsistent with that idea? — whuber, Oct 22 '21 at 16:40
Sorry for not being precise. Let me try to clarify: My initial question is whether it makes sense to draw density lines on a PCA plot. I do not think so but am not 100% sure. This is my main question. My follow-up question is whether there is some other statistic to show that the PCA (and the clustering of the samples on the PCA plot) is "trustworthy". — Michael, Oct 25 '21 at 08:25
Please consider that the density has to be drawn for every point on the PCA plot, I guess this fact caused the confusion. — Michael, Oct 25 '21 at 10:22
@Aksakal So given that we understand each other correctly; you agree that this is -- mildly speaking -- nonsense? — Michael, Oct 25 '21 at 11:27
Yes, if it is indeed the way you described it. Maybe she had an idea that she didn't think through, and a whiteboard session would resolve the issue with her. — Aksakal, Oct 25 '21 at 11:55
Yes, but first wanted to investigate if there is anything wrong on my side. Thanks! — Michael, Oct 25 '21 at 11:58
The density contours still make sense--it's just the request to draw the contours through each point that is nonsensical (and practically useless). — whuber, Oct 25 '21 at 13:22

Sextus Empiricus · Answer 1 · 2021-10-25T11:23:45.710


The data points are, after PCA transformation, distributed on the plane of the principal component axes.

Under that interpretation you can plot a density (i.e. the probability density for a point to have a certain score on the principal components).
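To make this concrete, here is a minimal sketch in Python with NumPy (not the R setup from the question). It computes the scores on the first two principal components and evaluates a Gaussian kernel density estimate of those scores on a grid. The simulated data, the helper names `pca_scores` and `kde_on_grid`, and the bandwidth rule are all assumptions made for illustration only.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return the scores of the rows of X on the leading principal components."""
    Xc = X - X.mean(axis=0)                      # center every variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # shape (k, n_components)

def kde_on_grid(points, grid_x, grid_y, bandwidth):
    """Evaluate an isotropic Gaussian KDE of 2-D points on a rectangular grid."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    grid = np.column_stack([gx.ravel(), gy.ravel()])             # (G, 2)
    # Squared distance from every grid node to every data point: (G, k)
    sq_dist = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-0.5 * sq_dist / bandwidth**2) / (2 * np.pi * bandwidth**2)
    return kernel.mean(axis=1).reshape(gx.shape)   # (len(grid_y), len(grid_x))

# Hypothetical stand-in data: k = 40 samples, n = 500 variables, two groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
X[:20, :50] += 2.0

scores = pca_scores(X)
bandwidth = 0.3 * scores.std(axis=0).mean()      # ad hoc choice (assumption)
pad = 3 * bandwidth
grid_x = np.linspace(scores[:, 0].min() - pad, scores[:, 0].max() + pad, 60)
grid_y = np.linspace(scores[:, 1].min() - pad, scores[:, 1].max() + pad, 60)
density = kde_on_grid(scores, grid_x, grid_y, bandwidth)
```

The array `density` is one surface for the whole cloud of points, not one curve per point. Its contour lines could be drawn with, for example, matplotlib's `contour(grid_x, grid_y, density)` on top of the scatter plot of `scores`.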

An image search on the internet turns up quite a few examples: https://www.google.ch/search?q=pca+density+plot&source=lnms&tbm=isch

And in relation to your example with the iris dataset

https://www.google.ch/search?q=pca+density+iris&source=lnms&tbm=isch

I am linking to examples on Google because I currently have no R available to make a graph (maybe I will make an example myself later). Below is an example from Fisher's article on the Iris data set. It is LDA rather than PCA, and it shows histograms/distributions rather than density lines. The principle is the same, however: the data are projected onto a space of lower dimension (in this case one dimension) and the density distribution is displayed on that projected space.

"The use of multiple measurements in taxonomic problems", Annals of Eugenics, Vol. VII, Pt. II, pp. 179-188, 1936

example of distribution for projection of iris data

Marginal distribution

When you have multivariate data, where the multiple variables are distributed according to some joint distribution, then you can still think of the distribution of a single variable. This is also called a marginal distribution.

The same holds for PCA and the distribution of a principal component. You can view this as the shadow of a multidimensional joint distribution being projected onto the lower-dimensional space. The distribution of the data in this lower-dimensional space is like a marginal distribution.

See below an example of this projection. The data are distributed in 3 dimensions (according to some density), but you can project the points into a lower-dimensional space and then think about the density in that space.

Images from this question: Interpreting PCA figures in layman terms
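The "shadow" idea can also be checked numerically. The following is a small sketch in Python with NumPy, under assumptions of my own: a 3-dimensional Gaussian with an arbitrary covariance matrix `Sigma`, and a fixed 2-dimensional plane with orthonormal basis `W` that is chosen independently of the data (so it is not a PCA plane). For a Gaussian, the distribution of the projected points is again Gaussian with covariance `W.T @ Sigma @ W`, and the empirical covariance of the projected points should be close to that.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-D covariance matrix (positive definite), for illustration.
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=50_000)    # (50000, 3)

# Orthonormal basis of an arbitrary fixed plane, obtained by QR decomposition.
W, _ = np.linalg.qr(np.array([[1.0, 0.0],
                              [1.0, 1.0],
                              [0.0, 1.0]]))                      # (3, 2)

Y = X @ W                                # coordinates of the projected points
empirical_cov = np.cov(Y, rowvar=False)  # covariance seen in the 2-D "shadow"
analytic_cov = W.T @ Sigma @ W           # covariance of the projected Gaussian

print(np.round(empirical_cov, 2))
print(np.round(analytic_cov, 2))
```

Note that the plane here is fixed in advance. With PCA the plane is itself estimated from the data, which adds some uncertainty to any density estimated on it.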

edited Oct 25 '21 at 11:23

answered Oct 25 '21 at 09:47

Sextus Empiricus


Please consider the fact that the density has to be drawn for every data point on the PCA plot. This was not accurately described by me initially. Does your reply still hold in this case? – Michael Oct 25 '21 at 10:24
@Michael Is the problem about how to make density plots, or about the issue with density plots in connection with PCA? How would you make a density plot if your data did not have 22000 dimensions, but instead only 2, and there was no PCA involved? Would there still be problems? – Sextus Empiricus Oct 25 '21 at 10:44
You write "Which does not work for a PCA plot, right? ". Maybe I do not completely understand what you are doing, but could you explain it by saying how you see it work for a non-PCA plot? – Sextus Empiricus Oct 25 '21 at 10:53
If I had only 2 dimensions initially (instead of the 22'000), then a density plot would NOT make sense in my case. – Michael Oct 25 '21 at 11:13
@Michael but do you have some example of the type of density plot that you are looking for? You write "which does not work for PCA, right?”. Currently I am stuck at understanding what sort of density lines you are thinking about and why they do not work *because* it is for PCA. So in order to get clear what the problem is with the situation not working for PCA, I am asking whether you have an example of the density-lines in a situation where they do work. If PCA is the problem, then how would it work if there was no PCA? – Sextus Empiricus Oct 25 '21 at 11:28

There is a subtle difference between the PCA situation and the marginal density analogy: the PCA subspace is determined by the data rather than specified independently. Thus, it's not quite correct to interpret the PCA of the points as being some kind of marginal empirical distribution. I believe the KDE could still be interpreted as an estimator of a marginal distribution, but we have to bear in mind that the KDE will have slightly more uncertainty than expected (due to the uncertainty in the estimated PCs). – whuber Oct 25 '21 at 13:21

Mike Holcomb · Answer 2 · 2021-10-28T22:01:10.913

It sounds like there are two problems: one of methodology and another of conflict resolution. My initial impression is to just "split the baby". At a fundamental level, the collaborator is just asking for a simpler way to visualize the results of the PCA, so what could be the harm in delivering a "prettier graph"? At the same time, everyone needs to be well aware that, as you rightly point out, no one should try to make any inferences from the resulting marginals.

To that end, one proposed way to demonstrate the issue to the collaborator is to run a sensitivity analysis on the data to illustrate how much potential there is for estimation error in the method they have requested. You can bootstrap several samples or add varying degrees of noise and inspect how different the plots are between the full dataset and the augmented ones.
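A rough sketch of the bootstrap variant of this sensitivity analysis, in Python with NumPy, might look as follows. The simulated data, the function names, and the choice of summary (the absolute cosine between each bootstrap axis and the corresponding full-data axis) are my own assumptions. Other summaries, such as overlaying the bootstrap score plots, would serve the same purpose.

```python
import numpy as np

def leading_axes(X, n_components=2):
    """Unit-length loading vectors of the leading principal components (rows)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components]

def bootstrap_axis_similarity(X, n_boot=100, n_components=2, seed=0):
    """Absolute cosine between full-data and bootstrap axes, per replicate."""
    rng = np.random.default_rng(seed)
    reference = leading_axes(X, n_components)
    k = X.shape[0]
    sims = np.empty((n_boot, n_components))
    for b in range(n_boot):
        idx = rng.integers(0, k, size=k)      # resample samples with replacement
        boot = leading_axes(X[idx], n_components)
        # The sign of a principal axis is arbitrary, hence the absolute value.
        sims[b] = np.abs(np.sum(reference * boot, axis=1))
    return sims

# Hypothetical stand-in data: 40 samples, 500 variables, two groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
X[:20, :50] += 3.0

sims = bootstrap_axis_similarity(X)
print("median |cos| for PC1, PC2:", np.median(sims, axis=0))
```

Values close to 1 indicate an axis that hardly moves under resampling. An axis with clearly lower values is driven mostly by noise, and a density drawn along it should be read with corresponding caution.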

Here is an article on unsupervised dimensionality reduction on gene expression data which found that PCA was relatively robust in this regard: Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data. That said, in Figure 6 of their paper they found that with a very low number of principal components, the issue you described can be very impactful.
