
Are PCA components (in principal component analysis) statistically independent if our data is multivariate normally distributed? If so, how can this be demonstrated/proven?

I ask because I saw this post, where the top answer states:

PCA does not make an explicit Gaussianity assumption. It finds the eigenvectors that maximize the variance explained in the data. The orthogonality of the principal components means that it finds the most uncorrelated components to explain as much variation in the data as possible. For multivariate gaussian distributions, zero correlation between components implies independence which is not true for most distributions.

The answer is stated without a proof, and seems to imply that PCA produces independent components if the data is multivariate normal.

Specifically, say our data are samples from:

$$\mathbf{x} \sim \mathcal N(\mathbf{\mu}, \mathbf{\Sigma})$$

We put the $n$ samples of $\mathbf{x}$ into the rows of our sample matrix $\mathbf{X}$, so $\mathbf{X}$ is $n \times m$. Computing the SVD of $\mathbf{X}$ (after centering) yields

$$\mathbf{X} = \mathbf{USV}^{T}$$

Can we say that the columns of $\mathbf{U}$ are statistically independent, and likewise the rows of $\mathbf{V}^T$? Is this true in general, true only for $\mathbf{x} \sim \mathcal N(\mathbf{\mu}, \mathbf{\Sigma})$, or not true at all?
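For concreteness, here is a minimal R sketch of the setup described above; the sample size, dimensionality, and covariance matrix are arbitrary placeholders:

```r
library(mvtnorm)                              # rmvnorm

set.seed(1)
n <- 100; m <- 3                              # hypothetical n and m
Sigma <- crossprod(matrix(rnorm(m * m), m))   # some positive-definite covariance
X <- rmvnorm(n, mean = rep(0, m), sigma = Sigma)

Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns
sv <- svd(Xc)                                 # Xc = U S V^T
U <- sv$u; S <- diag(sv$d); V <- sv$v

max(abs(Xc - U %*% S %*% t(V)))               # reconstruction error ~ machine precision
```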

bill_e
  • http://stats.stackexchange.com/q/110508/3277 is a similar question. – ttnphns Feb 18 '15 at 16:58
  • I don't see how PCs could possibly be considered "statistically independent" in more than one dimension. After all, by definition each one is orthogonal to all the others; this *functional dependency* creates a very strong statistical dependency. – whuber Feb 18 '15 at 19:44
  • @whuber: I rewrote the update to my answer that is trying to address your concern. I am wondering whether you find it satisfactory. My new understanding is that it only makes sense to talk about statistical independence of *random variables*; given the data $\mathbf X$, its PCs (columns of $\mathbf U$) are not random variables, so even to talk about their "independence" is strictly speaking problematic. If we talk about random variables instead (see my answer), then the whole issue vanishes, as I hope you will agree. Thanks a lot for bringing this issue up! – amoeba Feb 20 '15 at 14:40
  • @amoeba You are correct that statistical independence can refer only to probability measures or random variables. We usually model the data $X$ with random variables, which is implied by the use of "statistically independent" in this question. Because $U$ is computed from $X$, it is *a fortiori* a random variable, too. It therefore is meaningful to assess the independence, or lack thereof, of the columns (or even components) of $U$. The very definition of $U$--the part that says the dot products of distinct columns are zero--shows its columns are not independent. – whuber Feb 20 '15 at 14:46
  • @amoeba (continued). The reference to "uncorrelated does not mean independent" thoroughly confuses the issue by conflating *perpendicularity* of vectors with *independence* of random variables. Although these are related mathematical concepts, their meanings in this context are so completely different that I cannot help being unhappy with the potential confusion your answer will create. – whuber Feb 20 '15 at 14:48
  • @whuber: I appreciate your replies (and am sorry for taking your time), but am now thoroughly confused about what you mean. Let me ask a clarifying question. Do you agree with the following? If not, then where is the mistake? (1) Let random vector $\vec X \sim \mathcal N(0,\Sigma)$. (2) Let $V$ be the matrix with eigenvectors of $\Sigma$ in columns. Consider random vector $\vec Y=V^\top \vec X$. (3) Then $\vec Y \sim \mathcal N(0, \mathrm{diag}(\sigma^2_i))$. (4) The elements $Y_i$ of $\vec Y$ are uncorrelated. (5) The elements $Y_i$ are statistically independent. – amoeba Feb 20 '15 at 15:35
  • @Amoeba (2) does not correctly model what happens: PCA (and SVD) are performed on the *sample* covariance matrix; $\Sigma$ itself is unknown! – whuber Feb 20 '15 at 16:35
  • @whuber: Yes, but how else do you suggest to *formulate* the question about statistical independence? Let $X \sim \mathcal N(0, \Sigma)$. Let us observe $n$ samples, with sample covariance $\hat \Sigma$. What then? Let $\hat V$ be the matrix of eigenvectors of $\hat \Sigma$. We can consider random vector $Y=\hat V^\top X$, but then $Y_i$, as random variables, will not even be uncorrelated because $\hat V \ne V$. Of course the sample of $n$ observations of $Y_i$ has sample correlations zero, but what is then the question of statistical independence about? Sample cannot be independent. – amoeba Feb 20 '15 at 20:22
  • @amoeba The question has already been formulated; we are only trying to answer it. Your "of course" is incorrect. I invite you to run a simulation. In this `R` example you can study the relationship between $u_{11}$ and $u_{22}$, but simply change `stat` to study any other bivariate relations in the SVD. `library(mvtnorm); stat ...` – whuber Feb 20 '15 at 20:36
  • @whuber: I don't think I understand how exactly you understand the question. Could you tell me what precisely are the random variables, statistical independence of which we are discussing? Your R simulation makes me think that perhaps you are talking about $n$-dimensional random variables $U_i$, which are columns of $U$. Is that so? But under this interpretation the answer to the question "Are PCA components uncorrelated?" is "No" (or rather "They are not scalar, but individual elements of them are not uncorrelated"). Isn't it strange? People usually say that PCA components are uncorrelated. – amoeba Feb 20 '15 at 22:29
  • @amoeba I hope I have been consistently clear as well as faithful to the question, which I find to be clearly stated and unambiguous: because the data $X$ are random, so are all the entries in $U$. I have applied the definition of statistical independence to them. That's all. Your problem appears to be that you are using the word "uncorrelated" in two very different senses without seemingly realizing it: by virtue of how the columns of $U$ are constructed, they are *geometrically orthogonal* as *vectors in $\mathbb{R}^n$*, but they are by no means independent random vectors! – whuber Feb 20 '15 at 22:52
  • (Continued) In another comment, the OP remarked "When I asked the question, my thought of statistical dependence was "if you know PC1, is it possible infer PC2?, etc."" We can look at the situation this way, too. If you know the first column of $U$, then start to build a basis of $\mathbb{R}^n$ using it as the first basis element. Prolong it to any orthogonal basis you wish. Then, in this basis, *none of the remaining columns of $U$ can have a nonzero first entry.* To this extent we *can* infer something about all the other columns: they are not independent of the first. – whuber Feb 20 '15 at 22:57
  • @whuber: I now see what you mean. However, I would still like to hear your answer to my last question. Namely: let's agree that "PCA components" (from the OP's title) means columns of $U$. They are random variables. Can I ask you a question: **Are PCA components uncorrelated?** Your R simulation demonstrates that their elements are in fact correlated. So your answer must be "No", right? How is that consistent with the [common assertion](https://www.google.com/search?q=pca+components+uncorrelated) (scroll through the google results if you like) that *PCA components are uncorrelated*? – amoeba Feb 20 '15 at 23:06
  • @whuber: [cont.] I perfectly well understand that they are not uncorrelated as random vectors, but orthogonal in $\mathbb R^n$! But do you really want to say that all the thousands of texts speaking about "uncorrelated PCA components" mistake geometric orthogonality with the lack of correlation? – amoeba Feb 20 '15 at 23:10
  • @amoeba You are right--the simulation pretty convincingly shows the correlation can be (strongly) nonzero. However, I am not disputing that "PCA components are uncorrelated" in the sense of "correlation"="orthogonal," nor am I saying any particular textbook is incorrect. My concern is that such a statement, *properly understood,* is so irrelevant to the question that all it can do (and has done) is sow extensive confusion in the present context. – whuber Feb 20 '15 at 23:11
  • @whuber: Okay, now that your position is entirely clarified, let me say that I disagree entirely on what is "a proper understanding" here. I think your interpretation of what are random variables of interest here (columns of $U$) is weird and leads to strange conclusions, as e.g. that the PCs are not uncorrelated. My answer is written with another interpretation in mind and I think that most people also mean another interpretation. Under *that* interpretation, each column of $U$ is not a vector random variable, but $n$ samples of a scalar random variable. – amoeba Feb 20 '15 at 23:21
  • @whuber: [cont.] Just to add a clarification - you said a couple of comments above that my "of course" was incorrect. In fact, under *my* interpretation (that I explicitly wrote down in that comments with formulas) it was perfectly correct. – amoeba Feb 20 '15 at 23:24
  • @whuber: I am working on rewriting my answer and hope that you will find the new version more satisfactory. In the meantime, though, I realized that your R simulation behaves in a mysterious way. I get positive correlation of around $0.2$ between $u_{11}$ and $u_{22}$ even if I increase $n$ to $n=10000$. But correlation between $u_{11}$ and $u_{32}$ or any other further value in the second column of $U$ is around zero; it's only $u_{11}$ and $u_{22}$ that are correlated! Moreover, this does not happen if dimensionality of $X$ is $3$ or higher. I can reproduce it in Matlab. Can you explain it? – amoeba Feb 24 '15 at 00:13
  • There is a chance that if you use PCA on data generated from an uncorrelated normal distribution and the number of observations is small, it will find some components which don't fit a single univariate normal distribution but lie in another direction; if there is a bunch of such vectors, some of them will be correlated, just as short runs of realisations of a univariate normal distribution can be. Whether PCA magnifies this effect, I don't know. – Qbik Feb 25 '15 at 10:14
  • @whuber and Andre5: I have rewritten my answer basically from scratch, trying to be as clear as I could about what caused the long debate between me and whuber. I am looking forward (with some trepidation) to your opinion, whuber. – amoeba Feb 27 '15 at 22:42
  • I haven't the time now to read it with the care it deserves, @amoeba, but do want to thank you for your efforts. I skipped down to the end to read any conclusions and was surprised to see an assertion that is obviously not generally true. For instance, when $U$ is a $2\times 2$ matrix, $u_{12}=\pm\sqrt{1-u_{11}^2}$ gives a very strong counterexample to your statement that "you can't infer anything": knowing $u_{11}$ gives you all the possible information about $u_{12}$, rather than none. This suggests that the first part of your answer may be problematic, too. – whuber Feb 27 '15 at 22:53
  • Excellent catch, @whuber! I was briefly hesitating when writing that sentence for exactly this reason, but I should have known that you would spot it right away. The problem with $2\times 2$ case is that unit-length constraints plus the orthogonality constraint make $U$ one-dimensional; it is a very special case. We should either agree to consider only $n>2$ where this reasoning does not work anymore, or (rather) analyze $US$ instead of $U$; it is actually columns of $US$ that are usually called "principal components" and they don't have a unit-length constraint. – amoeba Feb 27 '15 at 23:03
  • @amoeba Unfortunately, choosing $n\gt 2$ doesn't make the dependence go away: it only makes it harder to see. As an extreme example, when $u_{11}=\pm 1$, it is impossible for $u_{12}$ to be nonzero (because the inner product of the first two columns would equal $u_{11}u_{12}$, which must be zero). I agree there are merits in analyzing $US$, but the question does explicitly ask about "the columns of $U$" alone. – whuber Feb 27 '15 at 23:10
  • @whuber: Indeed. I need to think a bit more about it. Perhaps I will end up rewriting the whole thing for $US$. The question in its mathematical part does ask about columns of $U$ alone, yes, but it also explicitly asks (in the title and in the introduction) about "PCA components", so there is some inconsistency here. When people talk about "PC1" (or "PC1 scores"), they always refer to columns of $US$. – amoeba Feb 27 '15 at 23:19
  • I would advise caution about such a unilateral re-interpretation, @amoeba. The question is (unusually) clear about the focus on $U$. Generally, when confronted with an accurate mathematical description seemingly at odds with English language usage, you should go with the math every time, especially when it is offered (as it is here) as a specific clarification of the intended meaning. Perhaps Andre5 will weigh in on this: because nobody else has attempted to answer and you are willing to change yours, a substantial edit to the question would be acceptable. – whuber Feb 27 '15 at 23:27
  • First, thanks to both of you for spending so much time on this. I'd be happy to further edit the question to be more specific. I don't see the distinction between $U$ and $US$; the answer could apply to either, no? After all, $USV^T = X$, so we can get from $X$ to $U$ (or $US$, or some other combination) with a linear transformation, for instance $U = XVS^{-1}$. I guess the core of my question is whether the SVD can be used to produce a basis for a set of columns in $X$ (each a sample from a multivariate normal) that is not just orthonormal but also statistically independent. – bill_e Feb 28 '15 at 01:28
  • Sorry if my response is off the mark. I need to spend some time to fully understand the discussion that has happened in the comments! Also, feel free to suggest edits that can either clarify or improve (by making it "the right question") my original question. – bill_e Feb 28 '15 at 01:28
  • Andre5, honestly I am not exactly sure that the distinction between $U$ and $US$ is crucial here. I must say that I am amazed by how much more tricky this question turns out to be than I originally thought. That "PCA components of Gaussian data are independent" is a commonplace (google it if you want); [here is a figure](http://i.imgur.com/8uEoKTv.png) that I made to illustrate it (@whuber, take a look too). X and Y on the left are clearly dependent and on the right clearly (?) independent. But I am still struggling to formulate it precisely enough, as demonstrated by our Socratic dialog with @whuber. – amoeba Feb 28 '15 at 15:37
  • @whuber, I am sure you have been looking forward to yet another edition of my answer! Here it is. I explicitly acknowledge your points about dependency, and make a statement that the columns of $U$ are *asymptotically* independent, which is my main point. Here "asymptotically" refers to the number $n$ of observations (rows). I very much hope we will be able to agree on that! I also argue that for any reasonable $n$, such as $n=100$, the dependence between columns is "practically irrelevant". This I guess is a more contentious point, but I try to make it reasonably precise in my answer. – amoeba Mar 03 '15 at 23:07

1 Answer


I will start with an intuitive demonstration.

I generated $n=100$ observations (a) from a strongly non-Gaussian 2D distribution, and (b) from a 2D Gaussian distribution. In both cases I centered the data and performed the singular value decomposition $\mathbf X=\mathbf{USV}^\top$. Then for each case I made a scatter plot of the first two columns of $\mathbf U$, one against the other. Note that it is usually the columns of $\mathbf{US}$ that are called "principal components" (PCs); the columns of $\mathbf U$ are PCs scaled to have unit norm; still, in this answer I am focusing on the columns of $\mathbf U$. Here are the scatter-plots:

PCA of Gaussian and non-Gaussian data
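Here is a rough R sketch of how such a figure can be produced; the particular non-Gaussian distribution (a noisy parabola) and the covariance matrix below are my own illustrative choices and need not match the ones used for the figure above:

```r
library(mvtnorm)

set.seed(42)
n <- 100

# (a) strongly non-Gaussian 2D data: a noisy parabola (illustrative choice)
z <- rnorm(n)
A <- cbind(z, z^2 + rnorm(n, sd = 0.3))

# (b) 2D Gaussian data with correlated components (illustrative covariance)
B <- rmvnorm(n, sigma = matrix(c(2, 1, 1, 1), 2))

# unit-norm principal components: columns of U from the SVD of the centered data
pcs <- function(X) svd(scale(X, center = TRUE, scale = FALSE))$u

par(mfrow = c(1, 2))
plot(pcs(A), xlab = "u1", ylab = "u2", main = "non-Gaussian data")
plot(pcs(B), xlab = "u1", ylab = "u2", main = "Gaussian data")
```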

I think that statements such as "PCA components are uncorrelated" or "PCA components are dependent/independent" are usually made about one specific sample matrix $\mathbf X$ and refer to the correlations/dependencies between columns, computed across the $n$ rows (see e.g. @ttnphns's answer here). PCA yields a transformed data matrix $\mathbf U$, where rows are observations and columns are PC variables. I.e. we can see $\mathbf U$ as a sample, and ask what the sample correlation between PC variables is. This sample correlation matrix is of course given by $\mathbf U^\top \mathbf U=\mathbf I$, meaning that the sample correlations between PC variables are zero. This is what people mean when they say that "PCA diagonalizes the covariance matrix", etc.
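A quick numerical check, continuing the sketch above (using the illustrative Gaussian sample `B`): the columns of $\mathbf U$ are orthonormal and, because $\mathbf X$ was centered, have zero mean, so the sample correlations between the PC variables are zero up to floating-point error.

```r
U <- pcs(B)                # from the sketch above
round(crossprod(U), 12)    # t(U) %*% U = identity
round(cor(U), 12)          # sample correlation matrix of the PC variables = identity
```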

Conclusion 1: in PCA coordinates, any data have zero correlation.

This is true for both scatterplots above. However, it is immediately obvious that the two PC variables $x$ and $y$ on the left (non-Gaussian) scatterplot are not independent; even though they have zero correlation, they are strongly dependent and in fact related by $y\approx a(x-b)^2$. And indeed, it is well known that uncorrelated does not mean independent.

By contrast, the two PC variables $x$ and $y$ on the right (Gaussian) scatterplot seem to be "pretty much independent". Computing the mutual information between them (which is a measure of statistical dependence: independent variables have zero mutual information) by any standard algorithm will yield a value very close to zero. It will not be exactly zero, because it is never exactly zero for any finite sample size (unless fine-tuned); moreover, there are various methods to compute the mutual information of two samples, giving slightly different answers. But we can expect that any method will yield an estimate of mutual information that is very close to zero.
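One crude way to check this numerically is a plug-in (histogram-based) estimate of mutual information; this is only one of the many estimators alluded to above, its value depends on the binning, and it is biased upward for finite samples. A sketch, reusing `pcs()` and the samples `A` and `B` from above:

```r
# naive binned mutual information estimate (in nats), equal-frequency bins
mi_binned <- function(x, y, bins = 8) {
  qx <- cut(x, quantile(x, seq(0, 1, length.out = bins + 1)), include.lowest = TRUE)
  qy <- cut(y, quantile(y, seq(0, 1, length.out = bins + 1)), include.lowest = TRUE)
  p  <- table(qx, qy) / length(x)        # joint cell probabilities
  px <- rowSums(p); py <- colSums(p)     # marginal probabilities
  sum(p * log(p / outer(px, py)), na.rm = TRUE)
}

Ua <- pcs(A); Ub <- pcs(B)
mi_binned(Ua[, 1], Ua[, 2])              # clearly positive: quadratic dependence
mi_binned(Ub[, 1], Ub[, 2])              # much smaller for the Gaussian data
```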

Conclusion 2: in PCA coordinates, Gaussian data are "pretty much independent", meaning that standard estimates of dependency will be around zero.

The question, however, is more tricky, as shown by the long chain of comments. Indeed, @whuber rightly points out that PCA variables $x$ and $y$ (columns of $\mathbf U$) must be statistically dependent: the columns have to be of unit length and have to be orthogonal, and this introduces a dependency. E.g. if some value in the first column is equal to $1$, then the corresponding value in the second column must be $0$.

This is true, but it is only practically relevant for very small $n$, such as $n=3$ (with $n=2$ after centering there is only one PC). For any reasonable sample size, such as the $n=100$ shown in my figure above, the effect of the dependency will be negligible; the columns of $\mathbf U$ are (scaled) projections of Gaussian data, so they are also Gaussian, which makes it practically impossible for one value to be close to $1$ (this would require all other $n-1$ elements to be close to $0$, which is hardly a Gaussian distribution).
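To get a feel for why this is negligible at moderate $n$: the orthonormality constraint only "bites" when a single entry of a column of $\mathbf U$ takes up a large share of its unit norm, which essentially never happens for Gaussian data with $n$ on the order of $100$. A quick simulation sketch (illustrative covariance again):

```r
library(mvtnorm)

set.seed(3)
n <- 100
Sigma <- matrix(c(2, 1, 1, 1), 2)

# largest absolute entry of the first column of U, over repeated draws of X
max_entry <- replicate(1000, {
  U <- svd(scale(rmvnorm(n, sigma = Sigma), scale = FALSE))$u
  max(abs(U[, 1]))
})
summary(max_entry)   # typically around 0.3 here, nowhere near 1
```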

Conclusion 3: strictly speaking, for any finite $n$, Gaussian data in PCA coordinates are dependent; however, this dependency is practically irrelevant for any $n\gg 1$.

We can make this precise by considering what happens in the limit of $n \to \infty$. In the limit of infinite sample size, the sample covariance matrix is equal to the population covariance matrix $\mathbf \Sigma$. So if the data vector is sampled from $\vec X \sim \mathcal N(0,\boldsymbol \Sigma)$, then the PC variables are $\vec Y = \boldsymbol\Lambda^{-1/2}\mathbf V^\top \vec X/\sqrt{n-1}$ (where $\boldsymbol\Lambda$ is the diagonal matrix of eigenvalues of $\boldsymbol \Sigma$ and $\mathbf V$ the matrix of its eigenvectors) and $\vec Y \sim \mathcal N(0, \mathbf I/(n-1))$. I.e. the PC variables come from a multivariate Gaussian with diagonal covariance. But any multivariate Gaussian with a diagonal covariance matrix decomposes into a product of univariate Gaussians, and this is the definition of statistical independence:

\begin{align} \mathcal N(\mathbf 0,\mathrm{diag}(\sigma^2_i)) &= \frac{1}{(2\pi)^{k/2} \det(\mathrm{diag}(\sigma^2_i))^{1/2}} \exp\left[-\mathbf x^\top \mathrm{diag}(\sigma^2_i)^{-1} \mathbf x/2\right]\\&=\frac{1}{(2\pi)^{k/2} (\prod_{i=1}^k \sigma_i^2)^{1/2}} \exp\left[-\sum_{i=1}^k x_i^2/(2\sigma^2_i)\right] \\&=\prod_{i=1}^k\frac{1}{(2\pi)^{1/2}\sigma_i} \exp\left[-x^2_i/(2\sigma_i^2)\right] \\&= \prod_{i=1}^k \mathcal N(0,\sigma^2_i). \end{align}

Conclusion 4: asymptotically ($n \to \infty$) the PC variables of Gaussian data are statistically independent as random variables, and the sample mutual information will approach the population value of zero.
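As a quick sanity check of the asymptotic claim: as $n$ grows, the sample eigenvectors converge to the population eigenvectors of $\boldsymbol\Sigma$, so the PC variables approach the truly independent coordinates $\mathbf V^\top \vec X$. A sketch with an illustrative 2D covariance:

```r
library(mvtnorm)

set.seed(7)
Sigma <- matrix(c(2, 1, 1, 1), 2)
V_pop <- eigen(Sigma)$vectors                  # population eigenvectors

alignment <- sapply(c(100, 1000, 10000, 1e5), function(n) {
  V_hat <- svd(scale(rmvnorm(n, sigma = Sigma), scale = FALSE))$v   # sample eigenvectors
  abs(sum(V_hat[, 1] * V_pop[, 1]))            # |cosine| between leading eigenvectors -> 1
})
alignment
```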

I should note that it is possible to understand this question differently (see comments by @whuber): to consider the whole matrix $\mathbf U$ a random variable (obtained from the random matrix $\mathbf X$ via a specific operation) and ask if any two specific elements $U_{ij}$ and $U_{kl}$ from two different columns are statistically independent across different draws of $\mathbf X$. We explored this question in this later thread.
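For completeness, here is a sketch of the kind of simulation discussed in the comments (my own code, not the truncated `R` snippet quoted above): treat individual entries of $\mathbf U$ as random variables across repeated draws of $\mathbf X$ and look at, say, $u_{11}$ versus $u_{22}$; the comment thread reports a correlation of roughly $0.2$ in this 2D Gaussian setting.

```r
library(mvtnorm)

set.seed(17)
n <- 100
Sigma <- matrix(c(2, 1, 1, 1), 2)

sims <- t(replicate(5000, {
  U <- svd(scale(rmvnorm(n, sigma = Sigma), scale = FALSE))$u
  c(u11 = U[1, 1], u22 = U[2, 2])
}))
cor(sims)   # off-diagonal entry: dependence between individual elements of U
```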


Here are all four interim conclusions from above:

  • In PCA coordinates, any data have zero correlation.
  • In PCA coordinates, Gaussian data are "pretty much independent", meaning that standard estimates of dependency will be around zero.
  • Strictly speaking, for any finite $n$, Gaussian data in PCA coordinates are dependent; however, this dependency is practically irrelevant for any $n\gg 1$.
  • Asymptotically ($n \to \infty$) the PC variables of Gaussian data are statistically independent as random variables, and the sample mutual information will approach the population value of zero.
amoeba
  • You write "However, if the data are multivariate Gaussian, then they are indeed independent". 'They' being the principal components, and their coefficients? What do you mean by PCA diagonalizes the covariance matrix? Thank you for your response! – bill_e Feb 18 '15 at 16:20
  • "They" refers to principal components (which are projections of the data on the directions of maximal variance). PCA looks for directions of maximal variance; turns out that these directions are given by the eigenvectors of the covariance matrix. If you change the coordinates to the "PCA coordinates", then the covariance matrix will be diagonal, that is how eigendecomposition works. Equivalently, matrix $S$ in the SVD from your question is a diagonal matrix. Also, matrix $U$ is orthogonal, meaning that its covariance matrix is diagonal. All of that means that PCs have correlation zero. – amoeba Feb 18 '15 at 16:29
  • Cool, thank you! The combination of your answer and this comment helps clear things up for me a lot. Can I edit your comment into your answer? – bill_e Feb 18 '15 at 16:33
  • I expanded the answer by incorporating the comment; see if you are happy with it now. – amoeba Feb 18 '15 at 16:48
  • For sure, I think it makes the series of steps to the conclusion very clear. Thank you! – bill_e Feb 18 '15 at 16:49
  • I cannot quite figure out what you mean by PCs being "independent." Consider the situation where you are performing PCA of bivariate samples from some continuous distribution. Almost surely the two PCs are determined by an angle $0\lt\theta\lt \pi$; they are given by $(\cos(\theta),\sin(\theta))$ and $(-\sin(\theta),\cos(\theta))$. Since the second is always a 90 degree rotation of the first, they are *perfectly dependent,* not independent! I suspect your argument might be confounding two very different senses of "correlation": that of random variables and that of a set of multivariate data. – whuber Feb 18 '15 at 19:42
  • @whuber: (1) It seems that the confusion is mostly due to terminology. What I call "principal components" are projections of the data onto the covariance eigenvectors; they have $n$ points each. In your example, 2D covariance eigenvectors I would call "principal axes"; in my terminology, they are not "PCs". The eigenvectors are of course not independent, you are right! This is a good point. But the PCs are. (2) Still, the PCs have exactly zero sample correlation, so perhaps you would still say that they have functional dependence? It's very "weak" though, and goes to zero with $n\to \infty$. – amoeba Feb 18 '15 at 21:43
  • What *exactly* do you mean by the "covariance eigenvectors"? Are they parameters of the multinormal distribution or are they the actual estimates, using SVD, based on the *data*? According to the question I would understand them as the latter--but then it is not at all clear the projections are independent. The functional dependence among these eigendirections is *extremely strong*: despite being described by $np$ numbers (for $n$ observations of a $p$-variate distribution), they only lie within a $\binom{p}{2}$-dimensional submanifold. – whuber Feb 18 '15 at 22:08
  • @whuber, I am talking about the sample estimates. But the question was not about the eigenvectors of the sample covariance matrix, it was about *the principal components*, which are the projections on these eigenvectors. Each of them has $n \gg 2$ numbers. Take the first two PCs. The dependency you are talking about is that the sample correlation between them is zero; this is one constraint for $2n$ numbers. How is this "extremely strong"? It looks weak to me, and it practically disappears with large $n$. – amoeba Feb 18 '15 at 22:20
  • The question explicitly is about the columns of $U$. As $n$ grows large, the number of columns remains the same--and they remain orthogonal. Apply the definition of dependence to columns $U_i$ and $U_j$, $j\ne i$: independence implies that for any measurable $E\times F\subset\mathbb{R}^n\times\mathbb{R}^n$, $\Pr((U_i,U_j)\in E\times F)=\Pr(U_i\in E)\Pr(U_j \in F)$. It is easy to find $E$ and $F$ for which both right-hand probabilities are positive but the left-hand one is zero, *regardless* of how large $n$ might be. – whuber Feb 18 '15 at 22:27
  • @whuber: Okay, so you insist that no two data vectors that have precisely zero sample correlation can be independent, because zero correlation *is* a dependency. I guess you are right. I informally call it "weak" to mean that the two PCs cannot e.g. be quadratically related to each other! That would be a "strong" dependency. In reality, they are only dependent because their correlation is precisely zero. I am sure you can see my point, but I admit that it is informal. – amoeba Feb 18 '15 at 22:35
  • I am not talking about sample correlation at all! Nor am I "insisting" on anything. It is an unhappy fact that mathematics and the basic definitions force upon us the conclusion that there must be something fundamentally wrong with this answer. – whuber Feb 18 '15 at 22:38
  • @whuber: Wait -- you are not talking about sample correlation? Perhaps I misunderstood all along. You were saying that columns of $U$ are orthogonal (and they are), but this is equivalent to saying that columns of $U$, taken as PCs, have zero correlation between each other. – amoeba Feb 18 '15 at 22:41
  • I invoke the orthogonality of the columns only because that immediately implies their *lack* of independence. If you generate a pair of $n$-vectors in any way whatsoever, then *by definition* that pair is independent when knowing the first lies in some event $E$ gives no additional information about whether the second lies in some event $F$. The orthogonality of the columns easily and intuitively implies this is *not* the case. As soon as you know the first column, all subsequent columns must lie in the orthogonal hyperplane--which under independence has *zero* chance of happening. – whuber Feb 18 '15 at 22:52
  • @whuber, I see your point. I made an update to my answer in response. The actual reply is in the last two paragraphs. Let me know what you think. – amoeba Feb 18 '15 at 22:58
  • Interesting discussion! When I asked the question, my thought of statistical dependence was "if you know PC1, is it possible to infer PC2?", etc. I will look more into independence tests based on mutual information now. – bill_e Feb 19 '15 at 18:48