
I am confused about ZCA whitening and normal (PCA) whitening, which is obtained by dividing the principal components by the square roots of the PCA eigenvalues. As far as I know,

$$\mathbf x_\mathrm{ZCAwhite} = \mathbf U \mathbf x_\mathrm{PCAwhite},$$ where $\mathbf U$ are PCA eigenvectors.

What are the uses of ZCA whitening? What are the differences between normal whitening and ZCA whitening?

amoeba
RockTheStar

2 Answers


Let your (centered) data be stored in an $n\times d$ matrix $\mathbf X$ with $d$ features (variables) in columns and $n$ data points in rows. Let the covariance matrix $\mathbf C=\mathbf X^\top \mathbf X/n$ have eigenvectors in columns of $\mathbf E$ and eigenvalues on the diagonal of $\mathbf D$, so that $\mathbf C = \mathbf E \mathbf D \mathbf E^\top$.

Then what you call "normal" PCA whitening transformation is given by $\mathbf W_\mathrm{PCA} = \mathbf D^{-1/2} \mathbf E^\top$, see e.g. my answer in How to whiten the data using principal component analysis?
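
To make the notation concrete, here is a minimal NumPy sketch; it is only an illustration, and the toy data, the random seed and all variable names are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n = 1000 points, d = 3 correlated features, centered.
n, d = 1000, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # mix features to correlate them
X -= X.mean(axis=0)

# Covariance C = X^T X / n and its eigendecomposition C = E D E^T.
C = X.T @ X / n
eigvals, E = np.linalg.eigh(C)            # eigvals is the diagonal of D

# PCA whitening: W_PCA = D^{-1/2} E^T.
W_pca = np.diag(eigvals ** -0.5) @ E.T
X_pca_white = X @ W_pca.T

# The covariance of the whitened data is the identity (up to numerical precision).
print(np.allclose(X_pca_white.T @ X_pca_white / n, np.eye(d)))   # True
```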

However, this whitening transformation is not unique. Indeed, whitened data will stay whitened after any rotation, which means that any $\mathbf W = \mathbf R \mathbf W_\mathrm{PCA}$ with orthogonal matrix $\mathbf R$ will also be a whitening transformation. In what is called ZCA whitening, we take $\mathbf E$ (the matrix with the eigenvectors of the covariance matrix stacked as columns) as this orthogonal matrix, i.e. $$\mathbf W_\mathrm{ZCA} = \mathbf E \mathbf D^{-1/2} \mathbf E^\top = \mathbf C^{-1/2}.$$
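
Continuing the same sketch, ZCA whitening is just PCA whitening followed by one extra rotation with $\mathbf E$:

```python
# ZCA whitening: W_ZCA = E D^{-1/2} E^T = C^{-1/2}  (continuing the sketch above).
W_zca = E @ np.diag(eigvals ** -0.5) @ E.T
X_zca_white = X @ W_zca.T

# Still white ...
print(np.allclose(X_zca_white.T @ X_zca_white / n, np.eye(d)))   # True
# ... W_ZCA is the symmetric inverse square root of C ...
print(np.allclose(W_zca @ W_zca, np.linalg.inv(C)))              # True
# ... and ZCA-whitened data is just a rotation (by E) of PCA-whitened data.
print(np.allclose(X_zca_white, X_pca_white @ E.T))               # True
```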

One defining property of the ZCA transformation (sometimes also called "Mahalanobis transformation") is that it results in whitened data that is as close as possible to the original data (in the least squares sense). In other words, if you want to minimize $\|\mathbf X - \mathbf X \mathbf A^\top\|^2$ subject to $ \mathbf X \mathbf A^\top$ being whitened, then you should take $\mathbf A = \mathbf W_\mathrm{ZCA}$. Here is a 2D illustration:

PCA and ZCA whitening

The left subplot shows the data and its principal axes. Note the dark shading in the upper-right corner of the distribution: it marks its orientation. Rows of $\mathbf W_\mathrm{PCA}$ are shown in the second subplot: these are the vectors the data is projected on. After whitening (below) the distribution looks round, but notice that it also looks rotated: the dark corner is now on the east side, not on the north-east side. Rows of $\mathbf W_\mathrm{ZCA}$ are shown in the third subplot (note that they are not orthogonal!). After whitening (below) the distribution looks round and it is oriented in the same way as originally. Of course, one can get from PCA-whitened data to ZCA-whitened data by rotating with $\mathbf E$.
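
The least-squares property stated above can also be checked numerically, continuing the same toy sketch (the random rotations below are generated purely for comparison):

```python
# Among all whitening transforms W = R @ W_pca with R orthogonal,
# W_zca distorts the original data the least in the least-squares sense.
def distortion(W):
    return np.linalg.norm(X - X @ W.T) ** 2            # squared Frobenius norm

print(distortion(W_zca) <= distortion(W_pca))          # True
for _ in range(5):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))       # random orthogonal matrix
    print(distortion(W_zca) <= distortion(Q @ W_pca))  # True every time
```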

The term "ZCA" seems to have been introduced in Bell and Sejnowski 1996 in the context of independent component analysis, and stands for "zero-phase component analysis". See there for more details. Most probably, you came across this term in the context of image processing. It turns out, that when applied to a bunch of natural images (pixels as features, each image as a data point), principal axes look like Fourier components of increasing frequencies, see first column of their Figure 1 below. So they are very "global". On the other hand, rows of ZCA transformation look very "local", see the second column. This is precisely because ZCA tries to transform the data as little as possible, and so each row should better be close to one the original basis functions (which would be images with only one active pixel). And this is possible to achieve, because correlations in natural images are mostly very local (so de-correlation filters can also be local).

PCA and ZCA in Bell and Sejnowski 1996
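
The local-vs-global point can also be illustrated without natural images. The following self-contained sketch (my own, with arbitrary parameters; it is of course not the Bell and Sejnowski setup) uses 1-D "signals" with AR(1)-style local correlations as a stand-in: rows of the ZCA transform concentrate their energy near "their" pixel, whereas PCA rows spread over the whole signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 5000, 64, 0.7

# AR(1) "signals": neighbouring pixels strongly correlated, distant ones hardly at all.
X = np.empty((n, d))
X[:, 0] = rng.normal(size=n)
for j in range(1, d):
    X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.normal(size=n)
X -= X.mean(axis=0)

C = X.T @ X / n
eigvals, E = np.linalg.eigh(C)
W_pca = np.diag(eigvals ** -0.5) @ E.T
W_zca = E @ np.diag(eigvals ** -0.5) @ E.T

# Fraction of a filter's energy within +/- 3 pixels of position 32:
window = slice(32 - 3, 32 + 4)
zca_row = W_zca[32]     # the ZCA filter "responsible" for pixel 32
pca_row = W_pca[-1]     # the PCA filter for the largest eigenvalue
print(np.sum(zca_row[window]**2) / np.sum(zca_row**2))  # most of the energy: local
print(np.sum(pca_row[window]**2) / np.sum(pca_row**2))  # much smaller: global
```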

Update

More examples of ZCA filters and of images transformed with ZCA are given in Krizhevsky, 2009, Learning Multiple Layers of Features from Tiny Images, see also examples in @bayerj's answer (+1).

I think these examples give an idea as to when ZCA whitening might be preferable to the PCA one. Namely, ZCA-whitened images still resemble normal images, whereas PCA-whitened ones look nothing like normal images. This is probably important for algorithms like convolutional neural networks (as e.g. used in Krizhevsky's paper), which treat neighbouring pixels together and so greatly rely on the local properties of natural images. For most other machine learning algorithms it should be absolutely irrelevant whether the data is whitened with PCA or ZCA.

amoeba
  • Thanks! I have a question: does that mean that ZCA basically changes the axes but does not change the position of the data much (based on your shaded area)? Also, does that mean that whenever we do whitening, we should do ZCA whitening? How would we decide whether to use PCA whitening or ZCA whitening? – RockTheStar Oct 01 '14 at 18:12
  • (1) I am not exactly sure what you mean, but I would say it like that: ZCA stretches the dataset to make it spherical, but *tries not to rotate it* (whereas PCA does rotate it quite a lot). (2) I actually think that in most cases it does not matter if you use PCA or ZCA whitening. The only situation I can imagine where ZCA could be preferable, is pre-processing for convolutional neural networks. Please see an update to my answer. – amoeba Oct 01 '14 at 21:38
  • @amoeba What does it mean to take a matrix to the power of -1/2 ? – power Feb 06 '16 at 15:14
  • @power For the diagonal matrix ($\mathbf D$ in this answer) it simply means raising every diagonal element to the power $-1/2$. For a covariance matrix $\mathbf C = \mathbf E \mathbf D \mathbf E^\top$ we can define it as $\mathbf E \mathbf D^{-1/2} \mathbf E^\top$. – amoeba Feb 06 '16 at 15:21
  • I'm a bit puzzled by the last graphic. What does it mean? It's one thing to plot the first principal components, but after the ZCA whitening there are no "components"; you can just look at how an image looks after whitening. Or maybe after whitening keeping just the first few components. But anyway, you should compare this to the images reconstructed from PCA, not to the principal components themselves; that doesn't make sense. So what does it mean anyway?... – dividebyzero Feb 08 '17 at 19:14
  • @dividebyzero That's just rows of $W$ (each row, i.e. each PCA eigenvector or each ZCA projecting vector, is depicted as an image). You can say that there *are* "components" after ZCA, why not. – amoeba Feb 08 '17 at 19:39
  • The PCA is like making a Fourier transform, the ZCA is like transforming, multiplying and transforming back, applying a (zero-phase) linear filter. So what we see there is the filter impulse response at each pixel. The "components" involved in the operation are the same, the columns of E, which are the "principal components"... I mean, you can call the rows of W components too, but I think it is important to understand that the same "principal components" are involved, and when you apply the ZCA you are back at the original domain, while with the PCA you need to "reconstruct" the signal. – dividebyzero Feb 08 '17 at 22:20
  • @dividebyzero +1 to your last comment, I think this is a valuable perspective. In any case, I hope the meaning of my last figure (that is taken from the linked paper) is clear now. – amoeba Feb 08 '17 at 22:25
  • Sure. I had to go after the paper to really understand it. The original picture has a lot more information, so I advise anyone reading this to go after it!... Thanks for the great reference, btw. :) – dividebyzero Feb 09 '17 at 08:53
  • I'm not so sure about "ZCA whitened images resemble regular images; PCA whitened images don't". http://ufldl.stanford.edu/tutorial/unsupervised/ExercisePCAWhitening/ There we see that PCA images and ZCA images both look like MNIST numbers. I can in theory see what you're saying, but I guess I can't see it empirically. – learning Oct 07 '17 at 07:58
  • @learning You don't see PCA whitened images on that page! They show "PCA dimension-reduced images", i.e. *reconstructions* via PCA, but not PCA projections themselves. – amoeba Oct 07 '17 at 08:26
  • @amoeba How can one prove that ZCA is given by "minimization of $∥X−XA^\top∥^2$ subject to $XA^\top$ being whitened"? – user_anon Jul 20 '19 at 12:30
  • "whitened data will stay whitened after a rotation" why is this true? $Cov(RX) = RCov(X)R^T$ This does not have to be diagonal even if $Cov(X)$ is diagonal and $R$ is orthogonal – curiousgeorge Aug 31 '20 at 00:22
  • Ah I see it's because $Cov(X)$ is identity not just diagonal – curiousgeorge Aug 31 '20 at 00:29

Given an eigendecomposition of a covariance matrix $$ \bar{X}\bar{X}^T = LDL^T, $$ where $D = \text{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$ is the diagonal matrix of eigenvalues, ordinary whitening amounts to transforming the data into a space where the covariance matrix becomes the identity: $$\sqrt{D^{-1}}L^{-1}\bar{X}\bar{X}^TL^{-T}\sqrt{D^{-1}} = \sqrt{D^{-1}}L^{-1}LDL^TL^{-T}\sqrt{D^{-1}} = \mathbf{I} $$ (with some abuse of notation). That means we can whiten the data by transforming it according to $$ \tilde{X} = \sqrt{D^{-1}}L^{-1}X. $$

This is ordinary whitening with PCA. Now, ZCA does something different: it transforms the data back into the original space, adding a small $\epsilon$ to the eigenvalues to stabilize the inversion. $$ \tilde{X} = L\sqrt{(D + \epsilon I)^{-1}}L^{-1}X. $$ Here are some pictures from the CIFAR data set before and after ZCA.

Before ZCA:

before ZCA

After ZCA with $\epsilon = 0.0001$

after ZCA 1e-4

After ZCA with $\epsilon = 0.1$

after ZCA with .1

For vision data, high-frequency content typically resides in the directions spanned by the eigenvectors with the smallest eigenvalues. Hence ZCA is a way to strengthen these directions, leading to more visible edges etc.
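
For completeness, here is a self-contained NumPy sketch of the recipe above. It uses random toy data standing in for CIFAR and $\epsilon = 10^{-4}$ as in the first figure; note that $L^{-1} = L^\top$ because $L$ is orthogonal, and that dividing the covariance by $n$ only rescales the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for image data: d "pixels" in rows, n samples in columns,
# matching the X X^T convention of this answer (not the CIFAR pipeline).
d, n = 32, 2000
X = rng.normal(size=(d, d)) @ rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

C = X @ X.T / n                            # covariance, C = L D L^T
eigvals, L = np.linalg.eigh(C)

eps = 1e-4                                 # regularizer added to the eigenvalues
W_zca = L @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ L.T   # L (D + eps I)^{-1/2} L^T
X_zca = W_zca @ X

# Up to the eps regularization, the transformed covariance is the identity:
print(np.abs(X_zca @ X_zca.T / n - np.eye(d)).max())   # small; eps slightly shrinks
                                                       # the lowest-variance directions
```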

bayerj
  • Shouldn't the epsilon be added before taking inverse? I think it's simply added to stabilize the inversion in case of near-zero eigenvalues. So actually if it makes sense to add it for ZCA whitening, then it would make sense to add it for PCA whitening as well. – amoeba Oct 01 '14 at 16:19
  • Yes, before the inverse, thanks. Since this is typically done with SVD in practice, I don't know whether stabilizing the inversion is necessary at all. – bayerj Oct 01 '14 at 18:27
  • I have added another picture to show the effect. – bayerj Oct 01 '14 at 18:35
  • +1, but I have a number of further nitpicks and questions. (1) What I meant about epsilon is that it is not specific to ZCA, it can be used for PCA whitening as well. (2) I am not sure I understand your comment about SVD: SVD or not, one needs to invert singular values, hence the need for epsilon. (3) PCA whitening transformation is $D^{-1/2}L^\top$, you wrote it the other way round, and this makes the computation in the second formula wrong... (4) Nice figures, where are they from? (5) Do you know in which situations would ZCA whitening be preferable to PCA whitening, and why? – amoeba Oct 01 '14 at 19:31
    (1) agreed. I have no intuition about what that means, though. (2) My decomposition knowledge is incomplete here, but I assumed that a classical matrix inversion of a singular covariance matrix will fail, while SVD on a data matrix giving rise to a singular covariance will not. (3) Thanks, will fix it. (4) From my code :) (5) I hypothesize that for many algorithms that give overcomplete representations (e.g. GainShape K-Means, Auto encoders, RICA) and/or do a similar job to PCA, algebraic independence of the features hurts, but I have no hard knowledge about this. – bayerj Oct 01 '14 at 19:56
  • (1-2) Think about a matrix with one very small singular value. Its covariance matrix will have one very small eigenvalue. Whether you use SVD on X or EIG on XX', this value will have to be inverted -- largely amplifying the noise. I am pretty sure that this is the rationale behind epsilon. (4) Nice :) (5) Not sure I understand: what do you mean by "algebraic independence of the features" and why would PCA whitening result in it and ZCA whitening not? I have one other hypothesis though, I have just updated my answer with it. – amoeba Oct 01 '14 at 21:27
    (1-2) Thanks! (5) Algebraic independence of features is different from statistical independence, i.e. $w_i^Tw_j = 0$. Basically orthogonal, so what PCA does. Since you brought up the topic of deep nets: dropout is also known to work extremely badly on PCA'ed data; some feature redundancy is necessary. – bayerj Oct 02 '14 at 06:47
  • Sounds interesting, but I still do not get it: whether you use PCA or ZCA to whiten the data, it is going to be whitened in the end! Meaning that features will have zero correlation, i.e. no feature redundancy. If you say that features can be non-independent even if they have zero correlation, then it is of course correct, but I don't see how PCA and ZCA whitening differ in terms of nonlinear dependencies between features. After all, ZCA whitening is just a rotation of PCA, nothing more! – amoeba Oct 02 '14 at 09:22
    The regularization of denoising/contractive auto encoders and dropout is dependent on the axes of the input. Thus if you perturb the main principal component, a network with injected noise will not be able to recover from that, since that info is missing everywhere else. These algorithms are robust towards noise that is independent in each component by choosing to look at many different components which encode the same information, but with differing noise. This assumption is valid in many sensory fields, such as audio and vision. – bayerj Oct 02 '14 at 09:51