
I have an understanding of PCA and ZCA, and I have read a similar question on the subject which, unfortunately, does not contain the specific answer to my question.

I understand the benefits of data whitening: specifically, standardizing the dynamic range of each data feature, which is very important when using stochastic gradient descent (SGD). What I fail to understand is opting to use ZCA and thereby forgoing the benefit of having de-correlated features.

I understand that it is more appealing to the human eye, but aren't we making it harder for the learning algorithm to generalize from the data?

rhadar
  • "foregoing the benefit of having de-correlate your features" -- what do you mean? ZCA de-corelates the features. – amoeba May 16 '16 at 09:24
  • Regarding the benefits of ZCA over PCA for deep learning, have you read the last paragraph of my answer in the linked question? – amoeba May 16 '16 at 09:26
  • What is ZCA, what is SGD? Can you explain (as an edit to the post)? – kjetil b halvorsen May 16 '16 at 14:21
  • @kjetilbhalvorsen please see [this link](http://stats.stackexchange.com/questions/117427/what-is-the-difference-between-zca-whitening-and-pca-whitening) for an explanation of ZCA and a comparison to PCA. SGD is [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). A principal parameter in the algorithm is the learning rate, and having whitened data can greatly benefit the algorithm. – rhadar May 16 '16 at 14:55
  • Thanks, but posts should be self-explanatory, so please add this information to the post! – kjetil b halvorsen May 16 '16 at 15:14
  • @kjetilbhalvorsen per your request, I have added a link to the original post explaining ZCA. I have a hunch that most people working in deep learning know about SGD, and I want to keep the post short as well. – rhadar May 16 '16 at 15:58
  • Actually, "standardizing the dynamic range of each data feature" is NOT the reason to whiten; if the goal is to standardize each feature one can simply standardize each feature. Whitening does much more. – amoeba May 16 '16 at 16:35
  • @amoeba You are correct again: whitening standardizes the dynamic range of the feature space and not each single feature on its own. Is there any other goal I am missing? – rhadar May 17 '16 at 07:19

1 Answer


A big benefit of ZCA is that the whitened data is still a picture in the same space as the original. If you ZCA-whiten a photo of a cat, it still looks cat-like. This is helpful for other techniques that search for nonlinear structure. You're able to take an $n\times n$ patch from a picture and apply a filter to it with the belief that the pixels will exhibit certain useful dependencies by virtue of being neighbours. E.g., is there an eye in this patch? Is there fur in this patch? The same is emphatically not true of PCA, which completely disregards the spatial structure of images.

Second, contrary to your statement, ZCA does decorrelate the data. The development in Bell & Sejnowski (1997), equations 5 and 8, makes this a requirement of the technique. Take the covariance matrix $\mathbf{\Sigma}$ with eigendecomposition $\mathbf{\Sigma} = \mathbf{U}\mathbf{D}\mathbf{U}^T$ and form the whitening matrix $\mathbf{W}_z = \mathbf{U}\mathbf{D}^{-1/2}\mathbf{U}^T$. Then for a new $\mathbf{x}$ drawn from the distribution we have $\operatorname{Cov}(\mathbf{W}_z\mathbf{x}, \mathbf{W}_z\mathbf{x}) = \mathbf{W}_z\operatorname{Cov}(\mathbf{x},\mathbf{x})\mathbf{W}_z^T = \mathbf{I}$.
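For concreteness, here is a minimal NumPy sketch of the above (the helper name `zca_whiten`, the small ridge `eps`, and the toy 2-D Gaussian are my own choices, not anything from the paper): the empirical covariance of the ZCA-whitened data comes out approximately equal to the identity, and each whitened sample keeps the same shape as the original, so an image patch stays an image patch.

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """ZCA-whiten the rows of X (shape: n_samples x n_features)."""
    Xc = X - X.mean(axis=0)                   # centre the data
    Sigma = np.cov(Xc, rowvar=False)          # covariance matrix
    D, U = np.linalg.eigh(Sigma)              # Sigma = U diag(D) U^T
    W_z = U @ np.diag(1.0 / np.sqrt(D + eps)) @ U.T   # W_z = U D^{-1/2} U^T
    return Xc @ W_z.T, W_z                    # W_z is symmetric, so the .T is cosmetic

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=5000)
Xw, W_z = zca_whiten(X)

print(np.cov(Xw, rowvar=False).round(2))      # ~ identity: features are decorrelated
print(Xw.shape == X.shape)                    # True: whitened samples live in the original space
```

If you instead applied $\mathbf{D}^{-1/2}\mathbf{U}^T$ (PCA whitening), the covariance would also be the identity, but the coordinates would then be principal-component scores rather than pixel values.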

conjectures
  • I agree with your answer from the 2nd sentence onwards, but I don't understand the 1st one: what does "in the same space as the original" mean? – amoeba May 16 '16 at 09:35
  • Take a vector in $R^N$ and ZCA-whiten it. The result is a vector in $R^N$. The same is not true for PCA. While you could find vectors in $R^N$ that make bad pictures, you're going to have even more trouble if you can't even look at the vector as a picture. – conjectures May 16 '16 at 09:39
  • What do you mean it is not true in PCA? Of course it is. What *is* the result of PCA-whitening in your opinion then? – amoeba May 16 '16 at 09:39
  • Look at eq 7 of the Bell paper. If you drop any of the PCs, you don't get back a vector in the original space. – conjectures May 16 '16 at 09:50
  • Yes, sure, but if I don't drop any then I do. – amoeba May 16 '16 at 09:55
  • I read both your answer and amoeba's (including the excellent answer he gave in the first post I linked to). What I do not understand is that once you perform another rotation on the data after PCA whitening, the eigen-vectors shouldn't be orthogonal anymore. That's my main issue. – rhadar May 16 '16 at 09:56
  • I understand now. ZCA allows you to get roughly the same transform for different image patches when you evaluate the ZCA transform each time from the data under test. I would never imagine using PCA in this manner. In my original example I would expect that, when using PCA, I would save the original transform and apply it to the new patch; otherwise, there is no doubt that the transform would be invalid. So given that, I understand some of the reasoning, but I still have a problem with the feature de-correlation, which I do not understand. – rhadar May 16 '16 at 10:09
  • @user2324712 correlation is a property of the covariance matrix that can be removed (approximately, in practice) by applying a whitening matrix to the observation vectors. Because there are many ways to factorise $\Sigma = RR^T$, we have a choice over how to set a whitening matrix. – conjectures May 16 '16 at 10:16
  • OK. So it is 'locally' uncorrelated. But here is the main issue: when you evaluate the transform on the training data, there is a lot of data. This assures you that your basic feature extraction is normalized by maximum separation by variance. This is an optimal contrast stretch that assures good separation between clusters when whitening the dataset, in contrast to whitening a single image. Wouldn't using the PCA transform from the original data be superior to using ZCA derived from the image under test? – rhadar May 16 '16 at 10:26
  • 1) You're making an assumption about what would or wouldn't be good in practice. Deep learners are reputed to try everything under the sun to see if it works. I put it to you that if PCA worked better in practice, they'd be doing PCA. – conjectures May 16 '16 at 10:29
  • 2) You're also assuming that separation by variance is the best way to actually separate data. If you consider the performance of support vector machines with nonlinear kernels versus linear ones, you'll see this is not a safe assumption to make. – conjectures May 16 '16 at 10:34
  • @user2324712, to your line `perform another rotation on the data after PCA whitening, the eigen-vectors shouldn't be orthogonal anymore`: although it's not very clear to me what you call "eigenvectors" after an additional orthogonal rotation, let me recommend an [answer](http://stats.stackexchange.com/a/193023/3277) with a chart, about rotation after PCA. "Standardized PC scores" is a synonym for "PCA-whitened data". One should not mix up, when speaking of orthogonality, rotated PC data and the corresponding rotated loadings. – ttnphns May 16 '16 at 10:36
  • @user2324712: I think what you wrote after "I understand now" is false. ZCA is not derived "from the image under test". I think you are still confused. – amoeba May 16 '16 at 10:40
  • You are correct. I looked at how ZCA is implemented in Keras, for example, and I see that it can be derived from the dataset and not from a single image. My point is still the same: ZCA has the advantage that it keeps the data natural, so I would expect ZCA on the train and test datasets to yield pretty similar transforms, making your features usable across datasets. But is it optimal? Usually you have a lot less test data. Wouldn't we be able to squeeze another 1% in performance if we used the original transform? – rhadar May 16 '16 at 10:52
  • Thank you for [this answer](http://stats.stackexchange.com/questions/612/is-pca-followed-by-a-rotation-such-as-varimax-still-pca/193023#193023); I'll read it and return to you when I am wiser. – rhadar May 16 '16 at 10:53