I have a basic understanding of PCA and ZCA, and I have read a similar question on the subject which, unfortunately, does not answer my specific question.
I understand the benefits of data whitening, specifically standardizing the dynamic range of each feature, which is very important when using stochastic gradient descent. What I fail to understand is why one would opt for ZCA and forgo the benefit of having decorrelated features.
I understand that the result is more appealing to the human eye, but aren't we making it harder for the learning algorithm to generalize from the data?
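
For concreteness, here is a minimal sketch of the two transforms I am comparing. It assumes NumPy and the usual textbook definitions (PCA whitening as `diag(1/sqrt(eigvals)) @ U.T`, ZCA whitening as `U @ diag(1/sqrt(eigvals)) @ U.T`). The function name `whitening_matrices`, the `eps` regularizer, and the toy data are my own choices for illustration, not taken from any particular library.

```python
import numpy as np

def whitening_matrices(X, eps=1e-5):
    """Return (W_pca, W_zca) for data X of shape (n_samples, n_features).

    Hypothetical helper for illustration; eps is an assumed small
    regularizer that guards against division by near-zero eigenvalues.
    """
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)      # sample covariance matrix
    eigvals, U = np.linalg.eigh(cov)         # cov = U @ diag(eigvals) @ U.T
    scale = np.diag(1.0 / np.sqrt(eigvals + eps))
    W_pca = scale @ U.T                      # rotate into the eigenbasis, then rescale
    W_zca = U @ W_pca                        # same, then rotate back to the input axes
    return W_pca, W_zca

# Toy correlated data (made up for this example).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(1000, 3)) @ mixing

W_pca, W_zca = whitening_matrices(X)
Xc = X - X.mean(axis=0)
X_pca = Xc @ W_pca.T                         # PCA-whitened samples
X_zca = Xc @ W_zca.T                         # ZCA-whitened samples

# Inspect the covariance of each whitened data set.
print(np.round(np.cov(X_pca, rowvar=False), 3))
print(np.round(np.cov(X_zca, rowvar=False), 3))
```

In this sketch the only difference between the two matrices is the final multiplication by `U`, which rotates the whitened data back toward the original feature axes.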