
I have a population $\mathcal{X}$ of $N$ samples drawn from a multivariate Gaussian random variable $\mathbf{x} \in \mathbb{R}^d$. Let us define a transformation $f_{d\rightarrow r} (\mathbf{x}) = \mathbf{x'}$ which performs a dimensionality reduction to a space $\mathbb{R}^r$ with $r < d$, and a "reverse" transformation $g_{r \rightarrow d}(\mathbf{x'}) = \mathbf{\tilde x}$ (as pointed out in the comments, not an actual inverse) that projects the data back into the original space. I also define PCA$_{d\rightarrow r}(\mathbf{x})$ as the projection onto the span of the top $r$ eigenvectors of the (sample-estimated) covariance matrix, with corresponding reverse transformation PCA$^{-1}_{r \rightarrow d}(\mathbf{x'})$. Is PCA$_{d\rightarrow r}$ then optimal, in terms of average reconstruction MSE, in the space of (possibly nonlinear) dimensionality reduction functions? I.e.,

$$ \operatorname*{argmin}_{f_{d\rightarrow r},\,g_{r\rightarrow d}} \frac{1}{N} \sum_{x_i \in \mathcal{X}} \left\| g_{r\rightarrow d}\left(f_{d\rightarrow r}(x_i)\right) - x_i \right\|^2 \overset{?}{=} \left( \text{PCA}_{d\rightarrow r},\ \text{PCA}^{-1}_{r \rightarrow d} \right) $$

While I realize this is not true for a generic random variable, I have an intuition that it could be the case when the input data are normally distributed, since the first and second moments carry all the information needed to compress the data. I am also aware that PCA is optimal in the space of linear projections, but I am interested in the wider space of nonlinear transformations (e.g. autoencoders). However, I could not find an explicit answer to this question anywhere.
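For concreteness, here is a minimal NumPy sketch of the quantity I am minimizing (the toy covariance and all names are mine). It also checks the standard identity that the per-sample PCA reconstruction MSE equals the sum of the discarded eigenvalues of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, N = 10, 3, 5000

# N samples from a zero-mean Gaussian with a random covariance A A^T.
A = rng.standard_normal((d, d))
X = rng.standard_normal((N, d)) @ A.T

Xc = X - X.mean(axis=0)                # center the data
cov = Xc.T @ Xc / N                    # sample covariance (1/N convention)
eigval, eigvec = np.linalg.eigh(cov)   # eigenvalues in ascending order
W = eigvec[:, -r:]                     # top-r eigenvectors

Z = Xc @ W                             # PCA_{d->r}: project
X_rec = Z @ W.T                        # PCA^{-1}_{r->d}: reconstruct
mse = np.mean(np.sum((Xc - X_rec) ** 2, axis=1))

# The two numbers agree up to floating-point error.
print(mse, eigval[:-r].sum())
```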

EDIT

It will probably be easier to give some context on the practical problem that made me ask this question. I am working with autoencoders for dimensionality reduction, and I noticed that for some kinds of data the reconstruction error (even on the training set) never goes below the reconstruction error of PCA with the same embedding dimension, up to negligible fluctuations. This happens regardless of how I change the architecture or hyperparameters; see the sketch below for the kind of comparison I mean. I began suspecting that it was due to the data having only linear correlations (similar to a multivariate Gaussian distributed variable), for which there could be some fundamental limit of compressibility. In other words, is the PCA reconstruction error an actual lower bound when reducing data from $d$ to $r$ components? Any hint on this would be appreciated.
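A minimal sketch of the comparison, assuming PyTorch (the architecture, optimizer, and step count are illustrative placeholders, not my exact setup):

```python
import torch

torch.manual_seed(0)
d, r, N = 10, 3, 5000

# Zero-mean Gaussian data with a random covariance, as in the question.
A = torch.randn(d, d)
X = torch.randn(N, d) @ A.T
X = X - X.mean(dim=0)

model = torch.nn.Sequential(
    torch.nn.Linear(d, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, r),                    # encoder f_{d->r}
    torch.nn.Linear(r, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, d),                    # decoder g_{r->d}
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(2000):
    opt.zero_grad()
    loss = ((model(X) - X) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()

# PCA baseline: sum of the d - r smallest covariance eigenvalues.
cov = X.T @ X / N
pca_mse = torch.linalg.eigvalsh(cov)[: d - r].sum()
print(loss.item(), pca_mse.item())
```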

David Shor
  • No, PCA is linear. It is optimal among linear transformations, and least-squares optimization is one way to estimate PCA. – Aksakal Oct 19 '21 at 02:40
  • Yes, but in this case I'm specifically interested in performance as measured by average MSE. So to rephrase: what is the best possible compression transformation at $r$ dimensions, in terms of average MSE, given normally distributed input data with unknown covariance? I feel that normality of the data would be a sufficient constraint for a closed-form solution, but I haven't been able to find one yet. – David Shor Oct 19 '21 at 15:13
  • Covariance itself is a linear measure and as such has a strong connection to the Gaussian. I don't think it's possible to find the best compression in the general case. When you add constraints such as linearity, you get things like PCA. – Aksakal Oct 19 '21 at 15:16
  • How is $f_r^{-1}$ defined? – Christian Hennig Oct 19 '21 at 17:23
  • @Aksakal I see, thanks. – David Shor Oct 23 '21 at 21:53
  • @ChristianHennig I am not sure how to put this in a well-posed form, but I was assuming that $f_r(x)$ is a bijective function from the $d$-dimensional space to the $r$-dimensional space, so it would have an inverse. It will probably be easier to explain the practical problem that generated this question; I am updating the question to ask this, thanks for the help. – David Shor Oct 23 '21 at 21:57
  • @DavidShor But if $f_r$ is bijective and has an inverse, $f_r^{-1}(f_r(x_i))=x_i$ and your loss would be zero, no? – Christian Hennig Oct 23 '21 at 22:32
  • You should replace $f_r^{-1}$ by another function $g$, since $f_r$ will not be invertible. Then we have $g(x')=\mathbb E[x|f_r(x)=x']$ in order to minimize MSE. Then we should find $f_r$ that minimizes the MSE functional and check whether it gives us PCA. I wouldn't know how to solve this though, maybe some variational calculus trick? – PedroSebe Oct 23 '21 at 23:30
  • @ChristianHennig you are right, sorry, it doesn't make sense to define it as inverse of $f_r$. I replaced it with another function and clarified the notation a bit. – David Shor Oct 30 '21 at 20:06
  • @PedroSebe you are also right, thanks, I updated the notation. – David Shor Oct 30 '21 at 20:06

0 Answers