
Scikit-Learn's implementation of Principal Component Analysis has some restrictions that depend on the chosen svd_solver (link to docs). This means that if I have a matrix of size $1000 \times 10000$ (1000 samples with 10000 dimensions each), the maximum number of retained principal components will be $1000$, even though the samples are of much higher dimensionality. The same holds true for OpenCV's implementation, even though I haven't found the restriction documented anywhere.

Is there an implementation that doesn't have this issue? I don't understand PCA well enough (yet), so if there is no alternative implementation due to mathematical restrictions, can someone explain those?

Can I work around this issue by simply repeating the data along the first axis, or would this alter the outcome of the PCA?

EDIT

I tried out the above-mentioned workaround of repeating the data with both OpenCV and Scikit-Learn. Interestingly, the retained eigenvectors are very similar when using OpenCV (although np.allclose still yields False when comparing them), which is not the case for Scikit-Learn. Here is the code I used:

In [1]: import numpy as np

In [2]: from sklearn.decomposition import PCA

In [3]: data = np.random.randn(1000, 10000)

In [4]: pca = PCA(n_components=512).fit(data)

In [5]: eigvecs1 = pca.components_

In [6]: eigvecs1.shape
Out[6]: (512, 10000)

In [7]: pca = PCA(n_components=512).fit(np.repeat(data, 2, axis=0))

In [8]: eigvecs2 = pca.components_

In [9]: eigvecs1
Out[9]:
array([[-0.0114526 , -0.00996697,  0.00557914, ..., -0.012846  ,
         0.00425294,  0.00691419],
       [-0.01449022, -0.00047784, -0.00998881, ...,  0.012241  ,
        -0.01020919, -0.01710263],
       [ 0.01272314,  0.00233765,  0.00018211, ..., -0.01238945,
        -0.01731416,  0.00253287],
       ...,
       [-0.00130884,  0.00617238,  0.00793167, ...,  0.01057314,
        -0.0007045 , -0.00240435],
       [ 0.00266372,  0.01544145, -0.01423845, ...,  0.01398243,
         0.01479688,  0.00073665],
       [-0.00920773,  0.01493651,  0.00458802, ..., -0.00557622,
         0.0120589 ,  0.00136536]])

In [10]: eigvecs2
Out[10]:
array([[-0.01423556, -0.01103274,  0.00442386, ..., -0.01262764,
         0.00506506,  0.00525846],
       [-0.01257356, -0.00062029, -0.00961628, ...,  0.01013355,
        -0.01211543, -0.01631519],
       [ 0.01252861,  0.00128116,  0.00278262, ..., -0.01210388,
        -0.01808777,  0.00620132],
       ...,
       [ 0.00043966,  0.00897022,  0.01418632, ..., -0.00396078,
         0.00484379,  0.00381486],
       [ 0.01256598, -0.00470218,  0.0174601 , ..., -0.00338207,
         0.00441305,  0.01918609],
       [ 0.00619724, -0.00571119,  0.01597917, ...,  0.00635742,
        -0.00689069,  0.0040474 ]])

In [11]: _mean = np.empty((0))

In [13]: import cv2

In [14]: _, eigvecs1, _ = cv2.PCACompute2(data, _mean, maxComponents=512)

In [15]: _, eigvecs2, _ = cv2.PCACompute2(np.repeat(data, 2, axis=0), _mean, maxComponents=512)

In [16]: eigvecs1
Out[16]:
array([[-0.01449062, -0.01040097,  0.00533161, ..., -0.01245524,
         0.00511147,  0.00628998],
       [-0.01323496, -0.00148277, -0.00934135, ...,  0.01211122,
        -0.01002037, -0.01698735],
       [ 0.01282049,  0.002607  ,  0.00122367, ..., -0.01268594,
        -0.01740706,  0.00474433],
       ...,
       [-0.00427259,  0.00604704, -0.00435322, ...,  0.00811409,
         0.01581502,  0.00690567],
       [-0.00293716,  0.00059952,  0.01286799, ..., -0.01280625,
        -0.0195555 ,  0.0114952 ],
       [ 0.015998  , -0.00172532, -0.01849722, ...,  0.00819871,
        -0.0029199 ,  0.01329176]])

In [17]: eigvecs2
Out[17]:
array([[-0.01449062, -0.01040097,  0.00533161, ..., -0.01245524,
         0.00511147,  0.00628998],
       [-0.01323496, -0.00148277, -0.00934135, ...,  0.01211122,
        -0.01002037, -0.01698735],
       [ 0.01282049,  0.002607  ,  0.00122367, ..., -0.01268594,
        -0.01740706,  0.00474433],
       ...,
       [-0.00427259,  0.00604704, -0.00435322, ...,  0.00811409,
         0.01581502,  0.00690567],
       [-0.00293716,  0.00059952,  0.01286799, ..., -0.01280625,
        -0.0195555 ,  0.0114952 ],
       [-0.015998  ,  0.00172532,  0.01849722, ..., -0.00819871,
         0.0029199 , -0.01329176]])

In [18]: np.allclose(eigvecs1, eigvecs2)
Out[18]: False

1 Answer


The rank of a matrix is defined as the dimension of its column space. Therefore, an $n \times n$ matrix has, at most, rank $n$. If you perform an SVD of an $n \times n$ matrix, you'll have at most $n$ positive singular values. So the observation of having at most $n$ PCs is a property of matrices.

If you instead mean that you have an $n \times m$ matrix with $n \le m$, then the rank of the matrix is at most $\min\{n,m\}$. Again, this is a property of linear algebra, not a limitation of any software package or implementation.
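As a quick numerical illustration (a sketch with made-up, smaller dimensions than in the question, assuming NumPy), an $n \times m$ matrix with $n \le m$ produces at most $\min\{n, m\}$ singular values, and mean-centering typically drops the rank to $n - 1$:

import numpy as np

n, m = 100, 700                              # stand-in for 1000 x 10000
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m))

Xc = X - X.mean(axis=0)                      # center, as PCA does
s = np.linalg.svd(Xc, compute_uv=False)      # singular values

print(s.shape)                               # (100,) -> at most min(n, m)
print(np.linalg.matrix_rank(Xc))             # 99: centering removes one dimension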

The code snippet is addressed in some detail in "PCA: Eigenvectors of opposite sign and not being able to compute eigenvectors with `solve` in R".

To summarize, the characteristic equation can be rescaled by a positive or negative number, yielding any number of eigenvectors of different lengths and/or opposite directions. It's typical to choose eigenvectors of unit length, so that resolves the "length" ambiguity. However, the sign ambiguity remains; that is, if $x$ is an eigenvector to some matrix, then $-x$ is also an eigenvector of that matrix.
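So a comparison like np.allclose can report False even when the two sets of principal axes agree up to sign. One way to check agreement while ignoring the sign ambiguity is something along these lines (a sketch; the helper name and tolerance are my own, and it assumes both inputs are arrays shaped like pca.components_):

import numpy as np

def allclose_up_to_sign(A, B, atol=1e-8):
    # Flip each row of B to whichever sign best aligns it with the
    # corresponding row of A, then compare as usual.
    if A.shape != B.shape:
        return False
    signs = np.sign(np.sum(A * B, axis=1, keepdims=True))
    signs[signs == 0] = 1.0
    return np.allclose(A, signs * B, atol=atol)

With the OpenCV results printed above, where the only visible difference is a flipped sign in the last component, a sign-invariant check like this should succeed even though plain np.allclose returns False.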

In light of this, it's not clear to me what the "repeat data" approach is intended to reveal.

  • It seems like the covariance matrix for $1000$ observations of $10000$ variables would be $10000\times 10000$, so we should be able to get $10000$ eigenvectors and eigenvalues. That there are only $1000$ observations seems irrelevant once we get the covariance matrix, I would think. – Dave Dec 04 '20 at 00:11
  • @Dave One might think that, but it turns out that the rank of a matrix is still the dimension of the column space. The column space of a matrix is *at most* the number of columns, not exactly the number of columns. Constructing the covariance matrix does not increase the rank of the data matrix. – Sycorax Dec 04 '20 at 00:26
  • Thanks for the answer. I edited the question to include some code where I tried out the "repeat the data" approach. Can you comment on this idea and the findings above? – Tim Hilt Dec 04 '20 at 00:32
  • I probably knew at one point why this happens, but I just surprised the heck out of myself by calculating the covariance matrix of $5$ observations of $7$ variables and finding that it is singular. – Dave Dec 04 '20 at 00:33
  • @TimHilt I've expanded my answer. – Sycorax Dec 04 '20 at 00:45
  • Thanks for that @Sycorax. To shed light on the last paragraph: Say I have a matrix of size $1000 \times 10000$, where each row represents a sample. If I tried to reduce each sample to a dimensionality of $1024$, I couldn't use PCA. Whereas if I repeat the matrix along the row dimension (to obtain a matrix of size $2000 \times 10000$), I could reduce it to $1024$ elements. – Tim Hilt Dec 04 '20 at 07:48
  • The repeated rows don't increase the rank of the matrix, so the corresponding vectors add zero variance explained. In other words, you don't have any new information. The only way to get data about the additional PCs is to collect more data. (A quick rank check illustrating this is sketched below.) – Sycorax Dec 04 '20 at 14:12
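
To make the rank argument in these comments concrete, here is a small sketch (made-up dimensions, NumPy assumed) showing that repeating rows leaves the rank unchanged and that the covariance matrix of fewer observations than variables is singular:

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((50, 300))        # stand-in for 1000 x 10000
repeated = np.repeat(data, 2, axis=0)        # 100 x 300, but no new information

print(np.linalg.matrix_rank(data))           # 50
print(np.linalg.matrix_rank(repeated))       # still 50

cov = np.cov(data, rowvar=False)             # 300 x 300 covariance matrix
print(np.linalg.matrix_rank(cov))            # 49: singular, rank at most n - 1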