Scikit-Learn's implementation of Principal Component Analysis has some restrictions that depend on the svd_solver (link to docs). This means that if I have a matrix of size $1000 \times 10000$ (1000 samples of 10000 dimensions each), the maximum number of retained principal components will be $1000$, even though the samples are of much higher dimensionality. The same holds true for OpenCV's implementation, although I haven't found the restriction documented anywhere.
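For illustration, this is roughly the restriction I mean (a minimal sketch; the exact error message may vary between versions):

import numpy as np
from sklearn.decomposition import PCA

data = np.random.randn(1000, 10000)   # 1000 samples, 10000 dimensions
# n_components can be at most min(n_samples, n_features) = 1000 here, so asking
# for more than that raises a ValueError regardless of the feature dimensionality.
PCA(n_components=2000).fit(data)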
Is there an implementation that doesn't have this issue? I don't understand PCA well enough (yet), so if there is no alternative implementation because of mathematical restrictions, can someone explain those restrictions?
Can I work around this issue by simply repeating the data along the first axis, or would that alter the outcome of the PCA?
EDIT
I tried out the above-mentioned workaround of repeating the data for both OpenCV and Scikit-Learn. Interestingly, the retained eigenvectors are very similar when using OpenCV (although np.allclose still yields False when comparing them), which is not the case for Scikit-Learn. Here is the code I used:
In [1]: import numpy as np
In [2]: from sklearn.decomposition import PCA
In [3]: data = np.random.randn(1000, 10000)
In [4]: pca = PCA(n_components=512).fit(data)
In [5]: eigvecs1 = pca.components_
In [6]: eigvecs1.shape
Out[6]: (512, 10000)
In [7]: pca = PCA(n_components=512).fit(np.repeat(data, 2, axis=0))
In [8]: eigvecs2 = pca.components_
In [9]: eigvecs1
Out[9]:
array([[-0.0114526 , -0.00996697, 0.00557914, ..., -0.012846 ,
0.00425294, 0.00691419],
[-0.01449022, -0.00047784, -0.00998881, ..., 0.012241 ,
-0.01020919, -0.01710263],
[ 0.01272314, 0.00233765, 0.00018211, ..., -0.01238945,
-0.01731416, 0.00253287],
...,
[-0.00130884, 0.00617238, 0.00793167, ..., 0.01057314,
-0.0007045 , -0.00240435],
[ 0.00266372, 0.01544145, -0.01423845, ..., 0.01398243,
0.01479688, 0.00073665],
[-0.00920773, 0.01493651, 0.00458802, ..., -0.00557622,
0.0120589 , 0.00136536]])
In [10]: eigvecs2
Out[10]:
array([[-0.01423556, -0.01103274, 0.00442386, ..., -0.01262764,
0.00506506, 0.00525846],
[-0.01257356, -0.00062029, -0.00961628, ..., 0.01013355,
-0.01211543, -0.01631519],
[ 0.01252861, 0.00128116, 0.00278262, ..., -0.01210388,
-0.01808777, 0.00620132],
...,
[ 0.00043966, 0.00897022, 0.01418632, ..., -0.00396078,
0.00484379, 0.00381486],
[ 0.01256598, -0.00470218, 0.0174601 , ..., -0.00338207,
0.00441305, 0.01918609],
[ 0.00619724, -0.00571119, 0.01597917, ..., 0.00635742,
-0.00689069, 0.0040474 ]])
In [11]: _mean = np.empty((0))
In [13]: import cv2
In [14]: _, eigvecs1, _ = cv2.PCACompute2(data, _mean, maxComponents=512)
In [15]: _, eigvecs2, _ = cv2.PCACompute2(np.repeat(data, 2, axis=0), _mean, maxComponents=512)
In [16]: eigvecs1
Out[16]:
array([[-0.01449062, -0.01040097, 0.00533161, ..., -0.01245524,
0.00511147, 0.00628998],
[-0.01323496, -0.00148277, -0.00934135, ..., 0.01211122,
-0.01002037, -0.01698735],
[ 0.01282049, 0.002607 , 0.00122367, ..., -0.01268594,
-0.01740706, 0.00474433],
...,
[-0.00427259, 0.00604704, -0.00435322, ..., 0.00811409,
0.01581502, 0.00690567],
[-0.00293716, 0.00059952, 0.01286799, ..., -0.01280625,
-0.0195555 , 0.0114952 ],
[ 0.015998 , -0.00172532, -0.01849722, ..., 0.00819871,
-0.0029199 , 0.01329176]])
In [17]: eigvecs2
Out[17]:
array([[-0.01449062, -0.01040097, 0.00533161, ..., -0.01245524,
0.00511147, 0.00628998],
[-0.01323496, -0.00148277, -0.00934135, ..., 0.01211122,
-0.01002037, -0.01698735],
[ 0.01282049, 0.002607 , 0.00122367, ..., -0.01268594,
-0.01740706, 0.00474433],
...,
[-0.00427259, 0.00604704, -0.00435322, ..., 0.00811409,
0.01581502, 0.00690567],
[-0.00293716, 0.00059952, 0.01286799, ..., -0.01280625,
-0.0195555 , 0.0114952 ],
[-0.015998 , 0.00172532, 0.01849722, ..., -0.00819871,
0.0029199 , -0.01329176]])
In [18]: np.allclose(eigvecs1, eigvecs2)
Out[18]: False
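For what it's worth, since principal components are only determined up to a sign flip, a more forgiving comparison might look like the following sketch (I haven't verified whether it actually returns True for the arrays above):

# Flip each row of eigvecs2 so it has a positive dot product with the
# corresponding row of eigvecs1, then compare the sign-aligned arrays.
signs = np.sign(np.sum(eigvecs1 * eigvecs2, axis=1, keepdims=True))
print(np.allclose(eigvecs1, signs * eigvecs2, atol=1e-6))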